Alert Notifications
- Last Updated 3/9/2021, 11:49:41 AM UTC
# Challenges
# Alert notification multiplicity
How many distinct notifications should the system generate in response to a sequence of alerts for the same event? The answer depends on the context of the alert. For example, when monitoring disk utilization, an alert is generated once a utilization threshold is exceeded, and the same alert instance remains in effect until utilization drops back below the threshold. The system should therefore not issue a new notification for subsequent observations once the first notification has been emitted. On the other hand, when monitoring something like application status, we might want to treat each time the application goes down as a separate notification event.
# Alert notification flapping
Often, an observation fluctuates above and below an alerting threshold. The system needs to identify this flapping behavior so that it does not overwhelm you with false or repeated notifications.
# Alert Clearing
How does the system know if an alert is no longer firing?
- Metric observations are no longer tagged as alerts. For example, a `/myrmex/system/disk/fs/mount` observation is below the alert threshold. This does not apply to all types of observations; for example, it does not apply to an observation that is emitted only when an entry is found in a log file.
- Timeout: an alert for a specific observation does not arrive for a specified amount of time.
- External: an external actor explicitly tells the system that an alert is no longer active.
# Notification Throttling Configuration
# Hold Duration
Helps decide whether this is really an alert before sending a notification. The system accumulates observations over the hold period and, at the end of the period, decides whether to send a notification based on the ratio of observations tagged as alerts to the total number of observations (the trigger ratio). A ratio value of 1 means that as soon as a non-alert observation is seen, the notification is suppressed. A ratio value of 0 means that one or more alert-tagged observations are enough to send a notification.
Example #1:
- Trigger Ratio: 0.5
- #alert tags: 3
- #no alert tags: 1
- 3/4 >= 0.5 => notify
Example #2:
- Trigger Ratio: 1
- As soon as an observation with no alert tag is received the notification is suppressed.
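The decision taken at the end of the hold period can be sketched as follows. This is a minimal illustration that assumes the ratio is computed as alert-tagged observations over all observations, as in Example #1; the function name `shouldNotify` is hypothetical and not part of the product.

```go
package main

import "fmt"

// shouldNotify reports whether a notification should be sent once the hold
// period ends. Hypothetical helper: it assumes the ratio is tagged/total.
func shouldNotify(alertTagged, notTagged int, triggerRatio float64) bool {
	if alertTagged == 0 {
		return false // nothing was tagged as an alert during the hold period
	}
	total := alertTagged + notTagged
	return float64(alertTagged)/float64(total) >= triggerRatio
}

func main() {
	fmt.Println(shouldNotify(3, 1, 0.5)) // Example #1: 3/4 >= 0.5
	fmt.Println(shouldNotify(3, 1, 1))   // Example #2: one untagged observation suppresses
	fmt.Println(shouldNotify(1, 9, 0))   // ratio 0: a single tagged observation is enough
}
```

With a trigger ratio of 0 the check reduces to "at least one alert-tagged observation", which matches the zero-value behavior listed further down.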
# Expires Duration
Determines the time span of an alert notification, which reduces notification multiplicity. Once a notification has been issued (the hold duration has been exceeded), the alert enters the `active` state. The `active` state is maintained until the configured `expires` duration has elapsed:
- If an observation tagged as alert is received during this period, the `expires` duration is extended by the configured amount of time. No new notification is sent unless a configurable amount of time (the re-notification interval) has elapsed since the last notification.
- At the end of the `expires` duration, the alert is considered no longer active. New observations tagged as alert enter the holding state described above and are treated as new alert notifications.
# Re-notification Interval
How long to wait before sending a re-notification for the same alert while it is in the `active` state.
# Defaults
- Hold Duration: 2m
- Expires Duration: 5m
- Renotification Interval: 10m
- Trigger Ratio: 1
# Zero Values
- Hold Duration: notify immediately
- Expires Duration: treat all alerts as new
- Renotification Interval: renotify immediately
- Trigger Ratio: at least one previous alert
# Example Configurations
# Example #1
- Hold Duration: 1m
- Trigger Ratio: 0.5
- Expires Duration: 30m
- Renotification Interval: 10m

Time | Alert Tag | Notification | Timeout | State |
---|---|---|---|---|
00:00:00 | Yes | No | N/A | Hold |
00:00:10 | Yes | No | N/A | Hold |
00:00:20 | No | No | N/A | Hold |
00:00:30 | Yes | No | N/A | Hold |
00:00:40 | No | No | N/A | Hold |
00:00:50 | Yes | No | N/A | Hold |
00:01:00 | Yes | Yes | 00:31:00 | Active |
00:01:10 | Yes | No | 00:31:10 | Active |
00:01:20 | Yes | No | 00:31:20 | Active |
00:01:30 | No | No | 00:31:20 | Active |
00:01:40 | No | No | 00:31:20 | Active |
... | ... | ... | ... | ... |
00:11:00 | Yes | Yes | 00:41:00 | Active |
... | ... | ... | ... | ... |
00:41:00 | No | No | N/A | N/A |
... | ... | ... | ... | ... |
01:00:00 | Yes | No | N/A | Hold |
01:00:10 | Yes | No | N/A | Hold |
01:00:20 | No | No | N/A | Hold |
01:00:30 | Yes | No | N/A | Hold |
01:00:40 | No | No | N/A | Hold |
01:00:50 | Yes | No | N/A | Hold |
01:01:00 | Yes | Yes | 01:31:00 | Active |
01:01:10 | Yes | No | 01:31:10 | Active |
01:01:20 | Yes | No | 01:31:20 | Active |
01:01:30 | No | No | 01:31:20 | Active |
01:01:40 | No | No | 01:31:20 | Active |
... | ... | ... | ... | ... |
# Example #2
- Hold Duration: 0m
- Trigger Ratio: n/a
- Expires Duration: 30m
- Renotification Interval: 10m

Time | Alert Tag | Notification | Timeout | State |
---|---|---|---|---|
00:00:00 | Yes | Yes | 00:30:00 | Active |
00:00:10 | Yes | No | 00:30:10 | Active |
00:00:20 | No | No | 00:30:10 | Active |
00:00:30 | Yes | No | 00:30:30 | Active |
00:00:40 | No | No | 00:30:30 | Active |
00:00:50 | Yes | No | 00:30:50 | Active |
00:01:00 | Yes | No | 00:31:00 | Active |
00:01:10 | Yes | No | 00:31:10 | Active |
00:01:20 | Yes | No | 00:31:20 | Active |
00:01:30 | No | No | 00:31:20 | Active |
00:01:40 | No | No | 00:31:20 | Active |
... | ... | ... | ... | ... |
00:10:00 | Yes | Yes | 00:40:00 | Active |
... | ... | ... | ... | ... |
00:40:00 | No | No | N/A | N/A |
# Unique identification for alerts
Alerts are uniquely identified by the following properties:
- the metric namespace, e.g. `/myrmex/system/disk/fs/mount`
- the metric source, e.g. `myhost.mydomain.com`
- the metric dimensions, e.g. `['/user', '/dev/sda1']`
- the timestamp that the alert was initiated
When sending an alert notification to a 3rd party system we use these IDs to uniquely identify the alert so that the system can expose integration endpoints which allow actions such as:
- Confirmation of receipt
- Workflow milestones from the 3rd party system, for example:
  - someone is working on this alert, so do not notify again
  - the alert has been resolved, so clear its state
- Send re-notifications
- Send cancellations when the alert is no longer firing
- Fetch alert details
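As an illustration of how the four identifying properties could be combined into a single identifier, here is a minimal sketch. The `AlertKey` type, the separator and the use of a truncated SHA-256 hash are all assumptions made for this example; the actual format of the alert ID is not specified here.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// AlertKey holds the four identifying properties listed above.
type AlertKey struct {
	Namespace  string
	Source     string
	Dimensions []string
	StartedAt  int64 // unix seconds at which the alert was initiated
}

// ID derives a stable identifier from the key. Hashing is an assumption of
// this sketch, chosen so the ID is short and safe to pass to other systems.
func (k AlertKey) ID() string {
	parts := []string{
		k.Namespace,
		k.Source,
		strings.Join(k.Dimensions, ","),
		fmt.Sprint(k.StartedAt),
	}
	sum := sha256.Sum256([]byte(strings.Join(parts, "\x00")))
	return hex.EncodeToString(sum[:8])
}

func main() {
	key := AlertKey{
		Namespace:  "/myrmex/system/disk/fs/mount",
		Source:     "myhost.mydomain.com",
		Dimensions: []string{"/user", "/dev/sda1"},
		StartedAt:  1615290581,
	}
	fmt.Println(key.ID())
}
```

Because the start timestamp is part of the key, an alert that expires and fires again later gets a different identifier than the earlier instance.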
# Notification templates
Notification templates must be stored in your `git` repo under the path `<assets root>/notifications/templates/`. They can be organized in any directory structure under this path. The Arisant distribution repo https://github.com/arisant/myrmex-dist.git contains default templates for email and Jira notifications. The template file format is based on `golang` text templates. For full template specification details, see Golang Templates.
The following variables are available for use inside your templates:
Name | Type | Description |
---|---|---|
Name | string | name of alert |
AlertID | string | unique identifier for the alert |
Metric | string | fully qualified name of metric that raised alert |
Source | string | the system that generated the alert. e.g. hostname |
Time | int64 | unix time (UTC seconds since 1/1/1970) that the alert was raised |
RuleExpr | string | rule expression that triggered the alert |
MetricDimensions | list | list of metric dimension Name/Value pairs |
MetricReadings | list | list of metric reading Name/Value pairs |
AgentHostname | string | hostname of the agent that generated this alert |
AgentLabel | string | label that was assigned to the agent that generated the alert |
Tenant | string | tenant name assigned to agent |
MetricTags | list | metric tags as list Name/Value tuples |
The following functions are available for use inside your templates:
Name | Arguments | Description |
---|---|---|
Env | var, default | returns the value of environment variable var or default if not found |
TagValue | alert, tagName, default | returns the value of a tag from the alert data tags KV array. If the tag name is not found, default is returned |
FmtUnixTime | timestamp | formats a unix time into RFC3339 |
DimAlias | alert, dimName, dimValue | returns the alias for dimension dimName of data.Metric with value dimValue or "" if no alias exists |
DimAliasWithValue | alert, dimName, dimValue, format | formats dim alias and dim value with format if the alias exists. Otherwise returns the dimension value. format is called with 2 arguments, the first is the alias and the second is the value. The format string specification must use golang fmt package verbs |
WhiteList | list, wl ...string | returns a KV array with the entries of list (MetricDimensions, MetricReadings or MetricTags) whose keys are contained in wl |
BlackList | list, bl ...string | returns a KV array with the entries of list (MetricDimensions, MetricReadings or MetricTags) whose keys are not contained in bl |
Slack | string | Escapes slack special characters |
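To show how functions like these plug into Go text templates, the sketch below registers stand-ins for `Env`, `FmtUnixTime`, `WhiteList` and `BlackList` in a `template.FuncMap`. The implementations and the `KV` type are assumptions written from the descriptions in the table, not the product's code; only the `text/template` calls are the real library API.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"text/template"
	"time"
)

// KV is a Name/Value pair, the shape assumed here for dimensions, readings and tags.
type KV struct{ Name, Value string }

// env returns the value of the environment variable name, or def when it is not set.
func env(name, def string) string {
	if v, ok := os.LookupEnv(name); ok {
		return v
	}
	return def
}

// fmtUnixTime formats unix seconds as an RFC3339 timestamp in UTC.
func fmtUnixTime(ts int64) string {
	return time.Unix(ts, 0).UTC().Format(time.RFC3339)
}

// filter keeps the pairs whose membership in keys equals keep.
func filter(list []KV, keys []string, keep bool) []KV {
	set := make(map[string]bool, len(keys))
	for _, k := range keys {
		set[k] = true
	}
	var out []KV
	for _, kv := range list {
		if set[kv.Name] == keep {
			out = append(out, kv)
		}
	}
	return out
}

var funcs = template.FuncMap{
	"Env":         env,
	"FmtUnixTime": fmtUnixTime,
	"WhiteList":   func(l []KV, keys ...string) []KV { return filter(l, keys, true) },
	"BlackList":   func(l []KV, keys ...string) []KV { return filter(l, keys, false) },
}

// render parses text with the helper functions registered and executes it.
func render(text string, data interface{}) (string, error) {
	t, err := template.New("sketch").Funcs(funcs).Parse(text)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	err = t.Execute(&sb, data)
	return sb.String(), err
}

func main() {
	data := struct {
		Time             int64
		MetricDimensions []KV
	}{0, []KV{{"mount", "/user"}, {"sensitive", "x"}}}
	out, err := render(`{{FmtUnixTime .Time}}{{range BlackList .MetricDimensions "sensitive"}} {{.Name}}={{.Value}}{{end}}`, data)
	fmt.Println(out, err)
}
```

Functions must be registered with `Funcs` before `Parse` is called, otherwise parsing fails on the unknown function names.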
# DimAlias Examples
{{$data := .}}
{{range .MetricDimensions}}
// print ".Name: .Value (alias)"
// - given dim name = "url", dim value = "www.some.com/very/long/url?with_params=param" and dim alias = "Super API", it will print:
// url: www.some.com/very/long/url?with_params=param (Super API)
//
// - given dim name = "url", dim value = "www.some.com/very/long/url?with_params=other" and no dim alias defined, it will print:
// url: www.some.com/very/long/url?with_params=other ()
- {{.Name}}: {{.Value}} ({{DimAlias $data .Name .Value}})
// print ".Name: alias" if alias exists, otherwise ".Name: .Value"
// - given dim name = "url", dim value = "www.some.com/very/long/url?with_params=param" and dim alias = "Super API", it will print:
// url: Super API
//
// - given dim name = "url", dim value = "www.some.com/very/long/url?with_params=other" and no dim alias defined, it will print:
// url: www.some.com/very/long/url?with_params=other
- {{$a := DimAlias $data .Name .Value}}{{.Name}}: {{or $a .Value}}
// print ".Name: alias (.Value)" if dim alias exists, otherwise ".Name: .Value"
// - given dim name = "url", dim value = "www.some.com/very/long/url?with_params=param" and dim alias = "Super API", it will print:
// url: Super API (www.some.com/very/long/url?with_params=param)
//
// - given dim name = "url", dim value = "www.some.com/very/long/url?with_params=other" and no dim alias defined, it will print:
// url: www.some.com/very/long/url?with_params=other
- {{$a := DimAlias $data .Name .Value}}{{.Name}}: {{or (and $a (printf "%s (%s)" $a .Value)) .Value}}
{{end}}
# DimAliasWithValue Examples
{{$data := .}}
{{range .MetricDimensions}}
// - given dim name = "url", dim value = "www.some.com/very/long/url?with_params=param" and dim alias = "Super API", it will print:
// url: Super API (www.some.com/very/long/url?with_params=param)
//
// - given dim name = "url", dim value = "www.some.com/very/long/url?with_params=other" and no dim alias defined, it will print:
// url: www.some.com/very/long/url?with_params=other
- {{.Name}}: {{DimAliasWithValue $data .Name .Value "%s (%s)"}}
{{end}}
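The behavior of `DimAlias` and `DimAliasWithValue` shown in these examples can be approximated with the following sketch. The `aliases` lookup table is hypothetical: how aliases are actually stored and resolved from the alert data is not described here, so a plain map keyed by dimension name and value stands in for it.

```go
package main

import "fmt"

// aliases is a hypothetical lookup table keyed by {dimension name, dimension value}.
type aliases map[[2]string]string

// dimAlias returns the alias for the dimension, or "" when none is defined.
func (a aliases) dimAlias(name, value string) string {
	return a[[2]string{name, value}]
}

// dimAliasWithValue formats alias and value with format when an alias exists,
// and falls back to the plain dimension value otherwise.
func (a aliases) dimAliasWithValue(name, value, format string) string {
	if alias := a.dimAlias(name, value); alias != "" {
		return fmt.Sprintf(format, alias, value)
	}
	return value
}

func main() {
	known := "www.some.com/very/long/url?with_params=param"
	other := "www.some.com/very/long/url?with_params=other"
	a := aliases{{"url", known}: "Super API"}

	fmt.Println("url:", a.dimAliasWithValue("url", known, "%s (%s)"))
	fmt.Println("url:", a.dimAliasWithValue("url", other, "%s (%s)"))
}
```

The format string receives the alias first and the value second, so `"%s (%s)"` yields the alias followed by the value in parentheses.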
# Jira
Your template file should contain the following templates corresponding to the various Jira ticket fields or enumerators:
Name | Required | Purpose |
---|---|---|
summary | Yes | text to appear in summary field |
description | Yes | text to appear in description field |
type | Yes | issue type the issue will be opened as |
priority | Yes | priority the issue will be opened as |
reopen-comment | Yes | the comment to add when re-opening an existing issue |
renotify-comment | Yes | the comment to add when renotifying on an existing issue |
labels | Yes | the labels to assign when opening an issue |
id | No | the custom field name where the alert unique id will be stored. This will be used to check for closed issues that match the alert id |
Example
{{define "type"}}Bug{{end}}
{{define "id"}}customfield_12038{{end}}
{{define "priority"}}{{TagValue . "severity" "Medium - Sev 3"}}{{end}}
{{define "labels"}}DB, ETL, mrx{{end}}
{{define "summary"}}{{.Name}}{{end}}
{{define "description"}}
h2. Summary
*source*: {{.Source}}
*alert*: {{.Name}}
*metric*: {{.Metric}}
*time*: {{FmtUnixTime .Time}}
*rule*: {{`{{`}}{{.RuleExpr}}{{`}}`}}
*agent*: {{.AgentHostname}}
h2. Details
{{range .MetricDimensions}}
- {{.Name}}:{{.Value}}
{{end}}
{{range .MetricReadings}}
- {{.Name}}:{{.Value}}
{{end}}
{{end}}
{{define "reopen-comment"}}re opening {{end}}
{{define "renotify-comment"}}still firing{{end}}
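A template file made of named `define` blocks like the one above can be evaluated field by field with `ExecuteTemplate`. The sketch below shows the idea with a reduced set of fields. The `Alert` type, the `tagValue` stand-in for `TagValue` and the helper names are assumptions for illustration; how the product itself loads and evaluates these templates is not described here.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

type KV struct{ Name, Value string }

// Alert carries only the fields this sketch needs.
type Alert struct {
	Name       string
	MetricTags []KV
}

// tagValue is a stand-in for TagValue: the tag's value, or def if absent.
func tagValue(a Alert, name, def string) string {
	for _, kv := range a.MetricTags {
		if kv.Name == name {
			return kv.Value
		}
	}
	return def
}

const jiraTemplates = `
{{define "type"}}Bug{{end}}
{{define "priority"}}{{TagValue . "severity" "Medium - Sev 3"}}{{end}}
{{define "summary"}}{{.Name}}{{end}}
`

// parseTemplates registers the helper function, then parses the named templates.
func parseTemplates(text string) (*template.Template, error) {
	return template.New("jira").Funcs(template.FuncMap{"TagValue": tagValue}).Parse(text)
}

// renderField executes one named template and trims surrounding whitespace.
func renderField(t *template.Template, field string, a Alert) (string, error) {
	if t.Lookup(field) == nil {
		return "", fmt.Errorf("template %q is not defined", field)
	}
	var sb strings.Builder
	if err := t.ExecuteTemplate(&sb, field, a); err != nil {
		return "", err
	}
	return strings.TrimSpace(sb.String()), nil
}

func main() {
	t, err := parseTemplates(jiraTemplates)
	if err != nil {
		panic(err)
	}
	a := Alert{Name: "disk utilization", MetricTags: []KV{{"severity", "High - Sev 2"}}}
	for _, field := range []string{"type", "priority", "summary", "id"} {
		v, err := renderField(t, field, a)
		fmt.Printf("%s: %q %v\n", field, v, err)
	}
}
```

Checking `Lookup` first makes it possible to tell an optional template that was left out, such as `id`, apart from one that rendered to an empty string.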
# Email
Your template file should contain the following templates:
Name | Required | Purpose |
---|---|---|
subject | Yes | text to appear in email subject |
body | Yes | text to appear in email body |
Example:
{{- /* set subject to "[customer][priority][source] alert":
customer from CUSTOMER env var or "no-conf"
priority from alert tag severity or env var NOTIFICATION_PRIORITY or "Sev 3"
source from alert data
alert from alert data
*/ -}}
{{define "subject"}}
{{$priority := Env "NOTIFICATION_PRIORITY" "Sev 3"}}
{{- "["}}{{Env "CUSTOMER" "no-conf"}}]
{{- "["}}{{TagValue . "severity" $priority}}]
{{- "["}}{{.Source}}]
{{- " "}}{{.Name -}}
{{end}}
{{define "body"}}
<!doctype html>
<html>
<head>
<meta name="viewport" content="width=device-width">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Alert Notification</title>
<style>
</style>
</head>
<body class="" style="background-color: #f6f6f6; font-family: sans-serif; -webkit-font-smoothing: antialiased; font-size: 14px; line-height: 1.4; margin: 0; padding: 0; -ms-text-size-adjust: 100%; -webkit-text-size-adjust: 100%;">
<h2>Summary</h2>
<b>source</b>: {{.Source}}
<b>alert</b>: {{.Name}}
<b>metric</b>: {{.Metric}}
<b>time</b>: {{FmtUnixTime .Time}}
<b>rule</b>: {{`{{`}}{{.RuleExpr}}{{`}}`}}
<b>agent</b>: {{.AgentHostname}}
<h2>Details</h2>
{{range .MetricDimensions}}
- {{.Name}}:{{.Value}}
{{end}}
{{range .MetricReadings}}
- {{.Name}}:{{.Value}}
{{end}}
</body>
</html>
{{end}}
# Slack
Slack messages are posted using Incoming Webhooks to channels in your Slack workspace. Follow Sending messages using Incoming Webhooks to set up a Slack app and create webhooks for the channels that should receive alert notifications. The webhook URLs that you create should look similar to `https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX` and should be configured inside your Slack notification routes.
Your template files should contain the following templates:
Name | Required | Purpose |
---|---|---|
message | Yes | The Slack message template, formatted according to Formatting text in messages |
Example:
{{/* Sample Slack template */}}
{{define "message"}}
{
"attachments": [
{
"fallback": "{{.Name}}",
"color": "#36a64f",
"pretext": "Alert details",
"author_name": "Polaris Monitoring",
"title": "{{.Name}}",
"fields": [
{
"title": "Source",
"value": "{{.Source | Slack}}",
"short": false
},
{
"title": "Metric",
"value": "`{{.Metric | Slack}}`",
"short": false
},
{
"title": "Rule",
"type": "mrkdwn",
"value": "`{{.RuleExpr | Slack}}`",
"short": false
},
{
"title": "Details",
"type": "mrkdwn",
"value": "{{range BlackList .MetricDimensions "sensitive"}}\n>`{{.Name | Slack}}` {{.Value | CollapseNewLines "\\n>" | Slack}}{{end}}{{range WhiteList .MetricReadings "field-1" "field-2"}}\n>`{{.Name}}` {{.Value | CollapseNewLines "\\n>" | Slack}}{{end}}",
"short": false
}
],
"footer": "Polaris",
"footer_icon": "https://platform.slack-edge.com/img/default_application_icon.png",
"ts": {{.Time}}
}
]
}
{{end}}