Alert Notifications

  • Last Updated 3/9/2021, 11:49:41 AM UTC

# Challenges

# Alert notification multiplicity

How many distinct notifications should the system generate in response to a sequence of alerts for the same event? This depends on the context of the alert. For example, when monitoring disk utilization, an alert is generated once a utilization threshold is exceeded, and the same alert instance remains in effect until utilization drops back below the threshold. The system should therefore not issue a new notification for subsequent observations once the first notification has been emitted. On the other hand, when monitoring something like the status of an application, we might want to treat each occasion on which the application goes down as a separate notification event.

# Alert notification flapping

An observation often fluctuates above and below an alerting threshold. The system needs to identify this flapping behavior so that it does not overwhelm you with false or repeated notifications.

# Alert Clearing

How does the system know that an alert is no longer firing?

  • Untagged observation: a metric observation arrives that is not tagged as an alert. For example, a /myrmex/system/disk/fs/mount observation is below the alert threshold. This does not apply to all types of observations; for example, an observation that is emitted only when an entry is found in a log file has no non-alert counterpart.
  • Timeout: no alert for a specific observation arrives for a specified amount of time.
  • External: an external actor explicitly tells the system that an alert is no longer active.

# Notification Throttling Configuration

# Hold Duration

Helps decide whether a condition is really an alert before sending a notification. The system accumulates observations over the hold duration and, at the end of this period, decides whether to send a notification based on the trigger ratio: the number of observations tagged as alerts divided by the total number of observations received during the period. A notification is sent when this fraction is greater than or equal to the configured trigger ratio. A trigger ratio of 1 means that a single non-alert observation suppresses the notification. A trigger ratio of 0 means that one or more alert-tagged observations are enough to send a notification.

Example #1:

  • Trigger Ratio: 0.5
  • #alert tags: 3
  • #no alert tags: 1
  • 3/4 >= 0.5 => notify

Example #2:

  • Trigger Ratio: 1
  • As soon as an observation with no alert tag is received the notification is suppressed.
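The decision made at the end of the hold period can be sketched in a few lines of Go. This is only an illustration of the rule described above, not the product's implementation: the function name `shouldNotify` and its parameters are hypothetical.

```go
package main

import "fmt"

// shouldNotify reports whether a notification should be sent at the end of
// the hold period. alerts and nonAlerts are the numbers of observations with
// and without an alert tag; ratio is the configured Trigger Ratio.
// Hypothetical helper: name and signature are assumptions for illustration.
func shouldNotify(alerts, nonAlerts int, ratio float64) bool {
	if alerts == 0 {
		// No observation was tagged as an alert, so there is nothing to report.
		return false
	}
	total := alerts + nonAlerts
	return float64(alerts)/float64(total) >= ratio
}

func main() {
	fmt.Println(shouldNotify(3, 1, 0.5)) // Example #1: 3/4 >= 0.5, prints true
	fmt.Println(shouldNotify(3, 1, 1))   // Example #2: one non-alert observation, prints false
	fmt.Println(shouldNotify(1, 9, 0))   // ratio 0: a single alert is enough, prints true
}
```

The explicit `alerts == 0` check is needed because a ratio of 0 would otherwise be satisfied even when no alert-tagged observation was received.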

# Expires Duration

Determines how long an alert notification remains in effect, which reduces notification multiplicity. Once a notification has been issued (the hold duration has elapsed), the alert enters the active state. The active state is maintained until the configured expires duration has elapsed:

  • If an observation tagged as alert is received during this period, the expiry is reset to the time of that observation plus the configured expires duration. No new notification is sent unless the re-notification interval has elapsed since the last notification.
  • At the end of the expires duration, the alert is considered no longer active. New observations tagged as alert enter the holding state described above and are treated as new alert notifications.

# Re-notification Interval

How long to wait before sending a re-notification for the same alert while in active state.

# Defaults

  • Hold Duration: 2m
  • Expires Duration: 5m
  • Renotification Interval: 10m
  • Trigger Ratio: 1

# Zero Values

  • Hold Duration: notify immediately
  • Expires Duration: treat all alerts as new
  • Renotification Interval: renotify immediately
  • Trigger Ratio: notify if at least one observation in the hold period was tagged as alert

# Example Configurations

# Example #1

  • Hold Duration: 1m
    • Trigger Ratio: 0.5
  • Expires Duration: 30m
  • Renotification Interval: 10m

| Time | Alert Tag | Notification | Timeout | State |
|---|---|---|---|---|
| 00:00:00 | Yes | No | N/A | Hold |
| 00:00:10 | Yes | No | N/A | Hold |
| 00:00:20 | No | No | N/A | Hold |
| 00:00:30 | Yes | No | N/A | Hold |
| 00:00:40 | No | No | N/A | Hold |
| 00:00:50 | Yes | No | N/A | Hold |
| 00:01:00 | Yes | Yes | 00:31:00 | Active |
| 00:01:10 | Yes | No | 00:31:10 | Active |
| 00:01:20 | Yes | No | 00:31:20 | Active |
| 00:01:30 | No | No | 00:31:20 | Active |
| 00:01:40 | No | No | 00:31:20 | Active |
| ... | ... | ... | ... | ... |
| 00:11:00 | Yes | Yes | 00:41:00 | Active |
| ... | ... | ... | ... | ... |
| 00:41:00 | No | No | N/A | N/A |
| ... | ... | ... | ... | ... |
| 01:00:00 | Yes | No | N/A | Hold |
| 01:00:10 | Yes | No | N/A | Hold |
| 01:00:20 | No | No | N/A | Hold |
| 01:00:30 | Yes | No | N/A | Hold |
| 01:00:40 | No | No | N/A | Hold |
| 01:00:50 | Yes | No | N/A | Hold |
| 01:01:00 | Yes | Yes | 01:31:00 | Active |
| 01:01:10 | Yes | No | 01:31:10 | Active |
| 01:01:20 | Yes | No | 01:31:20 | Active |
| 01:01:30 | No | No | 01:31:20 | Active |
| 01:01:40 | No | No | 01:31:20 | Active |
| ... | ... | ... | ... | ... |

# Example #2

  • Hold Duration: 0m
    • Trigger Ratio: n/a
  • Expires Duration: 30m
  • Renotification Interval: 10m

| Time | Alert Tag | Notification | Timeout | State |
|---|---|---|---|---|
| 00:00:00 | Yes | Yes | 00:30:00 | Active |
| 00:00:10 | Yes | No | 00:30:10 | Active |
| 00:00:20 | No | No | 00:30:10 | Active |
| 00:00:30 | Yes | No | 00:30:30 | Active |
| 00:00:40 | No | No | 00:30:30 | Active |
| 00:00:50 | Yes | No | 00:30:50 | Active |
| 00:01:00 | Yes | No | 00:31:00 | Active |
| 00:01:10 | Yes | No | 00:31:10 | Active |
| 00:01:20 | Yes | No | 00:31:20 | Active |
| 00:01:30 | No | No | 00:31:20 | Active |
| 00:01:40 | No | No | 00:31:20 | Active |
| ... | ... | ... | ... | ... |
| 00:10:00 | Yes | Yes | 00:40:00 | Active |
| ... | ... | ... | ... | ... |
| 00:40:00 | No | No | N/A | N/A |

# Unique identification for alerts

Alerts are uniquely identified by the following properties:

  • metric namespace, e.g. /myrmex/system/disk/fs/mount
  • the metric source, e.g. myhost.mydomain.com
  • the metric dimensions, e.g. ['/user', '/dev/sda1']
  • the timestamp that the alert was initiated
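One way to turn these four properties into a single opaque identifier is to hash them. The Go sketch below is an assumption made for illustration: the document does not specify how the alert ID is actually derived, and the `alertKey` type, the separators, and the choice of SHA-256 are all hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// alertKey holds the four identifying properties listed above.
// Hypothetical type for illustration.
type alertKey struct {
	Namespace  string   // e.g. /myrmex/system/disk/fs/mount
	Source     string   // e.g. myhost.mydomain.com
	Dimensions []string // e.g. ["/user", "/dev/sda1"]
	StartedAt  int64    // unix seconds at which the alert was initiated
}

// id derives a stable identifier by hashing the properties. Control
// characters are used as separators so that field boundaries stay unambiguous.
func (k alertKey) id() string {
	parts := []string{
		k.Namespace,
		k.Source,
		strings.Join(k.Dimensions, "\x1f"),
		fmt.Sprint(k.StartedAt),
	}
	sum := sha256.Sum256([]byte(strings.Join(parts, "\x1e")))
	return hex.EncodeToString(sum[:8]) // first 8 bytes, 16 hex characters
}

func main() {
	k := alertKey{
		Namespace:  "/myrmex/system/disk/fs/mount",
		Source:     "myhost.mydomain.com",
		Dimensions: []string{"/user", "/dev/sda1"},
		StartedAt:  1615290581,
	}
	fmt.Println(k.id())
}
```

The same properties always produce the same identifier, while a new alert for the same metric and source gets a different one because its start timestamp differs.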

When sending an alert notification to a third-party system, we use these properties to uniquely identify the alert so that the system can expose integration endpoints which allow actions such as:

  • Confirmation of receipt
  • Workflow milestones from 3rd party system, for example:
    • someone is working on this alert so do not notify again
    • alert has been resolved so clear its state
  • Send re-notifications
  • Send cancellations when the alert is no longer firing
  • Fetch alert details

# Notification templates

Notification templates must be stored in your git repo under the path <assets root>/notifications/templates/. They can be organized in any directory structure under this path. The Arisant distribution repo https://github.com/arisant/myrmex-dist.git contains default templates for email and Jira notifications. The template file format is based on Go text templates. For full template specification details, see Golang Templates.

The following variables are available for use inside your templates:

| Name | Type | Description |
|---|---|---|
| Name | string | name of the alert |
| AlertID | string | unique identifier for the alert |
| Metric | string | fully qualified name of the metric that raised the alert |
| Source | string | the system that generated the alert, e.g. hostname |
| Time | int64 | Unix time (seconds since 1970-01-01 UTC) at which the alert was raised |
| RuleExpr | string | rule expression that triggered the alert |
| MetricDimensions | list | list of metric dimension Name/Value pairs |
| MetricReadings | list | list of metric reading Name/Value pairs |
| AgentHostname | string | hostname of the agent that generated the alert |
| AgentLabel | string | label that was assigned to the agent that generated the alert |
| Tenant | string | tenant name assigned to the agent |
| MetricTags | list | metric tags as a list of Name/Value pairs |
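To show how these variables are addressed from a template, here is a small Go sketch that renders a template against a stand-in data value. The `alertData` and `kv` types are hypothetical: they mirror a subset of the variable names above, and the sample values are made up. The actual types used by the product may differ.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// kv is a hypothetical Name/Value pair used for dimensions and readings.
type kv struct {
	Name  string
	Value string
}

// alertData mirrors a subset of the template variables listed above.
type alertData struct {
	Name             string
	AlertID          string
	Metric           string
	Source           string
	Time             int64
	RuleExpr         string
	MetricDimensions []kv
	MetricReadings   []kv
}

// render parses text as a Go text template and executes it against data.
func render(text string, data alertData) (string, error) {
	t, err := template.New("example").Parse(text)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	data := alertData{
		Name:             "disk usage high", // sample values, not real output
		Metric:           "/myrmex/system/disk/fs/mount",
		Source:           "myhost.mydomain.com",
		MetricDimensions: []kv{{Name: "mount", Value: "/user"}},
	}
	out, err := render("{{.Name}} on {{.Source}}{{range .MetricDimensions}} [{{.Name}}={{.Value}}]{{end}}", data)
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // disk usage high on myhost.mydomain.com [mount=/user]
}
```

Inside `{{range .MetricDimensions}}` the dot refers to the current pair, so `.Name` there is the dimension name rather than the alert name.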

The following functions are available for use inside your templates:

| Name | Arguments | Description |
|---|---|---|
| Env | var, default | returns the value of environment variable var, or default if it is not found |
| TagValue | alert, tagName, default | returns the value of tag tagName from the alert's tags KV array, or default if the tag name is not found |
| FmtUnixTime | timestamp | formats a Unix time as RFC 3339 |
| DimAlias | alert, dimName, dimValue | returns the alias for dimension dimName of data.Metric with value dimValue, or "" if no alias exists |
| DimAliasWithValue | alert, dimName, dimValue, format | formats the dimension alias and dimension value with format if the alias exists; otherwise returns the dimension value. format is called with 2 arguments: the first is the alias and the second is the value. The format string must use golang fmt package verbs |
| WhiteList | list (MetricDimensions, MetricReadings or MetricTags), wl ...string | returns a KV array containing only the entries whose keys are in wl |
| BlackList | list (MetricDimensions, MetricReadings or MetricTags), bl ...string | returns a KV array containing only the entries whose keys are not in bl |
| Slack | string | escapes Slack special characters |

# DimAlias Examples

{{$data := .}}
{{range .MetricDimensions}}
{{- /*
  print ".Name: .Value (alias)"
  - given dim name = "url", dim value = "www.some.com/very/long/url?with_params=param" and dim alias = "Super API", it will print:
    url: www.some.com/very/long/url?with_params=param (Super API)

  - given dim name = "url", dim value = "www.some.com/very/long/url?with_params=other" and no dim alias defined, it will print:
    url: www.some.com/very/long/url?with_params=other ()
*/}}
- {{.Name}}: {{.Value}} ({{DimAlias $data .Name .Value}})
{{- /*
  print ".Name: alias" if alias exists, otherwise ".Name: .Value"
  - given dim name = "url", dim value = "www.some.com/very/long/url?with_params=param" and dim alias = "Super API", it will print:
    url: Super API

  - given dim name = "url", dim value = "www.some.com/very/long/url?with_params=other" and no dim alias defined, it will print:
    url: www.some.com/very/long/url?with_params=other
*/}}
- {{$a := DimAlias $data .Name .Value}}{{.Name}}: {{or $a .Value}}
{{- /*
  print ".Name: alias (.Value)" if dim alias exists, otherwise ".Name: .Value"
  - given dim name = "url", dim value = "www.some.com/very/long/url?with_params=param" and dim alias = "Super API", it will print:
    url: Super API (www.some.com/very/long/url?with_params=param)

  - given dim name = "url", dim value = "www.some.com/very/long/url?with_params=other" and no dim alias defined, it will print:
    url: www.some.com/very/long/url?with_params=other
*/}}
- {{$a := DimAlias $data .Name .Value}}{{.Name}}: {{or (and $a (printf "%s (%s)" $a .Value)) .Value}}
{{end}}

# DimAliasWithValue Examples

{{$data := .}}
{{range .MetricDimensions}}
{{- /*
  - given dim name = "url", dim value = "www.some.com/very/long/url?with_params=param" and dim alias = "Super API", it will print:
    url: Super API (www.some.com/very/long/url?with_params=param)

  - given dim name = "url", dim value = "www.some.com/very/long/url?with_params=other" and no dim alias defined, it will print:
    url: www.some.com/very/long/url?with_params=other
*/}}
- {{.Name}}: {{DimAliasWithValue $data .Name .Value "%s (%s)"}}
{{end}}
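To try a template like this outside the product, you can register a stand-in for `DimAliasWithValue`. The Go sketch below follows the documented behavior (format alias and value if an alias exists, otherwise return the value), but everything else is an assumption: the `alertData` type, the `aliases` map, and the way aliases are looked up are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

type kv struct{ Name, Value string }

// alertData is a hypothetical stand-in for the template data. The aliases
// map (keyed by "name=value") is an assumption made for this sketch.
type alertData struct {
	MetricDimensions []kv
	aliases          map[string]string
}

// dimAliasWithValue formats alias and value with format when an alias
// exists, and returns the plain value otherwise.
func dimAliasWithValue(a alertData, name, value, format string) string {
	if alias, ok := a.aliases[name+"="+value]; ok {
		return fmt.Sprintf(format, alias, value)
	}
	return value
}

const tmplText = `{{$data := .}}{{range .MetricDimensions}}- {{.Name}}: {{DimAliasWithValue $data .Name .Value "%s (%s)"}}
{{end}}`

// render executes tmplText against data with the stand-in registered.
func render(data alertData) (string, error) {
	funcs := template.FuncMap{"DimAliasWithValue": dimAliasWithValue}
	t, err := template.New("dims").Funcs(funcs).Parse(tmplText)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	data := alertData{
		MetricDimensions: []kv{{"url", "www.some.com/a"}, {"host", "db01"}},
		aliases:          map[string]string{"url=www.some.com/a": "Super API"},
	}
	out, err := render(data)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
	// - url: Super API (www.some.com/a)
	// - host: db01
}
```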

# Jira

Your template file should contain the following templates corresponding to the various Jira ticket fields or enumerators:

| Name | Required | Purpose |
|---|---|---|
| summary | Yes | text to appear in the summary field |
| description | Yes | text to appear in the description field |
| type | Yes | issue type the issue will be opened as |
| priority | Yes | priority the issue will be opened as |
| reopen-comment | Yes | the comment to add when re-opening an existing issue |
| renotify-comment | Yes | the comment to add when renotifying on an existing issue |
| labels | Yes | the labels to assign when opening an issue |
| id | No | the custom field name where the alert unique id will be stored. This is used to check for closed issues that match the alert id |

Example:

{{define "type"}}Bug{{end}}

{{define "id"}}customfield_12038{{end}}

{{define "priority"}}{{TagValue . "severity" "Medium - Sev 3"}}{{end}}

{{define "labels"}}DB, ETL, mrx{{end}}

{{define "summary"}}{{.Name}}{{end}}

{{define "description"}}
h2. Summary

*source*: {{.Source}}
*alert*: {{.Name}}
*metric*: {{.Metric}}
*time*: {{FmtUnixTime .Time}}
*rule*: {{`{{`}}{{.RuleExpr}}{{`}}`}}
*agent*: {{.AgentHostname}}

h2. Details

{{range .MetricDimensions}}
- {{.Name}}:{{.Value}}
{{end}}
{{range .MetricReadings}}
- {{.Name}}:{{.Value}}
{{end}}
{{end}}

{{define "reopen-comment"}}re opening {{end}}

{{define "renotify-comment"}}still firing{{end}}

# Email

Your template file should contain the following templates:

| Name | Required | Purpose |
|---|---|---|
| subject | Yes | text to appear in the email subject |
| body | Yes | text to appear in the email body |

Example:

{{- /* set subject to "[customer][priority][source] alert":
         customer from CUSTOMER env var or "no-conf"
         priority from alert tag severity or env var NOTIFICATION_PRIORITY or "Sev 3"
         source from alert data
         alert from alert data
*/ -}}  
{{define "subject"}}
  {{- $priority := Env "NOTIFICATION_PRIORITY" "Sev 3"}}
  {{- "["}}{{Env "CUSTOMER" "no-conf"}}]
  {{- "["}}{{TagValue . "severity" $priority}}]
  {{- "["}}{{.Source}}]
  {{- " "}}{{.Name -}}
{{end}}

{{define "body"}}
<!doctype html>
<html>
  <head>
    <meta name="viewport" content="width=device-width">
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>Alert Notification</title>
    <style>
    </style>
  </head>
  <body class="" style="background-color: #f6f6f6; font-family: sans-serif; -webkit-font-smoothing: antialiased; font-size: 14px; line-height: 1.4; margin: 0; padding: 0; -ms-text-size-adjust: 100%; -webkit-text-size-adjust: 100%;">
<h2>Summary</h2>

<b>source</b>: {{.Source}}
<b>alert</b>: {{.Name}}
<b>metric</b>: {{.Metric}}
<b>time</b>: {{FmtUnixTime .Time}}
<b>rule</b>: {{`{{`}}{{.RuleExpr}}{{`}}`}}
<b>agent</b>: {{.AgentHostname}}

<h2>Details</h2>

{{range .MetricDimensions}}
- {{.Name}}:{{.Value}}
{{end}}
{{range .MetricReadings}}
- {{.Name}}:{{.Value}}
{{end}}
  </body>
</html>
{{end}}

# Slack

Slack messages are posted using Incoming Webhooks to channels in your Slack workspace. Follow Sending messages using Incoming Webhooks to set up a Slack app and create webhooks for the channels that should receive alert notifications. The webhook URLs that you create should look similar to https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX and should be configured inside your Slack notification routes.

Your template files should contain the following templates:

| Name | Required | Purpose |
|---|---|---|
| message | Yes | the Slack message template, according to Formatting text in messages |

Example:

{{/* Sample Slack template */}}


{{define "message"}}
{
    "attachments": [
        {
            "fallback": "{{.Name}}",
            "color": "#36a64f",
            "pretext": "Alert details",
            "author_name": "Polaris Monitoring",
            "title": "{{.Name}}",
            "fields": [
                {
                    "title": "Source",
                    "value": "{{.Source | Slack}}",
                    "short": false
                },
                {
                    "title": "Metric",
                    "value": "`{{.Metric | Slack}}`",
                    "short": false
                },
                {
                    "title": "Rule",
                    "type": "mrkdwn",
                    "value": "`{{.RuleExpr | Slack}}`",
                    "short": false
                },
                {
                    "title": "Details",
                    "type": "mrkdwn",
                    "value": "{{range BlackList .MetricDimensions "sensitive"}}\n>`{{.Name | Slack}}` {{.Value | CollapseNewLines "\\n>" | Slack}}{{end}}{{range WhiteList .MetricReadings "field-1" "field-2"}}\n>`{{.Name}}` {{.Value  | CollapseNewLines "\\n>" | Slack}}{{end}}",
                    "short": false
                }
            ],
            "footer": "Polaris",
            "footer_icon": "https://platform.slack-edge.com/img/default_application_icon.png",
            "ts": {{.Time}}
        }
    ]
}
{{end}}