Network endpoint health checks

  • Last Updated 3/31/2023, 12:34:01 PM UTC

Plugin info

name: net-healthcheck

Performs active health checks against one or more network endpoints

  • Supported health checks are:
    • TCP
    • UDP
    • ICMP
    • HTTP
  • All checks are performed concurrently
  • An endpoint can be in one of the following states:
    • UNKNOWN, when the system does not have enough information to deduce state
    • HEALTHY
    • UNHEALTHY
  • An endpoint starts in UNKNOWN state and then transitions between HEALTHY and UNHEALTHY:
    • UNKNOWN -> HEALTHY, after success_threshold number of consecutive successful checks
    • UNKNOWN -> UNHEALTHY, after failure_threshold number of consecutive failed checks
    • HEALTHY -> UNHEALTHY, after failure_threshold number of consecutive failed checks
    • UNHEALTHY -> HEALTHY, after success_threshold number of consecutive successful checks
  • TCP and UDP health checks support connect-only checks and payload checks. With a payload check, the plugin connects to the endpoint, sends a text or binary payload, and optionally receives a response, which can then be compared against an expected text or binary value.
  • HTTP health checks invoke URL endpoints and check the response HTTP status code and optionally the response body against expected values. Redirects are never followed.
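
The state transitions listed above can be sketched as a small counter-based state machine. This is an illustrative sketch only, not the plugin's implementation: the EndpointState class and its record method are hypothetical names, and resetting the opposite counter on every result is an assumption drawn from the word "consecutive".

```python
from dataclasses import dataclass

UNKNOWN, HEALTHY, UNHEALTHY = "UNKNOWN", "HEALTHY", "UNHEALTHY"


@dataclass
class EndpointState:
    """Hypothetical tracker for one endpoint (not the plugin's actual code)."""

    success_threshold: int = 1
    failure_threshold: int = 1
    status: str = UNKNOWN
    successes: int = 0  # current run of consecutive successful checks
    failures: int = 0   # current run of consecutive failed checks

    def record(self, ok: bool) -> str:
        """Feed one check result and return the resulting status."""
        if ok:
            self.successes += 1
            self.failures = 0  # assumption: a success breaks any failure streak
            if self.successes >= self.success_threshold:
                self.status = HEALTHY
        else:
            self.failures += 1
            self.successes = 0  # assumption: a failure breaks any success streak
            if self.failures >= self.failure_threshold:
                self.status = UNHEALTHY
        return self.status


# Thresholds taken from the "host connectivity" example check.
state = EndpointState(success_threshold=3, failure_threshold=5)
history = [state.record(ok) for ok in (True, True, False, True, True, True)]
print(history)
```

In this run the single failure resets the success streak, so the endpoint stays UNKNOWN until the third consecutive success at the end of the sequence.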

# Prerequisites

ICMP checks on Linux can be privileged or unprivileged (the default). Privileged checks require the plugin to run as root (not recommended). Unprivileged checks require the plugin to run with a group ID (gid) that falls within the range set by the net.ipv4.ping_group_range kernel parameter. Restrict that range to the minimum possible:

# find your group ids
$ id polaris
uid=1001(polaris) gid=1001(polaris) groups=991(docker),1001(polaris),8001(pingers)
# check the current value of net.ipv4.ping_group_range (the default "1 0" allows no group)
$ sysctl net.ipv4.ping_group_range
net.ipv4.ping_group_range = 1 0
# allow group 8001 (pingers), which polaris belongs to, to send unprivileged pings (run as root)
$ sysctl net.ipv4.ping_group_range='8001 8001'
# persist the change across reboots (run as root)
$ cat << 'EOF' >> /etc/sysctl.d/99-allow-ping.conf
net.ipv4.ping_group_range=8001 8001
EOF
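
To double-check which of a user's groups would satisfy a given net.ipv4.ping_group_range value, the range check can be sketched in a few lines. The helper name gids_allowed_to_ping is hypothetical; the sketch relies only on the kernel parameter holding an inclusive "low high" pair, so the default "1 0" matches no group.

```python
from pathlib import Path

# Location of the kernel parameter on Linux hosts.
PING_RANGE_FILE = Path("/proc/sys/net/ipv4/ping_group_range")


def gids_allowed_to_ping(range_text, gids):
    """Return the gids that fall inside the inclusive "low high" range."""
    low, high = (int(part) for part in range_text.split())
    return sorted(gid for gid in gids if low <= gid <= high)


# Group ids of the polaris user, taken from the `id` output above.
polaris_gids = [991, 1001, 8001]

print(gids_allowed_to_ping("1 0", polaris_gids))        # default range: no group matches
print(gids_allowed_to_ping("8001 8001", polaris_gids))  # only the pingers group matches

# On a Linux host, the live value can be checked against the current process:
#   import os
#   gids_allowed_to_ping(PING_RANGE_FILE.read_text(), {os.getegid(), *os.getgroups()})
```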

# Events

None

# Metrics

  • net/healthcheck/status, the current status of a health check: one of UNKNOWN, HEALTHY, or UNHEALTHY. Each health check status is identified by:
    • name, the health check name
    • origin, the hostname the health check originated from
    • source, the health check target, which depends on the check type:
      • host:port for tcp, udp
      • host for icmp
      • url for http

# TLDR

# list of named checks
checks:
  # checks host connectivity using a connect only tcp check
  - name: host connectivity               
    check: tcp                            # using tcp
    interval: 1m                          # check every minute
    timeout: 10s                          # bail out after 10s if no response
    failure_threshold: 5                  # set status to UNHEALTHY after 5 consecutive failures 
    success_threshold: 3                  # set status to HEALTHY after 3 consecutive successes
    port: 22                              # default port to connect to
    targets:                              # the hosts to check
      - host: host1.intl.net 
      - host: host2.intl.net    
        port: 443                         # hit port 443 instead of 22
        timeout: 30s                      # set timeout to 30s instead of 10s

  # ping check a bunch of hosts
  - name: host reachability
    check: icmp                           # using ICMP
    privileged: false                     # unprivileged icmp
    interval: 1m30s                       # every 90s
    timeout: 1s                           # with timeout 1s
    targets:                              # the hosts to ping
      - host: host1.intl.net
      - host: host2.intl.net

  # checks postgres connection startup without server starting a worker process 
  - name: postgres connection    
    check: tcp                            # using tcp
    interval: 1m                          # at 1m intervals
    timeout: 10s                          # with timeout 10s
    failure_threshold: 3                  # UNHEALTHY after 3 consecutive failures 
    success_threshold: 1                  # HEALTHY after 1 consecutive success 
    port: 5432                            # hit port 5432
    send: !!binary |                      # send SSL negotiation startup packet - 0000000804d2162f
      AAAACATSFi8=
    receive: S                            # must receive "S"
    targets:                              # the pg servers to check
      - host: prod-db1.intl.net
      - host: prod-db2.intl.net
      - host: prod-db3.intl.net

  # checks connection to DNS servers
  - name: DNS connectivity    
    check: udp                            # using udp     
    interval: 60s                         # at 60s intervals
    timeout: 5s                           # with 5s timeout
    failure_threshold: 5                  # UNHEALTHY after 5 consecutive failures
    success_threshold: 2                  # HEALTHY after 2 consecutive successes
    port: 53                              # on port 53
    targets:                              # the DNS servers to check
      - host: dns1.intl.net
      - host: dns2.intl.net
      - host: dns3.intl.net
  
  # checks connection to ERP cluster using HTTP
  - name: ERP Cluster  
    check: http                            # using http                           
    interval: 30s                          # at 30s intervals
    timeout: 5s                            # with 5s timeout
    failure_threshold: 10                  # UNHEALTHY after 10 consecutive failures
    success_threshold: 3                   # HEALTHY after 3 consecutive successes
    method: GET                            # use GET requests
    headers: 
      x-something: x-something-value       # add some required headers
    expected_statuses: 200-500             # must respond with HTTP 200 through to 499
    tls_skip_verify: true                  # do not validate SSL certs
    targets:                               # the ERP cluster members
      - url: https://erp-1.int.net/health
      - url: https://erp-2.int.net/health
      - url: https://erp-3.int.net/health
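
The base64 payload in the "postgres connection" check can be reproduced and verified with a short script. This sketch assumes the standard PostgreSQL SSLRequest message layout (a 4-byte length of 8 followed by the 4-byte request code 80877103, both big-endian), to which a server answers with a single byte, S or N.

```python
import base64
import struct

# SSLRequest: int32 message length (8), then int32 request code 80877103,
# which is 1234 in the high 16 bits and 5679 in the low 16 bits.
packet = struct.pack("!II", 8, (1234 << 16) | 5679)

print(packet.hex())                       # hex form quoted in the YAML comment
print(base64.b64encode(packet).decode())  # value to place under `send: !!binary`
```

The same approach (build the bytes, then base64-encode them) works for any other binary payload placed under send.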

# Configuration

The plugin is configured with a list of named checks. Each check can target one or more endpoints.

| Name | Type | Required | Default | Description |
|------|------|----------|---------|-------------|
| checks | list[TCP / UDP / ICMP / HTTP Check Configuration] | No | | The health checks to perform |

# TCP Check Configuration

| Name | Type | Required | Default | Description |
|------|------|----------|---------|-------------|
| name | string | Yes | | The name of the check as it will appear in the generated metrics |
| check | string | Yes | | Set to tcp |
| success_threshold | int | No | 1 | The number of consecutive successful checks before the endpoint is considered healthy |
| failure_threshold | int | No | 1 | The number of consecutive failed checks before the endpoint is considered unhealthy |
| interval | duration string | Yes | | How often to run the health check |
| timeout | duration string | No | 10s | How long to wait for the check to complete |
| secure | boolean | No | false | Use TLS |
| tls_skip_verify | boolean | No | false | Do not perform TLS certificate validation |
| port | int | No | | TCP port to check |
| send | string | No | | Optional text or binary payload to send to the endpoint |
| receive | string | No | | If configured, the endpoint response must match this text or binary value |
| targets | list[TCP Check Target Configuration] | No | | List of endpoints to check |

# TCP Check Target Configuration

Any parameter not configured on a target inherits its value from the parent TCP Check Configuration.

| Name | Type | Required | Description |
|------|------|----------|-------------|
| host | string | Yes | The host to check |
| port | int | No | TCP port to check |
| success_threshold | int | No | The number of consecutive successful checks before the endpoint is considered healthy |
| failure_threshold | int | No | The number of consecutive failed checks before the endpoint is considered unhealthy |
| timeout | duration string | No | How long to wait for the check to complete |
| secure | boolean | No | Use TLS |
| tls_skip_verify | boolean | No | Do not perform TLS certificate validation |
| send | string | No | Optional text or binary payload to send to the endpoint |
| receive | string | No | If configured, the endpoint response must match this text or binary value |
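
The inheritance between a check and its targets can be pictured as a dictionary merge in which target values win. This is a sketch of the documented behaviour, not the plugin's code: effective_settings and INHERITABLE are hypothetical names, and the key list is simply the set of optional parameters from the TCP target table.

```python
# Parameters that the TCP target table lists as overridable per target.
INHERITABLE = (
    "port", "success_threshold", "failure_threshold", "timeout",
    "secure", "tls_skip_verify", "send", "receive",
)


def effective_settings(check: dict, target: dict) -> dict:
    """Return the target's settings, falling back to the parent check."""
    inherited = {key: check[key] for key in INHERITABLE if key in check}
    return {**inherited, **target}  # target values win over inherited ones


# Values taken from the "host connectivity" check in the TLDR section.
check = {
    "name": "host connectivity", "check": "tcp", "interval": "1m",
    "timeout": "10s", "failure_threshold": 5, "success_threshold": 3, "port": 22,
}
targets = [
    {"host": "host1.intl.net"},
    {"host": "host2.intl.net", "port": 443, "timeout": "30s"},
]

for target in targets:
    print(effective_settings(check, target))
```

Here host1.intl.net ends up with port 22 and a 10s timeout from the parent check, while host2.intl.net keeps its own port 443 and 30s timeout and inherits only the thresholds.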

# UDP Check Configuration

| Name | Type | Required | Default | Description |
|------|------|----------|---------|-------------|
| name | string | Yes | | The name of the check as it will appear in the generated metrics |
| check | string | Yes | | Set to udp |
| success_threshold | int | No | 1 | The number of consecutive successful checks before the endpoint is considered healthy |
| failure_threshold | int | No | 1 | The number of consecutive failed checks before the endpoint is considered unhealthy |
| interval | duration string | Yes | | How often to run the health check |
| timeout | duration string | No | 5s | How long to wait for the check to complete |
| port | int | No | | UDP port to check |
| send | string | No | | Optional text or binary payload to send to the endpoint |
| receive | string | No | | If configured, the endpoint response must match this text or binary value |
| targets | list[UDP Check Target Configuration] | No | | List of endpoints to check |

# UDP Check Target Configuration

Any parameter not configured on a target inherits its value from the parent UDP Check Configuration.

| Name | Type | Required | Description |
|------|------|----------|-------------|
| host | string | Yes | The host to check |
| port | int | No | UDP port to check |
| success_threshold | int | No | The number of consecutive successful checks before the endpoint is considered healthy |
| failure_threshold | int | No | The number of consecutive failed checks before the endpoint is considered unhealthy |
| timeout | duration string | No | How long to wait for the check to complete |
| send | string | No | Optional text or binary payload to send to the endpoint |
| receive | string | No | If configured, the endpoint response must match this text or binary value |

# ICMP Check Configuration

| Name | Type | Required | Default | Description |
|------|------|----------|---------|-------------|
| name | string | Yes | | The name of the check as it will appear in the generated metrics |
| check | string | Yes | | Set to icmp |
| success_threshold | int | No | 1 | The number of consecutive successful checks before the endpoint is considered healthy |
| failure_threshold | int | No | 1 | The number of consecutive failed checks before the endpoint is considered unhealthy |
| interval | duration string | Yes | | How often to run the health check |
| timeout | duration string | No | 1s | How long to wait for the check to complete |
| privileged | boolean | No | false | Use privileged ICMP |
| targets | list[ICMP Check Target Configuration] | No | | List of endpoints to check |

# ICMP Check Target Configuration

Any parameter not configured on a target inherits its value from the parent ICMP Check Configuration.

| Name | Type | Required | Description |
|------|------|----------|-------------|
| host | string | Yes | The host to check |
| success_threshold | int | No | The number of consecutive successful checks before the endpoint is considered healthy |
| failure_threshold | int | No | The number of consecutive failed checks before the endpoint is considered unhealthy |
| timeout | duration string | No | How long to wait for the check to complete |
| privileged | boolean | No | Use privileged ICMP |

# HTTP Check Configuration

| Name | Type | Required | Default | Description |
|------|------|----------|---------|-------------|
| name | string | Yes | | The name of the check as it will appear in the generated metrics |
| check | string | Yes | | Set to http |
| success_threshold | int | No | 1 | The number of consecutive successful checks before the endpoint is considered healthy |
| failure_threshold | int | No | 1 | The number of consecutive failed checks before the endpoint is considered unhealthy |
| interval | duration string | Yes | | How often to run the health check |
| timeout | duration string | No | 10s | How long to wait for the check to complete |
| secure | boolean | No | false | Use TLS |
| tls_skip_verify | boolean | No | false | Do not perform TLS certificate validation |
| method | string | No | GET | One of "OPTIONS", "GET", "HEAD", "POST", "PUT", "DELETE" |
| headers | map[string]string | No | | HTTP headers to pass to requests |
| response | string | No | | If set, the HTTP response body is expected to match this value |
| expected_statuses | string | No | 200 | Expected response status ranges. A status range defines the start and end of an HTTP status code range using half-open interval semantics [start, end). The start and end are separated by '-'; the end is optional. Examples: 200-400 expects status codes 200 through 399; 401 expects status code 401 only |
| proxy | string | No | | Optional HTTP proxy URL to use |
| targets | list[HTTP Check Target Configuration] | No | | List of endpoints to check |
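
The half-open semantics of expected_statuses map directly onto Python's range type, which also excludes its end value. The sketch below uses a hypothetical helper, parse_status_range, and handles only the two forms shown in the table (start-end and a single status); how the plugin combines several ranges in one value is not specified here, so that case is left out.

```python
def parse_status_range(spec: str) -> range:
    """Turn "200-400" into range(200, 400) and "401" into range(401, 402)."""
    start, sep, end = spec.strip().partition("-")
    first = int(start)
    # Without an explicit end, the range covers exactly one status code.
    last = int(end) if sep else first + 1
    return range(first, last)


print(399 in parse_status_range("200-400"))  # last status inside the range
print(400 in parse_status_range("200-400"))  # the end itself is excluded
print(list(parse_status_range("401")))       # a single status code
```

Read this way, the 200-500 value used for the ERP Cluster check accepts 499 but rejects 500.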

# HTTP Check Target Configuration

Any parameter not configured on a target inherits its value from the parent HTTP Check Configuration.

| Name | Type | Required | Description |
|------|------|----------|-------------|
| url | string | Yes | The HTTP URL to hit. Supports encrypted secrets |
| success_threshold | int | No | The number of consecutive successful checks before the endpoint is considered healthy |
| failure_threshold | int | No | The number of consecutive failed checks before the endpoint is considered unhealthy |
| interval | duration string | Yes | How often to run the health check |
| timeout | duration string | No | How long to wait for the check to complete |
| secure | boolean | No | Use TLS |
| tls_skip_verify | boolean | No | Do not perform TLS certificate validation |
| method | string | No | One of "OPTIONS", "GET", "HEAD", "POST", "PUT", "DELETE" |
| headers | map[string]string | No | HTTP headers to pass to requests |
| response | string | No | If set, the HTTP response body is expected to match this value |
| expected_statuses | string | No | Expected response status ranges. A status range defines the start and end of an HTTP status code range using half-open interval semantics [start, end). The start and end are separated by '-'; the end is optional. Examples: 200-400 expects status codes 200 through 399; 401 expects status code 401 only |
| proxy | string | No | Optional HTTP proxy URL to use |