Network endpoint health checks

Last Updated 3/31/2023, 12:34:01 PM UTC

Plugin info

name: net-healthcheck

Performs active health checks against one or more network endpoints

Supported health checks are:

- TCP
- UDP
- ICMP
- HTTP

All checks are performed concurrently.

An endpoint can be in one of the following states:

- UNKNOWN, when the system does not have enough information to deduce state
- HEALTHY
- UNHEALTHY

An endpoint starts in the UNKNOWN state and then transitions between HEALTHY and UNHEALTHY:

- UNKNOWN -> HEALTHY, after success_threshold consecutive successful checks
- UNKNOWN -> UNHEALTHY, after failure_threshold consecutive failed checks
- HEALTHY -> UNHEALTHY, after failure_threshold consecutive failed checks
- UNHEALTHY -> HEALTHY, after success_threshold consecutive successful checks

TCP and UDP health checks support connect-only checks and payload checks. With payload checks, the plugin connects to the endpoint, sends a text or binary payload, and optionally receives a response from the endpoint. The response can then be compared against an expected text or binary value.

HTTP health checks invoke URL endpoints and check the response HTTP status code, and optionally the response body, against expected values. Redirects are never followed.
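
The state transitions listed above can be illustrated with a small sketch. This is not the plugin's implementation; the `EndpointState` class and its `record` method are hypothetical names, and the sketch assumes that a single opposite result resets the run of consecutive successes or failures.

```python
from dataclasses import dataclass

UNKNOWN, HEALTHY, UNHEALTHY = "UNKNOWN", "HEALTHY", "UNHEALTHY"


@dataclass
class EndpointState:
    """Hypothetical tracker for one endpoint (illustration only)."""
    success_threshold: int = 1
    failure_threshold: int = 1
    status: str = UNKNOWN
    successes: int = 0  # current run of consecutive successful checks
    failures: int = 0   # current run of consecutive failed checks

    def record(self, ok: bool) -> str:
        """Feed one check result and return the resulting status."""
        if ok:
            self.successes += 1
            self.failures = 0  # assumption: a success resets the failure run
            if self.status != HEALTHY and self.successes >= self.success_threshold:
                self.status = HEALTHY
        else:
            self.failures += 1
            self.successes = 0  # assumption: a failure resets the success run
            if self.status != UNHEALTHY and self.failures >= self.failure_threshold:
                self.status = UNHEALTHY
        return self.status


endpoint = EndpointState(success_threshold=3, failure_threshold=2)
history = [endpoint.record(ok) for ok in (True, True, True, False, False)]
print(history)
```

With these thresholds the endpoint stays UNKNOWN for the first two successes, becomes HEALTHY on the third, stays HEALTHY after one failure, and becomes UNHEALTHY on the second consecutive failure.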

# Prerequisites

ICMP checks on Linux can be privileged or unprivileged (default). Privileged checks require the plugin to run as root (not recommended). Unprivileged checks require the plugin to run with a GID that falls within the range set in the net.ipv4.ping_group_range kernel parameter. Restrict net.ipv4.ping_group_range to the narrowest range possible:

$### find your group ids
$ id polaris
uid=1001(polaris) gid=1001(polaris) groups=991(docker),1001(polaris),8001(pingers)
$### check current value of net.ipv4.ping_group_range (default is none)
$ sysctl net.ipv4.ping_group_range
net.ipv4.ping_group_range = 1 0
$### modify to allow polaris user unprivileged pings
$ sysctl net.ipv4.ping_group_range='8001 8001' 
$### persist changes
$ cat << 'EOF' >> /etc/sysctl.d/99-allow-ping.conf 
net.ipv4.ping_group_range=8001 8001
EOF
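
To double-check the result of the commands above, the following sketch parses a `sysctl net.ipv4.ping_group_range` output line and tests whether any of a user's group IDs falls inside the range, which is inclusive on both ends. The helper names `parse_ping_group_range` and `gid_allowed` are hypothetical and not part of the plugin.

```python
def parse_ping_group_range(sysctl_line: str) -> tuple[int, int]:
    """Parse 'net.ipv4.ping_group_range = LOW HIGH' (or just 'LOW HIGH')."""
    value = sysctl_line.split("=", 1)[-1]
    low, high = (int(part) for part in value.split())
    return low, high


def gid_allowed(gids, low: int, high: int) -> bool:
    """True if any group id lies inside the inclusive range [low, high]."""
    return any(low <= gid <= high for gid in gids)


polaris_gids = [991, 1001, 8001]  # group ids from the `id polaris` output above

# The default "1 0" is an empty range, so nobody may open unprivileged ping sockets.
default_low, default_high = parse_ping_group_range("net.ipv4.ping_group_range = 1 0")
print(gid_allowed(polaris_gids, default_low, default_high))

# After the change, gid 8001 (pingers) is inside the range.
low, high = parse_ping_group_range("net.ipv4.ping_group_range = 8001 8001")
print(gid_allowed(polaris_gids, low, high))
```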

# Events

None

# Metrics

net/healthcheck/status, the current status for a health check. One of UNKNOWN, HEALTHY, UNHEALTHY. Each health check status is identified by:
- name, the health check name
- origin, the hostname that the health check was originated from
- source, the health check target depending on the check type
  - host:port for tcp, udp
  - host for icmp
  - url for http

# TLDR

# list of named checks
checks:
  # checks host connectivity using a connect only tcp check
  - name: host connectivity               
    check: tcp                            # using tcp
    interval: 1m                          # check every minute
    timeout: 10s                          # bail out after 10s if no response
    failure_threshold: 5                  # set status to UNHEALTHY after 5 consecutive failures 
    success_threshold: 3                  # set status to HEALTHY after 3 consecutive successes
    port: 22                              # default port to connect to
    targets:                              # the hosts to check
      - host: host1.intl.net 
      - host: host2.intl.net    
        port: 443                         # hit port 443 instead of 22
        timeout: 30s                      # set timeout to 30s instead of 10s

  # ping check a bunch of hosts
  - name: host reachability
    check: icmp                           # using ICMP
    privileged: false                     # unprivileged icmp
    interval: 1m30s                       # every 90s
    timeout: 1s                           # with timeout 1s
    targets:                              # the hosts to ping
      - host: host1.intl.net
      - host: host2.intl.net

  # checks postgres connection startup without server starting a worker process 
  - name: postgres connection    
    check: tcp                            # using tcp
    interval: 1m                          # at 1m intervals
    timeout: 10s                          # with timeout 10s
    failure_threshold: 3                  # UNHEALTHY after 3 consecutive failures 
    success_threshold: 1                  # HEALTHY after 1 consecutive success 
    port: 5432                            # hit port 5432
    send: !!binary |                      # send SSL negotiation startup packet - 0000000804d2162f
      AAAACATSFi8=
    receive: S                            # must receive "S"
    targets:                              # the pg servers to check
      - host: prod-db1.intl.net
      - host: prod-db2.intl.net
      - host: prod-db3.intl.net

  # checks connection to DNS servers
  - name: DNS connectivity    
    check: udp                            # using udp     
    interval: 60s                         # at 60s intervals
    timeout: 5s                           # with 5s timeout
    failure_threshold: 5                  # UNHEALTHY after 5 consecutive failures
    success_threshold: 2                  # HEALTHY after 2 consecutive successes
    port: 53                              # on port 53
    targets:                              # the DNS servers to check
      - host: dns1.intl.net
      - host: dns2.intl.net
      - host: dns3.intl.net
  
  # checks connection to ERP cluster using HTTP
  - name: ERP Cluster  
    check: http                            # using http                           
    interval: 30s                          # at 30s intervals
    timeout: 5s                            # with 5s timeout
    failure_threshold: 10                  # UNHEALTHY after 10 consecutive failures
    success_threshold: 3                   # HEALTHY after 3 consecutive successes
    method: GET                            # use GET requests
    headers: 
      x-something: x-something-value       # add some required headers
    expected_statuses: 200-500             # must respond with HTTP 200 through to 499
    tls_skip_verify: true                  # do not validate SSL certs
    targets:                               # the ERP cluster members
      - url: https://erp-1.int.net/health
      - url: https://erp-2.int.net/health
      - url: https://erp-3.int.net/health
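
The `!!binary` payload in the postgres check above is the base64 encoding of the 8-byte PostgreSQL SSLRequest message. As a rough sketch of how such a value can be produced for the `send` field, the snippet below builds the packet and encodes it; the variable names are illustrative only.

```python
import base64
import struct

# PostgreSQL SSLRequest: a big-endian int32 length (8) followed by the
# int32 request code 80877103 (1234 in the high 16 bits, 5679 in the low 16).
ssl_request = struct.pack("!II", 8, 80877103)

# YAML's !!binary tag expects the payload as base64 text.
encoded = base64.b64encode(ssl_request).decode("ascii")

print(ssl_request.hex())  # 0000000804d2162f
print(encoded)            # AAAACATSFi8=
```

A server replies to this message with a single byte, `S` or `N`, which is why the example sets `receive: S`.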

# Configuration

The plugin is configured with a list of named checks. Each check can target one or more endpoints.

| Name | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `checks` | list[TCP\|UDP\|ICMP\|HTTP Check Configuration] | No | | The health checks to perform |

# TCP Check Configuration

| Name | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `name` | string | Yes | | The name for the check, as it will appear in the generated metrics |
| `check` | string | Yes | | Set to tcp |
| `success_threshold` | int | No | 1 | The number of consecutive successful checks before the endpoint is considered healthy |
| `failure_threshold` | int | No | 1 | The number of consecutive failed checks before the endpoint is considered unhealthy |
| `interval` | duration string | Yes | | How often to run the health check |
| `timeout` | duration string | No | `10s` | How long to wait for the check to complete |
| `secure` | boolean | No | false | Use TLS |
| `tls_skip_verify` | boolean | No | false | Do not perform TLS certificate validation |
| `port` | int | No | | TCP port to check |
| `send` | string | No | | Optional text or binary payload to send to the endpoint |
| `receive` | string | No | | If configured, the endpoint response must match this binary or text value |
| `targets` | list[TCP Check Target Configuration] | No | | List of endpoints to check |
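
To show how `send` and `receive` fit together in a TCP payload check, here is a minimal sketch built on Python's standard `socket` module. It is not the plugin's code: the `tcp_check` function is hypothetical, TLS (`secure`) is left out, and it assumes the response is compared byte-for-byte against the first `len(receive)` bytes, which the source does not specify.

```python
import socket


def tcp_check(host, port, timeout=10.0, send=None, receive=None):
    """Return True if the connect (and optional payload exchange) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            if send is not None:
                conn.sendall(send)
            if receive is None:
                return True  # connect-only (or send-only) check
            data = b""
            while len(data) < len(receive):
                chunk = conn.recv(len(receive) - len(data))
                if not chunk:
                    break  # peer closed before sending enough bytes
                data += chunk
            # Assumption: exact match on the leading bytes of the response.
            return data == receive
    except OSError:
        return False  # refused, unreachable, or timed out
```

For example, `tcp_check("prod-db1.intl.net", 5432, send=bytes.fromhex("0000000804d2162f"), receive=b"S")` would mirror the postgres check from the TLDR section.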

# TCP Check Target Configuration

Any parameters left unconfigured use the settings from the parent TCP Check Configuration.

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| `host` | string | Yes | The host to check |
| `port` | int | No | TCP port to check |
| `success_threshold` | int | No | The number of consecutive successful checks before the endpoint is considered healthy |
| `failure_threshold` | int | No | The number of consecutive failed checks before the endpoint is considered unhealthy |
| `timeout` | duration string | No | How long to wait for the check to complete |
| `secure` | boolean | No | Use TLS |
| `tls_skip_verify` | boolean | No | Do not perform TLS certificate validation |
| `send` | string | No | Optional text or binary payload to send to the endpoint |
| `receive` | string | No | If configured, the endpoint response must match this binary or text value |
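
The inheritance between a check and its targets can be pictured as a simple dictionary overlay. The sketch below is only an illustration under that assumption; `resolve_target` is a hypothetical helper, and the plugin may resolve settings differently.

```python
def resolve_target(check: dict, target: dict) -> dict:
    """Overlay a target's explicit settings on top of its parent check."""
    # The check's own name and target list are not per-target settings.
    inherited = {k: v for k, v in check.items() if k not in ("name", "targets")}
    return {**inherited, **target}


# Values taken from the "host connectivity" check in the TLDR section.
check = {
    "name": "host connectivity",
    "check": "tcp",
    "interval": "1m",
    "timeout": "10s",
    "port": 22,
    "targets": [
        {"host": "host1.intl.net"},
        {"host": "host2.intl.net", "port": 443, "timeout": "30s"},
    ],
}

for target in check["targets"]:
    print(resolve_target(check, target))
```

Here host1.intl.net keeps port 22 and the 10s timeout from the parent, while host2.intl.net overrides both.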

# UDP Check Configuration

| Name | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `name` | string | Yes | | The name for the check, as it will appear in the generated metrics |
| `check` | string | Yes | | Set to udp |
| `success_threshold` | int | No | 1 | The number of consecutive successful checks before the endpoint is considered healthy |
| `failure_threshold` | int | No | 1 | The number of consecutive failed checks before the endpoint is considered unhealthy |
| `interval` | duration string | Yes | | How often to run the health check |
| `timeout` | duration string | No | `5s` | How long to wait for the check to complete |
| `port` | int | No | | UDP port to check |
| `send` | string | No | | Optional text or binary payload to send to the endpoint |
| `receive` | string | No | | If configured, the endpoint response must match this binary or text value |
| `targets` | list[UDP Check Target Configuration] | No | | List of endpoints to check |

# UDP Check Target Configuration

Any parameters left unconfigured use the settings from the parent UDP Check Configuration.

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| `host` | string | Yes | The host to check |
| `port` | int | No | UDP port to check |
| `success_threshold` | int | No | The number of consecutive successful checks before the endpoint is considered healthy |
| `failure_threshold` | int | No | The number of consecutive failed checks before the endpoint is considered unhealthy |
| `timeout` | duration string | No | How long to wait for the check to complete |
| `send` | string | No | Optional text or binary payload to send to the endpoint |
| `receive` | string | No | If configured, the endpoint response must match this binary or text value |

# ICMP Check Configuration

| Name | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `name` | string | Yes | | The name for the check, as it will appear in the generated metrics |
| `check` | string | Yes | | Set to icmp |
| `success_threshold` | int | No | 1 | The number of consecutive successful checks before the endpoint is considered healthy |
| `failure_threshold` | int | No | 1 | The number of consecutive failed checks before the endpoint is considered unhealthy |
| `interval` | duration string | Yes | | How often to run the health check |
| `timeout` | duration string | No | `1s` | How long to wait for the check to complete |
| `privileged` | boolean | No | false | Use privileged ICMP |
| `targets` | list[ICMP Check Target Configuration] | No | | List of endpoints to check |

# ICMP Check Target Configuration

Any parameters left unconfigured use the settings from the parent ICMP Check Configuration.

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| `host` | string | Yes | The host to check |
| `success_threshold` | int | No | The number of consecutive successful checks before the endpoint is considered healthy |
| `failure_threshold` | int | No | The number of consecutive failed checks before the endpoint is considered unhealthy |
| `timeout` | duration string | No | How long to wait for the check to complete |
| `privileged` | boolean | No | Use privileged ICMP |

# HTTP Check Configuration

| Name | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `name` | string | Yes | | The name for the check, as it will appear in the generated metrics |
| `check` | string | Yes | | Set to http |
| `success_threshold` | int | No | 1 | The number of consecutive successful checks before the endpoint is considered healthy |
| `failure_threshold` | int | No | 1 | The number of consecutive failed checks before the endpoint is considered unhealthy |
| `interval` | duration string | Yes | | How often to run the health check |
| `timeout` | duration string | No | `10s` | How long to wait for the check to complete |
| `secure` | boolean | No | false | Use TLS |
| `tls_skip_verify` | boolean | No | false | Do not perform TLS certificate validation |
| `method` | string | No | GET | One of "OPTIONS", "GET", "HEAD", "POST", "PUT", "DELETE" |
| `headers` | map[string]string | No | | HTTP headers to pass to requests |
| `response` | string | No | | If set, the HTTP response body is expected to match this value |
| `expected_statuses` | string | No | 200 | Expected response status ranges. A status range defines the start and end of an HTTP status code range using half-open interval semantics [start, end). The start and end of a range are separated with '-'; the end is optional. Examples: 200-400 expects status codes 200 through 399; 401 expects only status code 401 |
| `proxy` | string | No | | Optional HTTP proxy URL to use |
| `targets` | list[HTTP Check Target Configuration] | No | | List of endpoints to check |
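
The half-open semantics of `expected_statuses` can be made concrete with a short sketch. The function names are hypothetical, only a single range is handled because the source does not say how several ranges are combined, and an omitted end is assumed to mean "just this one status code", matching the 401 example.

```python
def parse_status_range(spec: str) -> range:
    """Turn '200-400' into range(200, 400) and '401' into range(401, 402)."""
    start, sep, end = spec.strip().partition("-")
    if sep and end:
        return range(int(start), int(end))  # half-open: end is excluded
    # Assumption: no end given means exactly one status code.
    return range(int(start), int(start) + 1)


def status_expected(status: int, spec: str) -> bool:
    """True if the HTTP status code falls inside the configured range."""
    return status in parse_status_range(spec)


print(status_expected(399, "200-400"))  # True
print(status_expected(400, "200-400"))  # False
print(status_expected(401, "401"))      # True
```

Under this reading, the `200-500` range used for the ERP cluster check in the TLDR section accepts 499 but rejects 500.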

# HTTP Check Target Configuration

Any parameters left unconfigured use the settings from the parent HTTP Check Configuration.

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| `url` | string | Yes | The HTTP URL to hit. Supports encrypted secrets |
| `success_threshold` | int | No | The number of consecutive successful checks before the endpoint is considered healthy |
| `failure_threshold` | int | No | The number of consecutive failed checks before the endpoint is considered unhealthy |
| `interval` | duration string | Yes | How often to run the health check |
| `timeout` | duration string | No | How long to wait for the check to complete |
| `secure` | boolean | No | Use TLS |
| `tls_skip_verify` | boolean | No | Do not perform TLS certificate validation |
| `method` | string | No | One of "OPTIONS", "GET", "HEAD", "POST", "PUT", "DELETE" |
| `headers` | map[string]string | No | HTTP headers to pass to requests |
| `response` | string | No | If set, the HTTP response body is expected to match this value |
| `expected_statuses` | string | No | Expected response status ranges. A status range defines the start and end of an HTTP status code range using half-open interval semantics [start, end). The start and end of a range are separated with '-'; the end is optional. Examples: 200-400 expects status codes 200 through 399; 401 expects only status code 401 |
| `proxy` | string | No | Optional HTTP proxy URL to use |
