Network endpoint health checks
- Last Updated 3/31/2023, 12:34:01 PM UTC
Plugin info
name: net-healthcheck
Performs active health checks against one or more network endpoints
- Supported health checks are:
  - TCP
  - UDP
  - ICMP
  - HTTP
- All checks are performed concurrently
- An endpoint can be in one of the following states:
  - UNKNOWN, when the system does not have enough information to deduce state
  - HEALTHY
  - UNHEALTHY
- An endpoint starts in UNKNOWN state and then transitions between HEALTHY and UNHEALTHY:
  - UNKNOWN -> HEALTHY, after success_threshold number of consecutive successful checks
  - UNKNOWN -> UNHEALTHY, after failure_threshold number of consecutive failed checks
  - HEALTHY -> UNHEALTHY, after failure_threshold number of consecutive failed checks
  - UNHEALTHY -> HEALTHY, after success_threshold number of consecutive successful checks
- TCP and UDP health checks support connect-only checks and payload checks. With a payload check, the plugin connects to the endpoint, sends a text or binary payload, and optionally receives a response from the endpoint. The response can then be compared against an expected text or binary value.
- HTTP health checks invoke URL endpoints and check the response HTTP status code and optionally the response body against expected values. Redirects are never followed.
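The state transitions described above can be sketched as a small counter-based state machine. This is a minimal illustration, not the plugin's implementation: the `EndpointState` class and its `record` method are hypothetical names, and it assumes that a failure resets the run of consecutive successes (and vice versa).

```python
from dataclasses import dataclass

UNKNOWN, HEALTHY, UNHEALTHY = "UNKNOWN", "HEALTHY", "UNHEALTHY"


@dataclass
class EndpointState:
    """Hypothetical tracker for one endpoint; not part of the plugin."""
    success_threshold: int = 1
    failure_threshold: int = 1
    status: str = UNKNOWN
    successes: int = 0  # current run of consecutive successful checks
    failures: int = 0   # current run of consecutive failed checks

    def record(self, ok: bool) -> str:
        """Feed one check result and return the resulting status."""
        if ok:
            self.successes += 1
            self.failures = 0  # assumption: a success breaks the failure run
            if self.successes >= self.success_threshold:
                self.status = HEALTHY
        else:
            self.failures += 1
            self.successes = 0  # assumption: a failure breaks the success run
            if self.failures >= self.failure_threshold:
                self.status = UNHEALTHY
        return self.status


state = EndpointState(success_threshold=3, failure_threshold=2)
results = (True, True, False, True, True, True, False, False)
history = [state.record(ok) for ok in results]
print(history)
```

With these thresholds the endpoint stays UNKNOWN until the third consecutive success, and a single failure while HEALTHY does not change the status.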
# Prerequisites
ICMP checks on Linux can be privileged or unprivileged (default). Privileged checks require the plugin to run as root (not recommended).
Unprivileged checks require the plugin to run with a gid that falls within the range specified in the net.ipv4.ping_group_range kernel parameter.
Make sure that you restrict the net.ipv4.ping_group_range range to the minimum possible:
$ # find your group ids
$ id polaris
uid=1001(polaris) gid=1001(polaris) groups=991(docker),1001(polaris),8001(pingers)
$ # check current value of net.ipv4.ping_group_range (default is none)
$ sysctl net.ipv4.ping_group_range
net.ipv4.ping_group_range = 1 0
$ # modify to allow polaris user unprivileged pings
$ sysctl net.ipv4.ping_group_range='8001 8001'
$ # persist changes
$ cat << 'EOF' >> /etc/sysctl.d/99-allow-ping.conf
net.ipv4.ping_group_range=8001 8001
EOF
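To verify the setting, it can help to check which of a user's group ids fall inside the configured range. The sketch below is illustrative only: the helper names are hypothetical, and it relies on the kernel treating the range as inclusive at both ends, so `1 0` matches no group.

```python
from typing import List, Tuple


def parse_ping_group_range(text: str) -> Tuple[int, int]:
    """Parse the sysctl value, e.g. "1 0" or "8001 8001" (whitespace separated)."""
    low, high = text.split()
    return int(low), int(high)


def gids_allowed_to_ping(range_text: str, gids: List[int]) -> List[int]:
    """Return the gids that fall inside the inclusive [low, high] range."""
    low, high = parse_ping_group_range(range_text)
    return [gid for gid in gids if low <= gid <= high]


# Group ids of the polaris user from the `id polaris` output above.
polaris_gids = [991, 1001, 8001]

before = gids_allowed_to_ping("1 0", polaris_gids)        # empty range
after = gids_allowed_to_ping("8001 8001", polaris_gids)   # only "pingers"
print(before, after)
```

On a live host the current value can be read from /proc/sys/net/ipv4/ping_group_range and passed to `parse_ping_group_range`.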
# Events
None
# Metrics
net/healthcheck/status, the current status for a health check. One of UNKNOWN, HEALTHY, UNHEALTHY. Each health check status is identified by
- name, the health check name
- origin, the hostname that the health check was originated from
- source, the health check target depending on the check type
  - host:port for tcp, udp
  - host for icmp
  - url for http
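As a rough illustration of how the source label varies by check type, the mapping above can be written as a small function. The function name and its parameters are hypothetical and only mirror the rules listed here.

```python
from typing import Optional


def metric_source(check: str, host: Optional[str] = None,
                  port: Optional[int] = None, url: Optional[str] = None) -> str:
    """Build the `source` label for a target, following the rules listed above."""
    if check in ("tcp", "udp"):
        return f"{host}:{port}"
    if check == "icmp":
        return str(host)
    if check == "http":
        return str(url)
    raise ValueError(f"unsupported check type: {check!r}")


print(metric_source("tcp", host="host1.intl.net", port=22))
print(metric_source("icmp", host="host1.intl.net"))
print(metric_source("http", url="https://erp-1.int.net/health"))
```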
# TLDR
# list of named checks
checks:
  # checks host connectivity using a connect only tcp check
  - name: host connectivity
    check: tcp  # using tcp
    interval: 1m  # check every minute
    timeout: 10s  # bail out after 10s if no response
    failure_threshold: 5  # set status to UNHEALTHY after 5 consecutive failures
    success_threshold: 3  # set status to HEALTHY after 3 consecutive successes
    port: 22  # default port to connect to
    targets:  # the hosts to check
      - host: host1.intl.net
      - host: host2.intl.net
        port: 443  # hit port 443 instead of 22
        timeout: 30s  # set timeout to 30s instead of 10s
  # ping check a bunch of hosts
  - name: host reachability
    check: icmp  # using ICMP
    privileged: false  # unprivileged icmp
    interval: 1m30s  # every 90s
    timeout: 1s  # with timeout 1s
    targets:  # the hosts to ping
      - host: host1.intl.net
      - host: host2.intl.net
  # checks postgres connection startup without server starting a worker process
  - name: postgres connection
    check: tcp  # using tcp
    interval: 1m  # at 1m intervals
    timeout: 10s  # with timeout 10s
    failure_threshold: 3  # UNHEALTHY after 3 consecutive failures
    success_threshold: 1  # HEALTHY after 1 consecutive success
    port: 5432  # hit port 5432
    send: !!binary |  # send SSL negotiation startup packet - 0000000804d2162f
      AAAACATSFi8=
    receive: S  # must receive "S"
    targets:  # the pg servers to check
      - host: prod-db1.intl.net
      - host: prod-db2.intl.net
      - host: prod-db3.intl.net
  # checks connection to DNS servers
  - name: DNS connectivity
    check: udp  # using udp
    interval: 60s  # at 60s intervals
    timeout: 5s  # with 5s timeout
    failure_threshold: 5  # UNHEALTHY after 5 consecutive failures
    success_threshold: 2  # HEALTHY after 2 consecutive successes
    port: 53  # on port 53
    targets:  # the DNS servers to check
      - host: dns1.intl.net
      - host: dns2.intl.net
      - host: dns3.intl.net
  # checks connection to ERP cluster using HTTP
  - name: ERP Cluster
    check: http  # using http
    interval: 30s  # at 30s intervals
    timeout: 5s  # with 5s timeout
    failure_threshold: 10  # UNHEALTHY after 10 consecutive failures
    success_threshold: 3  # HEALTHY after 3 consecutive successes
    method: GET  # use GET requests
    headers:
      x-something: x-something-value  # add some required headers
    expected_statuses: 200-500  # must respond with HTTP 200 through to 499
    tls_skip_verify: true  # do not validate SSL certs
    targets:  # the ERP cluster members
      - url: https://erp-1.int.net/health
      - url: https://erp-2.int.net/health
      - url: https://erp-3.int.net/health
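The binary payload in the postgres check is easier to read once decoded. The short sketch below decodes the base64 value from the example and unpacks it as two big-endian 32-bit integers, which is the layout of a PostgreSQL SSLRequest message (length 8, request code 80877103). A server answers that request with a single byte, `S` or `N`, which is why the check expects `S`.

```python
import base64
import struct

# The base64 value from the `send` field of the postgres check above.
payload = base64.b64decode("AAAACATSFi8=")

# Two network-order (big-endian) unsigned 32-bit integers:
# message length, then the SSL request code.
length, code = struct.unpack("!II", payload)

print(payload.hex())       # 0000000804d2162f
print(length, code)        # 8 80877103
print(code >> 16, code & 0xFFFF)  # 1234 5679
```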
# Configuration
The plugin is configured with a list of named checks. Each check can target one or more endpoints.
Name | Type | Required | Default | Description |
---|---|---|---|---|
checks | list[TCP\|UDP\|ICMP\|HTTP Check Configuration] | No | | The health checks to perform |
# TCP Check Configuration
Name | Type | Required | Default | Description |
---|---|---|---|---|
name | string | Yes | | The name for the check as it will appear in the generated metrics |
check | string | Yes | | Set to tcp |
success_threshold | int | No | 1 | The number of consecutive successful checks before the endpoint is considered healthy |
failure_threshold | int | No | 1 | The number of consecutive failed checks before the endpoint is considered unhealthy |
interval | duration string | Yes | | How often to run the health check |
timeout | duration string | No | 10s | How long to wait for check to complete |
secure | boolean | No | false | Use TLS |
tls_skip_verify | boolean | No | false | Do not perform TLS certificate validation |
port | int | No | | TCP port to check |
send | string | No | | Optional text or binary payload to send to the endpoint |
receive | string | No | | If configured, the endpoint response must match this binary or text value |
targets | list[TCP Check Target Configuration] | No | | List of endpoints to check |
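A payload check built from `send` and `receive` can be sketched with plain sockets. This is an illustration rather than the plugin's code: `tcp_check` is a hypothetical helper, it assumes the response must equal the expected bytes exactly, and the timeout applies to each socket operation rather than to the check as a whole. The small local server exists only to make the example runnable.

```python
import socket
import socketserver
import threading
from typing import Optional


def tcp_check(host: str, port: int, timeout: float = 10.0,
              send: Optional[bytes] = None, receive: Optional[bytes] = None) -> bool:
    """Connect-only check when send/receive are None, payload check otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            if send is not None:
                conn.sendall(send)
            if receive is None:
                return True  # connecting (and sending) succeeded
            buf = b""
            while len(buf) < len(receive):
                chunk = conn.recv(len(receive) - len(buf))
                if not chunk:  # peer closed before sending enough bytes
                    break
                buf += chunk
            return buf == receive  # assumption: exact match is required
    except OSError:  # refused, unreachable, or timed out
        return False


class ReplyS(socketserver.BaseRequestHandler):
    """Demo server: answers any payload with a single "S" byte."""
    def handle(self):
        if self.request.recv(8):
            self.request.sendall(b"S")


server = socketserver.TCPServer(("127.0.0.1", 0), ReplyS)
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

probe = bytes.fromhex("0000000804d2162f")
ok_connect = tcp_check(host, port, timeout=2)
ok_payload = tcp_check(host, port, timeout=2, send=probe, receive=b"S")
bad_payload = tcp_check(host, port, timeout=2, send=probe, receive=b"N")
print(ok_connect, ok_payload, bad_payload)

server.shutdown()
server.server_close()
```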
# TCP Check Target Configuration
Any parameter not configured on a target uses the setting from the parent TCP Check Configuration.
Name | Type | Required | Description |
---|---|---|---|
host | string | Yes | The host to check |
port | int | No | TCP port to check |
success_threshold | int | No | The number of consecutive successful checks before the endpoint is considered healthy |
failure_threshold | int | No | The number of consecutive failed checks before the endpoint is considered unhealthy |
timeout | duration string | No | How long to wait for check to complete |
secure | boolean | No | Use TLS |
tls_skip_verify | boolean | No | Do not perform TLS certificate validation |
send | string | No | Optional text or binary payload to send to the endpoint |
receive | string | No | If configured, the endpoint response must match this binary or text value |
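The inheritance between a check and its targets can be pictured as a dictionary merge in which target values win. The sketch below is hypothetical (the `effective_settings` helper and the `OVERRIDABLE` tuple are illustrative names); the list of keys is taken from the target table above, and the sample data is the host connectivity check from the TLDR section.

```python
# Parameters a TCP target may override, per the table above.
OVERRIDABLE = ("port", "success_threshold", "failure_threshold", "timeout",
               "secure", "tls_skip_verify", "send", "receive")


def effective_settings(check: dict, target: dict) -> dict:
    """Start from the check-level values, then apply the target's own values."""
    merged = {key: check[key] for key in OVERRIDABLE if key in check}
    merged.update(target)
    return merged


check = {
    "name": "host connectivity",
    "check": "tcp",
    "interval": "1m",
    "timeout": "10s",
    "failure_threshold": 5,
    "success_threshold": 3,
    "port": 22,
    "targets": [
        {"host": "host1.intl.net"},
        {"host": "host2.intl.net", "port": 443, "timeout": "30s"},
    ],
}

resolved = [effective_settings(check, target) for target in check["targets"]]
for settings in resolved:
    print(settings["host"], settings["port"], settings["timeout"])
```

Here host1.intl.net is checked on port 22 with a 10s timeout, while host2.intl.net uses its own port 443 and 30s timeout but still inherits both thresholds.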
# UDP Check Configuration
Name | Type | Required | Default | Description |
---|---|---|---|---|
name | string | Yes | | The name for the check as it will appear in the generated metrics |
check | string | Yes | | Set to udp |
success_threshold | int | No | 1 | The number of consecutive successful checks before the endpoint is considered healthy |
failure_threshold | int | No | 1 | The number of consecutive failed checks before the endpoint is considered unhealthy |
interval | duration string | Yes | | How often to run the health check |
timeout | duration string | No | 5s | How long to wait for check to complete |
port | int | No | | UDP port to check |
send | string | No | | Optional text or binary payload to send to the endpoint |
receive | string | No | | If configured, the endpoint response must match this binary or text value |
targets | list[UDP Check Target Configuration] | No | | List of endpoints to check |
# UDP Check Target Configuration
Any parameter not configured on a target uses the setting from the parent UDP Check Configuration.
Name | Type | Required | Description |
---|---|---|---|
host | string | Yes | The host to check |
port | int | No | UDP port to check |
success_threshold | int | No | The number of consecutive successful checks before the endpoint is considered healthy |
failure_threshold | int | No | The number of consecutive failed checks before the endpoint is considered unhealthy |
timeout | duration string | No | How long to wait for check to complete |
send | string | No | optional text or binary payload to send to endpoint |
receive | string | No | if configured endpoint response must match this binary or text response |
# ICMP Check Configuration
Name | Type | Required | Default | Description |
---|---|---|---|---|
name | string | Yes | | The name for the check as it will appear in the generated metrics |
check | string | Yes | | Set to icmp |
success_threshold | int | No | 1 | The number of consecutive successful checks before the endpoint is considered healthy |
failure_threshold | int | No | 1 | The number of consecutive failed checks before the endpoint is considered unhealthy |
interval | duration string | Yes | | How often to run the health check |
timeout | duration string | No | 1s | How long to wait for check to complete |
privileged | boolean | No | false | Use privileged ICMP |
targets | list[ICMP Check Target Configuration] | No | | List of endpoints to check |
# ICMP Check Target Configuration
Any parameter not configured on a target uses the setting from the parent ICMP Check Configuration.
Name | Type | Required | Description |
---|---|---|---|
host | string | Yes | The host to check |
success_threshold | int | No | The number of consecutive successful checks before the endpoint is considered healthy |
failure_threshold | int | No | The number of consecutive failed checks before the endpoint is considered unhealthy |
timeout | duration string | No | How long to wait for check to complete |
privileged | boolean | No | Use privileged ICMP |
# HTTP Check Configuration
Name | Type | Required | Default | Description |
---|---|---|---|---|
name | string | Yes | | The name for the check as it will appear in the generated metrics |
check | string | Yes | | Set to http |
success_threshold | int | No | 1 | The number of consecutive successful checks before the endpoint is considered healthy |
failure_threshold | int | No | 1 | The number of consecutive failed checks before the endpoint is considered unhealthy |
interval | duration string | Yes | | How often to run the health check |
timeout | duration string | No | 10s | How long to wait for check to complete |
secure | boolean | No | false | Use TLS |
tls_skip_verify | boolean | No | false | Do not perform TLS certificate validation |
method | string | No | GET | One of "OPTIONS", "GET", "HEAD", "POST", "PUT", "DELETE" |
headers | map[string]string | No | | HTTP headers to pass to requests |
response | string | No | | If set, the HTTP response body is expected to match this value |
expected_statuses | string | No | 200 | Expected response status ranges. A status range defines the start and end of an HTTP status code range using half-open interval semantics [start, end). The start and end of a range are separated with '-'. The end of the range is optional. Example: 200-400 to expect status codes 200 through to 399, 401 to expect status code 401 |
proxy | string | No | | Optional HTTP proxy URL to use |
targets | list[HTTP Check Target Configuration] | No | | List of endpoints to check |
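The half-open semantics of expected_statuses are easy to get wrong by one, so a small sketch may help. It parses a single range such as `200-400` or `401` into a Python `range`, which is itself half-open. The `parse_status_range` helper is hypothetical, and because the source does not say how multiple ranges are separated, only one range is handled.

```python
def parse_status_range(spec: str) -> range:
    """Parse "start-end" as [start, end), or a lone "start" as that single code."""
    start, sep, end = spec.strip().partition("-")
    if sep and not end:
        raise ValueError(f"missing end of range in {spec!r}")
    low = int(start)
    high = int(end) if end else low + 1  # a lone code matches only itself
    if high <= low:
        raise ValueError(f"empty status range: {spec!r}")
    return range(low, high)


ok_codes = parse_status_range("200-400")
print(200 in ok_codes, 399 in ok_codes, 400 in ok_codes)  # True True False

only_401 = parse_status_range("401")
print(401 in only_401, 402 in only_401)  # True False
```

Applied to the TLDR example, `200-500` accepts every status from 200 through 499 and rejects 500.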
# HTTP Check Target Configuration
Any parameter not configured on a target uses the setting from the parent HTTP Check Configuration.
Name | Type | Required | Description |
---|---|---|---|
url | string | Yes | the HTTP url to hit. Supports encrypted secrets |
success_threshold | int | No | The number of consecutive successful checks before the endpoint is considered healthy |
failure_threshold | int | No | The number of consecutive failed checks before the endpoint is considered unhealthy |
interval | duration string | Yes | How often to run the health check |
timeout | duration string | No | How long to wait for check to complete |
secure | boolean | No | Use TLS |
tls_skip_verify | boolean | No | Do not perform TLS certificate validation |
method | string | No | One of "OPTIONS", "GET", "HEAD", "POST", "PUT", "DELETE" |
headers | map[string]string | No | HTTP headers to pass to requests |
response | string | No | If set, the HTTP response body is expected to match this value |
expected_statuses | string | No | Expected response status ranges. A status range defines the start and end of an HTTP status code range using half-open interval semantics [start, end). The start and end of a range are separated with '-'. The end of the range is optional. Example: 200-400 to expect status codes 200 through to 399, 401 to expect status code 401 |
proxy | string | No | Optional HTTP proxy URL to use |