In this lesson we will learn how to create, configure, and schedule checks, and how to select which hosts to run them on, using subscriptions. You will learn more detail about how the backend and agents communicate, and how to integrate existing monitoring plugins. This lesson is intended for operators of Sensu, and assumes you have set up a local workshop environment.
In Sensu, checks are monitoring jobs that are managed by the Sensu control plane and executed by Sensu agents. A Sensu check is a modern take on the traditional “service check” performed by a monitoring platform to determine the status of a system or service.
---
type: CheckConfig
api_version: core/v2
metadata:
  name: node_exporter
spec:
  command: wget -q -O- http://127.0.0.1:{{ .labels.node_exporter_port | default "9100" }}/metrics
  runtime_assets: []
  publish: true
  interval: 30
  subscriptions:
    - linux
  timeout: 10
  ttl: 0
  output_metric_format: prometheus_text
  output_metric_handlers:
    - elasticsearch
  output_metric_tags:
    - name: entity
      value: "{{ .name }}"
    - name: region
      value: '{{ .labels.region | default "unknown" }}'
Although service checks were originally popularized by Nagios (circa 1999-2002), they continue to fill a critical role in the modern era of cloud computing. Sensu orchestrates service checks in a similar manner to cloud-native platforms like Kubernetes and Prometheus, which use “Jobs” as a central concept for scheduling and running tasks. Where Prometheus jobs are limited to HTTP GET requests, a Sensu check is a significantly more flexible tool.
A service check can be any program that satisfies the following requirements:

- Output (e.g. a human-readable message or metrics) is printed to STDOUT.
- The exit status code is expected to follow these conventions:
  - 0 indicates OK
  - 1 indicates WARNING
  - 2 indicates CRITICAL
  - any exit status other than 0, 1, and 2 indicates an UNKNOWN or custom status

That’s the entire specification (more or less)! Service checks remain very useful because their simple specification makes it easy to extend monitoring to any area. Service checks can be written in any programming or scripting language, including Bash, PowerShell, and MS-DOS scripts.
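To make the contract concrete, here is a minimal sketch of a check that wraps a one-line shell script in a Sensu CheckConfig. The name, command, and thresholds are purely illustrative, and the command assumes a Unix-like agent host:

---
type: CheckConfig
api_version: core/v2
metadata:
  name: example-exit-status   # illustrative name
spec:
  # Print a human-readable message to STDOUT, then exit 0/1/2 based on
  # root volume usage; that is all a service check has to do.
  command: >-
    usage=$(df -h / | awk 'NR==2 {print $5+0}');
    echo "Root volume is at ${usage}% capacity";
    if [ "$usage" -ge 90 ]; then exit 2;
    elif [ "$usage" -ge 80 ]; then exit 1;
    else exit 0; fi
  publish: true
  interval: 60
  subscriptions:
    - system/linux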
In Sensu, subscriptions are equivalent to topics in a traditional pub/sub model: the backend is the publisher and the agents are the subscribers.
Checks are scheduled at pre-set intervals. The backend automatically publishes a check request, and agents that are subscribed to the corresponding topic receive the request. The agent then performs the check and sends the resulting event data to the backend for processing via the observability pipeline.
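For context, agents declare the topics they listen on in their own configuration. A minimal, illustrative agent configuration fragment (the names here are made up for this example):

# This agent will receive check requests published to the "linux" and
# "webserver" subscriptions (plus its own entity:<name> subscription).
name: web-01
subscriptions:
  - linux
  - webserver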
Check scheduling is configured using the following attributes:

- publish: enables or disables scheduling
- interval or cron: the schedule on which to execute the check (a fixed interval in seconds, or a cron expression)
- subscriptions: the subscriptions to publish check requests to
- round_robin: limits check scheduling to one execution per request (useful for configuring pollers when there are multiple agent members in a given subscription)
- timeout: how much time, in seconds, to allow a check to run before terminating it
You have a collection of servers and you want to start monitoring their disk usage. You want this to run on all Mac, Linux, and Windows hosts that Sensu is aware of.
To accomplish this, we can configure a check that will periodically pull the disk usage metrics from your servers. We will use the check-disk-usage plugin available in Bonsai and configure it to run every 30 seconds. The agents are already configured to subscribe to system/macos, system/windows, and system/linux requests, so we can start running this check right away.
Configure a Check to Monitor Disk Usage.
Copy and paste the following contents to a file named disk.yaml:
---
type: CheckConfig
api_version: core/v2
metadata:
  name: disk-usage
spec:
  command: check-disk-usage --warning 80.0 --critical 90.0
  runtime_assets:
    - sensu/check-disk-usage:0.6.0
  publish: true
  interval: 30
  subscriptions:
    - system/macos
    - system/macos/disk
    - system/windows
    - system/windows/disk
    - system/linux
    - system/linux/disk
  timeout: 10
  check_hooks: []
  handlers:
    - mattermost
Notice the values of subscriptions and interval – these instruct Sensu to publish check requests every 30 seconds to any agent with the system/macos, system/windows, or system/linux subscriptions. Agents opt in to checks via their subscriptions configuration.
Create the Check Using the sensuctl create Command.
sensuctl create -f disk.yaml
Verify that the check was successfully created using the sensuctl check list command:
sensuctl check list
Example Output:
Name Command Interval Cron Timeout TTL Subscriptions Handlers Assets Hooks Publish? Stdin? Metric Format Metric Handlers
────── ───────────────────────────────────────────────── ────────── ────── ───────── ───── ────────────────────────────────────────────────────────────────────────────────────────────────── ────────── ────────────────────────────── ─────── ────────── ─────────────────────── ─────────────────
disk-usage check-disk-usage --warning 80.0 --critical 90.0 30 10 0 system/macos,system/macos/disk,system/windows,system/windows/disk,system/linux,system/linux/disk sensu/check-disk-usage:0.6.0 true false
NEXT: Do you see the disk-usage check in the output? If so, you’re ready to move on to the next exercise!
Sensu’s pub/sub configuration model makes monitoring configuration easier to manage at scale. A single check definition can be used to collect monitoring data from hundreds or thousands of endpoints!
However, there are often cases when you need to override check configurations on a per-endpoint basis. For these situations, Sensu provides a templating feature called Tokens.
Checks can be templated using placeholders called tokens which are replaced with entity information before the job is executed.
Tokens are references to entity attributes and metadata, wrapped in double curly braces ({{ }}).
Default values can also be provided as a fallback for unmatched tokens.
Examples:

- {{ .name }}: replaced by the target entity’s name
- {{ .labels.url }}: replaced by the target entity’s “url” label
- {{ .labels.disk_warning | default "85.0" }}: replaced by the target entity’s “disk_warning” label; if the label is not set, the default/fallback value of 85.0 is used

Tokens can be used to configure dynamic monitoring jobs (e.g. enabling node-based configuration overrides for things like alerting thresholds).
Let’s modify our check from the previous exercise using some tokens.
You’ve noticed that “one size fits all” is not true for your infrastructure. While the default disk check values are working for most of the servers in your system, certain servers are more sensitive to disk space usage. You want to use different warning/critical thresholds for them.
This can be accomplished using entity annotations, tokens, and a templated check. First, by setting annotations on the entity we can configure a value that is unique to that entity. Then, we can use a token to read that value when the check is executed, and give the check executable a different configuration.
Update the disk-usage Check Configuration.
Modify disk.yaml with the following contents:
---
type: CheckConfig
api_version: core/v2
metadata:
  name: disk-usage
spec:
  command: >-
    check-disk-usage
    --warning {{ .annotations.disk_usage_warning_threshold | default "80.0" }}
    --critical {{ .annotations.disk_usage_critical_threshold | default "90.0" }}
  runtime_assets:
    - sensu/check-disk-usage:0.6.0
  publish: true
  interval: 30
  subscriptions:
    - system/macos
    - system/macos/disk
    - system/windows
    - system/windows/disk
    - system/linux
    - system/linux/disk
  timeout: 10
  check_hooks: []
  handlers:
    - mattermost
This configuration makes the disk usage warning and critical thresholds configurable via entity annotations (disk_usage_warning_threshold and disk_usage_critical_threshold). We also provided default values, which are used if the annotation is not set.
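To make the substitution concrete, here is roughly how the command renders for two hypothetical entities (the values are illustrative):

For an entity with disk_usage_warning_threshold: "50.0" and disk_usage_critical_threshold: "70.0" annotations:
check-disk-usage --warning 50.0 --critical 70.0

For an entity with no matching annotations (the defaults apply):
check-disk-usage --warning 80.0 --critical 90.0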
Update the Check Using sensuctl create.
sensuctl create -f disk.yaml
Verify that the check was successfully created:
sensuctl check info disk-usage --format yaml
Update the Agent Configuration to Add Annotations.
There are two ways to add the needed annotations, either by updating the entity configuration or the agent configuration. In this exercise we will update the agent configuration.
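For reference, the entity-configuration route would look roughly like the following, a hedged sketch that assumes an agent entity named workshop in your current namespace (in practice you would edit the existing entity, for example with sensuctl edit entity workshop, rather than create a minimal one from scratch). Note that an agent configured with agent-managed-entity: true, as we do below, owns its own entity definition, so backend-side edits like this would not stick, which is another reason this exercise updates the agent configuration instead:

---
type: Entity
api_version: core/v2
metadata:
  name: workshop
  annotations:
    disk_usage_warning_threshold: "50.0"
    disk_usage_critical_threshold: "70.0"
spec:
  entity_class: agent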
Stop the Agent
If you started your agent in the previous exercise using the sensu-agent start command, you can stop the agent by pressing Control-C in your terminal.
Edit the Agent Configuration
In Lesson 7 we started an agent and created an agent.yaml file containing most of its configuration.
Add disk_usage_warning_threshold and disk_usage_critical_threshold to the annotations section:
---
backend_url: ws://127.0.0.1:8080
name: workshop
labels:
  foo: bar
  environment: training
annotations:
  disk_usage_warning_threshold: "50.0"
  disk_usage_critical_threshold: "70.0"
agent-managed-entity: true
deregister: true
Restart the Agent.
Let’s start the agent from the command line, this time using a mix of environment variables and our configuration file.
MacOS:
TMPDIR=/opt/sensu/tmp \
SENSU_SUBSCRIPTIONS="system/macos workshop" \
sudo -E sensu-agent start \
--config-file /opt/sensu/agent.yaml \
--cache-dir /opt/sensu/sensu-agent/cache \
--user ${SENSU_USER} \
--password ${SENSU_PASSWORD}
Windows (PowerShell):
${Env:SENSU_SUBSCRIPTIONS} = "system/windows workshop"
sensu-agent start `
  --config-file "${Env:UserProfile}\Sensu\agent.yaml" `
  --user ${Env:SENSU_USER} `
  --password ${Env:SENSU_PASSWORD}
Linux:
SENSU_SUBSCRIPTIONS="system/linux workshop" \
sudo -E -u sensu sensu-agent start \
--config-file "/etc/sensu/agent.yaml" \
--user ${SENSU_USER} \
--password ${SENSU_PASSWORD}
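Once the agent reconnects, you can confirm that the annotations made it onto the entity with sensuctl entity info workshop --format yaml. An abbreviated sketch of the fields to look for (your labels, subscriptions, and system details will differ):

type: Entity
api_version: core/v2
metadata:
  name: workshop
  annotations:
    disk_usage_warning_threshold: "50.0"
    disk_usage_critical_threshold: "70.0"
spec:
  entity_class: agent
  subscriptions:
    - system/linux
    - workshop
    - entity:workshop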
One common use case for checks is to collect system and service metrics (e.g. cpu, memory, or disk utilization; or api response times).
To learn more about Sensu’s metrics processing capabilities, please visit the Sensu Metrics reference documentation.
The agent provides built-in support for normalizing metrics generated by service checks in the following formats:

- prometheus_text: Prometheus exposition format
- influxdb_line: InfluxDB line protocol
- opentsdb_line: OpenTSDB line protocol
- graphite_plaintext: Graphite plaintext protocol
- nagios_perfdata: Nagios Performance Data

Configuring output_metric_format causes the agent to extract metrics at the edge, before sending event data to the observability pipeline, optimizing performance of the platform at scale.
NOTE: Sensu also provides support for collecting StatsD metrics; however, these are consumed via the StatsD API – not collected as output of a check.
Metrics extracted with output_metric_format can also be enriched using output_metric_tags.
Metric sources vary in verbosity. Some metric formats don’t support tags (e.g. Nagios Performance Data), and those that do might be implemented in ways that don’t provide enough contextual data.
In either case, Sensu’s output_metric_tags are great for enriching collected metrics using entity data/metadata.
Sensu breathes new life into legacy monitoring plugins and other metric sources that generate the raw data you care about, but which lack tags or other context to make sense of the data. Simply configure output_metric_tags and Sensu will add the tag data to the resulting metrics.
Example: Metrics Tags
output_metric_tags:
  - name: application
    value: "my-app"
  - name: entity
    value: "{{ .name }}"
  - name: region
    value: '{{ .labels.region | default "unknown" }}'
  - name: store_id
    value: 'store/{{ .labels.store_id | default "none" }}'
Metric tag values can be provided as static strings or as tokens, which can be used to generate dynamic tag values.
In addition to output_metric_format, Sensu checks also provide configuration for dedicated output_metric_handlers, which are event handlers that are specially optimized for processing metrics.
If an event containing metrics is configured with one or more output_metric_handlers, a copy of the event is forwarded to those metric handlers prior to Sensu’s own event persistence; this specialized handling is implemented as a performance optimization to prioritize metric processing.
NOTE: Checks may be configured with multiple handlers and output_metric_handlers, enabling service health checking, alerting, and metrics collection in a single check.
You have some existing monitoring plugins which provide output in Nagios format, which you want to store in a data platform like Sumo Logic or InfluxDB. You also want to capture some additional metadata about the server or service along with the metrics, and have it processed as a single unit by the pipeline.
This can be accomplished by declaring the metric format the check emits using the output_metric_format option (e.g. nagios_perfdata for Nagios-style plugins), and configuring a metrics-specific storage handler via output_metric_handlers.
We can also add additional metadata to the event using output_metric_tags.
Update the disk-usage Check Configuration.
Modify disk.yaml with the following contents (adding the output_metric_format, output_metric_handlers, and output_metric_tags fields):
---
type: CheckConfig
api_version: core/v2
metadata:
  name: disk-usage
spec:
  command: >-
    check-disk-usage
    --metrics
    --warning {{ .annotations.disk_usage_warning_threshold | default "80.0" }}
    --critical {{ .annotations.disk_usage_critical_threshold | default "90.0" }}
  runtime_assets:
    - sensu/check-disk-usage:0.6.0
  publish: true
  interval: 30
  subscriptions:
    - system/macos
    - system/macos/disk
    - system/windows
    - system/windows/disk
    - system/linux
    - system/linux/disk
  timeout: 10
  check_hooks: []
  handlers:
    - mattermost
  output_metric_format: prometheus_text
  output_metric_handlers:
    - sumologic-metrics
  output_metric_tags:
    - name: entity
      value: "{{ .name }}"
    - name: namespace
      value: "{{ .namespace }}"
Understanding the YAML
We made a number of small changes here. These fields tell Sensu what metric format to expect as output from the check, which handler(s) should be used to process the metrics, and what tags should be added to the metrics:

- Added the --metrics option to check-disk-usage. This tells the plugin to switch from the human-readable output we were using previously to Prometheus-formatted output.
- Added output_metric_format: prometheus_text. This tells Sensu to expect Prometheus-formatted output from the check. The agent will parse the check’s output into structured data in the event’s metrics property.
- Added an output_metric_handlers list with the sumologic-metrics handler. This tells the backend to send the metrics data to the Sumo Logic handler, which will record it in the data platform.
- Added the output_metric_tags property. This defines two tags, entity and namespace, which are populated using the tokens .name and .namespace. These tags will be added as metadata in the metrics data structure on each metric point.
Update the Check Using sensuctl create.
sensuctl create -f disk.yaml
Verify that the Check was Successfully Created.
sensuctl check info disk-usage --format yaml
Verify that the Check Produces an Event.
sensuctl event list
Inspect the Event Output.
sensuctl event info workshop disk-usage --format yaml
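The full event is fairly verbose; the part to look for is the metrics section. An abbreviated, illustrative sketch of what it might contain (metric names, values, timestamps, and tags will differ on your system):

metrics:
  handlers:
    - sumologic-metrics
  points:
    - name: disk_used_percent    # illustrative metric name
      value: 42.5
      timestamp: 1671234567      # unix timestamp
      tags:
        - name: entity
          value: workshop
        - name: namespace
          value: default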
NEXT: If you can see the check running, with event data being produced with the Prometheus output style, you’re ready to move on to the next step.
In this lesson you learned how to configure checks, which are periodic monitoring jobs, and how to select which hosts to run checks on using subscriptions. You also learned how to use tokens to template the check to have a unique configuration on each host, and covered some powerful tools to help modernize an older Nagios-based monitoring solution.
The publish/subscribe model is powerful in ephemeral or elastic infrastructures, where endpoint identifiers are unpredictable and break traditional host-based monitoring configuration.
Instead of configuring monitoring on a per-host basis, Sensu follows a service-based model, with one subscription per service (e.g. “postgres”). Agents on ephemeral compute instances simply register with a Sensu backend, subscribe to the relevant monitoring topics, and begin reporting observability data.
Because subscriptions are loosely coupled references, Sensu checks can be configured with subscriptions that have no agent members and the result is simply a “no-op” (no action is taken).
The Agent Events API makes it easy to implement Dead Man Switches with as little as one line of Bash or PowerShell (see below for examples).
This can be implemented via the event.check.ttl property in the event specification. Setting a TTL instructs Sensu to expect a continued stream of events; if the delay between events exceeds the configured TTL, Sensu generates a TTL event with output like “Last check execution was 120 seconds ago”.
Dead Man Switches are useful for monitoring scheduled jobs like nightly backups. For example, you could add a line at the end of a cron’ed Bash script to report on the backup status, with a ~25hr TTL. A failed backup job will result in a TTL event without any additional if/else-style conditional logic, or any additional code to send a “job failed” event. The absence of an “OK” event during the TTL window is all that is needed.
Examples: Dead Man Switches using TTLs
MacOS/Linux
curl -XPOST -H 'Content-Type: application/json' -d '{"check":{"metadata":{"name":"dead-mans-switch"},"output":"Alert if another event is not received in 30s","status":0,"ttl":30}}' 127.0.0.1:3031/events
Windows (PowerShell)
Invoke-RestMethod -Method POST -ContentType "application/json" -Body '{"check":{"metadata":{"name":"dead-mans-switch"},"output":"Alert if another event is not received in 30s","status":0,"ttl":30}}' -Uri "${Env:SENSU_API_URL}/api/core/v2/namespaces/${Env:SENSU_NAMESPACE}/events"
The Sensu scheduler can also schedule checks for entities that are not actively managed by a Sensu agent. These monitoring jobs are called proxy checks, or checks that target a proxy entity.
At a high level, a proxy check is a Sensu check with proxy_requests, which are query parameters Sensu will use to look for entities that should be targeted by the check.
In the following example, we would expect Sensu to find two (2) entities with entity_class == "proxy" and a proxy_type label set to "website". For each matching entity, the backend will first replace the tokens using that entity’s attributes. This would create one request to execute the command nslookup sensu.io, and one request to execute the command nslookup google.com.
To avoid redundant processing, we recommend using the round_robin attribute with proxy checks.
---
type: CheckConfig
api_version: core/v2
metadata:
  name: proxy-nslookup
spec:
  command: >-
    nslookup {{ .annotations.proxy_host }}
  runtime_assets: []
  publish: true
  subscriptions:
    - workshop
  interval: 30
  timeout: 10
  round_robin: true
  proxy_requests:
    entity_attributes:
      - entity.entity_class == "proxy"
      - entity.labels.proxy_type == "website"
---
type: Entity
api_version: core/v2
metadata:
  name: proxy-a
  labels:
    proxy_type: website
  annotations:
    proxy_host: sensu.io
spec:
  entity_class: proxy
---
type: Entity
api_version: core/v2
metadata:
  name: proxy-b
  labels:
    proxy_type: website
  annotations:
    proxy_host: google.com
spec:
  entity_class: proxy
Proxy entities are discussed in greater detail in Lesson 13: Introduction to Proxy Entities & Proxy Checks.