Stripe Datadog checks
This is a collection of plugins — checks in Datadog parlance — for the Datadog agent that Stripe has found useful.
Motivation
We've sent a lot of patches to Datadog, and we regularly work closely with them on our ideas. But sometimes we want something that isn't a fit for the mainline Datadog agent. To that end, we've created this repository to hold work that is either in flight or that we've decided isn't a fit for inclusion in the core agent set. We hope you find it useful!
Using The Checks
Place the .py file you want to use into the checks directory — /etc/dd-agent/checks.d by default — and the YAML config file in the config directory — /etc/dd-agent/conf.d by default — and you should be ready to go! Restart the agent and run /etc/init.d/datadog-agent info to verify that the plugin is working.
Each plugin here is provided with a sample config file containing some documentation.
Checks
Here's our list of checks!
File
Uses Python's glob.glob to look for at least one file matching the provided path. You can control the success or failure of this check via expect, using one of present or absent. For example, if you use expect: present and the file does not exist, this check will fail. If you use expect: absent and the file is absent, it will emit OK!
The service check and any emitted metrics are tagged with the path, expected_status and actual_status. Its check message will be "File %s that was expected to be %s is %s instead" % (path, expect, status).
If this check does find a path that matches, it will also emit a gauge, file.age_seconds, containing the age in seconds of the oldest file matching the path.
---
init_config:
instances:
# Puppet locks (these might turn stale):
- path: '/etc/stripe/facts/puppet_locked.txt'
expect: absent
# Package upgrades requiring reboots
- path: '/var/run/stripe/restart-required/*'
expect: absent
Jenkins Metrics
Fetches metrics from the Jenkins Metrics Plugin (which you must install separately). It fetches all the metrics under vm.* and emits them as gauges, except vm.gc.*.count and vm.gc.*.time, which are emitted as monotonic_count.
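A minimal instance configuration might look like the sketch below. The url option name and the endpoint path are assumptions based on how the Jenkins Metrics Plugin is usually exposed, so check the bundled sample config for the real option names.
init_config:

instances:
  # Hypothetical option name; point it at the Metrics Plugin's servlet,
  # which is normally served under /metrics/<access key>/metrics.
  - url: 'http://jenkins.example.com:8080/metrics/<your-access-key>/metrics'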
Linux VM Extras
Fetches the following metrics by polling Linux's /proc/vmstat and emits them under the system.linux.vm namespace:
- pgpgin as pages.in
- pgpgout as pages.out
- pswpin as pages.swapped_in
- pswpout as pages.swapped_out
- pgfault as pages.faults
- pgmajfault as pages.major_faults
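Since the check only reads /proc/vmstat, its configuration is likely minimal; the empty instance below is an assumption, so consult the bundled sample config for any real options.
init_config:

instances:
  # Assumed: no per-instance options are needed, so a single empty instance suffices.
  - {}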
NSQ
Fetches the following metrics by polling NSQ's /stats endpoint:
- nsq.topic_count
- nsq.topic.channel_count
- nsq.topic (all tagged with topic_name):
  - depth
  - backend_depth
  - message_count (count, not gauge)
- nsq.topic.channel (all tagged with topic_name and channel_name):
  - depth
  - backend_depth
  - in_flight_count
  - deferred_count
  - message_count (count, not gauge)
  - requeue_count
  - timeout_count
  - e2e_processing_latency.p50 (nanoseconds)
  - e2e_processing_latency.p95 (nanoseconds)
  - e2e_processing_latency.p99 (nanoseconds)
  - e2e_processing_latency.p999 (nanoseconds)
  - e2e_processing_latency.p9999 (nanoseconds)
- nsq.topic.channel.client (all tagged with topic_name, channel_name, client_id, client_version, tls, user_agent, deflate and snappy):
  - ready_count
  - in_flight_count
  - message_count (count, not gauge)
  - finish_count
  - requeue_count
If you have extra tags you would like to parse from any of your topic names, you can include topic_name_regex as a Python regex in your init_config. The regex will be applied to each topic name and, if there is a match, the name of each symbolic group and the value it captured will be included as a tag key/value pair.
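Here is a hedged sketch of what that could look like. The topic_name_regex option comes from the description above, while the per-instance url option name is an assumption; consult the bundled sample config for the actual keys.
init_config:
  # Named groups become tag key/value pairs; a topic named "payments-4"
  # would gain the tags service:payments and shard:4.
  topic_name_regex: '(?P<service>.+)-(?P<shard>\d+)'

instances:
  # Assumed option name for the nsqd HTTP address that exposes /stats.
  - url: 'http://localhost:4151'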
Nagios Runner
The Nagios Runner check takes a list of check "instances". Each instance is executed and, per the Nagios Plugin API, its return value is inspected and a service check is submitted using the provided name.
Note: The supplied checks are executed sequentially. You may run into performance issues if you attempt to run too many checks, or checks that execute very slowly. This will effectively block the agent and cause all sorts of hiccups!
init_config:
# Not needed
instances:
- name: "stripe.check.is_llama_on_rocket"
command: "/usr/lib/nagios/plugins/check_if_llama_is --on rocket"
- name: "stripe.check.falafel_length"
command: "/usr/lib/nagios/plugins/check_falafel -l 1234"
OpenVPN
The OpenVPN check counts the number of active VPN connections per user. Combined with a Datadog monitor, it ensures that the same user isn't logged in too many times (e.g., multiple sustained VPN connections for the same user can indicate a compromised laptop).
Each VPN is accessible over both TCP and UDP, and is available to both privileged (Stripe employees) and unprivileged users (vendors). The unique combination of these is considered a VPN "level", and OpenVPN emits a status file every 10 seconds for each level to indicate the currently-active connections. When a user disconnects (e.g. if their Internet connection drops out) or if their IP address/port changes, they may appear in the status file multiple times. This is fine, as long as the number of connections per user drops down to 1 within a minute or so.
The status file also contains useful information such as the IP (which can be used for geolookups), the connection duration (which can be used to ensure that the VPN is online and that it isn't cycling users), and the number of bytes sent/received (which could be used to detect erratic behavior).
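The configuration below is only a sketch: the status_file option name and the paths are hypothetical, standing in for wherever your OpenVPN instances write the status files described above. Check the bundled sample config for the real option names.
init_config:

instances:
  # Hypothetical option name and paths; one instance per VPN "level" status file.
  - status_file: '/var/run/openvpn/tcp-privileged.status'
  - status_file: '/var/run/openvpn/udp-unprivileged.status'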
Out of memory killer (OOM)
This check emits a failure when any process has been killed by the OOM killer since the system last started up. It continues to emit criticals until the log file is removed or the system is restarted (provided the log file contains uptimes with which to detect a reboot).
It reads the configured logfile as syslog kernel output, looking for lines matching the kernel_line_regex property. The regular expression should provide named capture groups for message and, optionally, uptime. The uptime capture group is how the check detects system reboots; it will stop looking for OOM instances when it detects a reboot. The second configurable regular expression, kill_message_regex, extracts information from the message data itself, which is included in the service check message (not as tags, as tags would pose problems with alert recovery).
An example configuration for a base Ubuntu system goes like this:
---
init_config:
instances:
- logfile: '/var/log/kern.log'
kernel_line_regex: '^(?P<timestamp>.+?) (?P<host>\S+) kernel: \[\s*(?P<uptime>\d+(?:\.\d+)?)\] (?P<message>.*)$'
kill_message_regex: '^Out of memory: Kill process (?P<pid>\d+) \((?P<pname>.*?)\) score (?P<score>.*?) or sacrifice child'
This file is included in conf.d/oom.yaml.
Two error cases also emit service checks:
- If the log file is not present, a warning is emitted; this is not inherently a problem but could indicate misconfiguration
- If a permission error prevents dd-agent from reading the file, a critical is emitted; this is a definite failure and needs correcting
Segfault
This check emits a count of segfaults over a time window as a gauge, tagged by the process name.
Example configuration looks like this:
---
init_config:
instances:
- logfile: '/var/log/kern.log'
kernel_line_regex: '^(?P<timestamp>.+?) (?P<host>\S+) kernel: \[\s*(?P<uptime>\d+(?:\.\d+)?)\] (?P<message>.*)$'
process_name_regex: '^(?P<process>[^\[]+)\[(?P<pid>\d+)\]: segfault'
timestamp_format: '%b %d %H:%M:%S'
time_window_seconds: 60
This file is included in conf.d/oom.yaml.
You can optionally specify a tags option in the instance config to add extra tags to any metrics emitted.
Errors for this check are emitted as tagged counters with the metric name system.segfault.errors. The type tag indicates the kind of error that was encountered:
- type:config is an error loading config data that is expected to exist, or an error creating the regular expressions
- type:io is an error reading the specified log file
- type:parse is an error extracting the message component of kernel_line_regex, or an error parsing the timestamp
When successful, if any segfaults are found in the log within the last time_window_seconds seconds at the time the check runs, they are emitted as system.segfault.count with the tags process:<process name> and time_window:<value of time_window_seconds>.
Outdated Packages
This check verifies that the given packages are not outdated (currently, only on Ubuntu). You can specify a set of package names and versions (split out by release), and this check will report critical if the current version of that package is older than the specified version. For example:
init_config:
# Not needed
instances:
- package: bash
version:
precise: "4.2-2ubuntu2.6"
trusty: "4.3-7ubuntu1.5"
- package: openssl
version:
precise: "1.0.1-4ubuntu5.31"
trusty: "1.0.1f-1ubuntu2.15"
Resque
Inspects the Redis storage for a Resque instance and outputs some metrics:
- resque.jobs.failed_total - number of jobs failed (monotonic_count)
- resque.jobs.processed_total - number of jobs processed (monotonic_count)
- resque.queues_count - number of queues (gauge)
- resque.worker_count - number of workers (gauge)
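A sketch of a possible configuration follows; the host and port option names for reaching Resque's Redis are assumptions, so check the bundled sample config for the real ones.
init_config:

instances:
  # Assumed option names for the Redis backing your Resque install.
  - host: 'localhost'
    port: 6379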
Slapd (OpenLDAP's Stand-alone LDAP Daemon)
This check queries and surfaces statistics from the monitor backend of a running slapd instance. It will emit the following metrics:
- slapd.connect_time - time taken to connect to the server (histogram)
- slapd.connections.total - total number of connections (monotonic_count)
- slapd.connections.current - current number of connections (gauge)
- slapd.statistics.bytes_total - total bytes sent (monotonic_count)
- slapd.statistics.entries_total - total entries sent (monotonic_count)
- slapd.threads.active - number of active threads (gauge)
- slapd.threads.open - number of open threads (gauge)
- slapd.threads.pending - number of pending threads (gauge)
- slapd.threads.starting - number of threads being started (gauge)
- slapd.waiters.read - number of clients waiting to read (gauge)
- slapd.waiters.write - number of clients waiting to write (gauge)
In addition, the check will emit a service check (slapd.can_connect) that indicates whether it was able to successfully connect to the LDAP server.
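A hedged example of the check side of the configuration; the uri option name is an assumption, so check the bundled sample config for the real keys.
init_config:

instances:
  # Assumed option name; point it at the slapd instance whose monitor
  # backend is enabled as described below.
  - uri: 'ldap://127.0.0.1:389'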
Slapd Configuration
To enable the monitor backend, you can add the following lines to slapd.conf:
moduleload back_monitor
database monitor
access to dn="cn=monitor"
  by peername=127.0.0.1 read
  by * none
This allows only clients on the local machine to access the backend, since it may contain potentially-sensitive information.
Storm REST API
This check comes in two parts: one is a cronjob-able script in scripts/cache-storm-data (intended to run every minute, or whatever interval doesn't overload your nimbus), and the other is a check that reads the generated JSON file and emits metrics.
For the check, we recommend running it at an interval 2x faster than the cache-storm-data cron job runs (using the min_collection_interval: <Nsec> config parameter in init_config).
You can configure the topologies considered for emission using the topologies regex, and the check will group all the matched metrics (picking the youngest ACTIVE one for any that have name collisions).
The caching process can be very time-consuming since storm's executor and per-topology stats take a really long time to generate. It's best to run the cache script a few times across the lifetime of your storm topologies to get a feel for how long it takes and how resource-intensive the metrics-gathering can be.
The storm_rest_api.yaml config file is used by both the cache script and the check.
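For illustration, a configuration along these lines could tie the pieces together; the topologies and min_collection_interval names come from the description above, while the cache_file option name and path are assumptions.
init_config:
  # Run the check roughly twice as often as the cache-storm-data cron job
  # (with the cron job on a one-minute interval, that is every 30 seconds).
  min_collection_interval: 30

instances:
  # topologies is a regex selecting which topologies to emit; the cache_file
  # option name and path are hypothetical stand-ins for the generated JSON file.
  - topologies: '^payments-.*'
    cache_file: '/var/cache/storm/storm_rest_api.json'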
Splunk
Collects metrics from a Splunk master about the status of a Splunk cluster. It assumes you are using Search Head Clustering and queries the SHC captain for search information.
It emits these service checks:
- splunk.can_connect when things break during fetching status
- splunk.index.is_healthy for "unhealthy" indices, tagged by index_name. See the message for more details.
- splunk.peer.is_healthy for "unhealthy" nodes, tagged by peer_name. See the message for more details.
It emits these metrics:
- splunk.fixups
  - jobs_present tagged by index_name and fixup_level
- splunk.indexes tagged by index_name
  - replication tagged by index_copy, for each "copy"
    - actual_copies - Number of copies that actually exist.
    - expected_copies - Number of copies that should exist.
  - search tagged by index_copy, for each "copy"
    - actual_copies - Number of copies that actually exist.
    - expected_copies - Number of copies that should exist.
  - size_bytes - The total size in bytes.
  - total_excess_bucket_copies - The total number of excess copies for all buckets.
  - total_excess_searchable_copies - The total number of excess searchable copies for all buckets.
- splunk.peers tagged by peer_name and site
  - bucket_count - The number of buckets on this peer, tagged additionally by index.
  - bucket_status - The number of buckets in a given status on this peer, tagged additionally by bucket_status.
  - delayed_buckets_to_discard - The number of buckets waiting to be discarded on this peer.
  - peers_present - The number of peers available (as a gauge), tagged additionally by status.
  - primary_count - The number of buckets for which the peer is primary in its local site, or the number of buckets that return search results from the same site as the peer.
  - primary_count_remote - The number of buckets for which the peer is primary that are not in its local site.
  - replication_count - The number of replications this peer is part of, as either source or target.
- splunk.search_cluster
  - captains - The count of captains, tagged by site. This can be used to ensure a captain exists and to detect split-brain.
  - member_statuses - The number of members in the search cluster, tagged by status and site.
- splunk.searches
  - in_progress - In-progress search gauge, tagged by is_saved and search_owner.
You can configure it thusly:
---
init_config:
default_timeout: 30
instances:
- url: https://localhost:8089
username: obsrobot
password: foobar
SubDir Sizes
The SubDir Sizes check is a sister to Datadog's directory integration. Our needs required enough differences that making a new integration seemed the easier path and made for a less complex configuration. It takes a directory and emits a total size (in bytes) and a count of files for each subdirectory it finds. It can also use a regular expression to dynamically create tags for each subdirectory.
This integration is useful for getting tag-friendly metrics for backup directories and things like Kafka that store data in subdirectories.
Here's the config we use for Kafka:
init_config:
instances:
- directory: "/pay/kafka/data"
dirtagname: "name"
subdirtagname: "topic"
subdirtagname_regex: "(?P<topic>.*)-(?P<partition>\\d+)"
Note: The regular expression provided to subdirtagname_regex should use named groups, such that calling groupdict() on the resulting match provides name-value pairs for use as tags!
And here are the metrics, each of which will be tagged with $dirtagname:$DIRECTORY and $subdirtagname:basename(subdir), plus whatever tags come from subdirtagname_regex:
- system.sub_dir.bytes
- system.sub_dir.files