graphite-beacon
graphite-beacon copied to clipboard
PagerDuty incident resolution support
PagerDuty incidents are generated with messages like:
[BEACON] CRITICAL <Test Wildcard Alert> (stats.gauges.server2.data) failed. Current value: 1.0
"Back to normal" messages look like:
[BEACON] NORMAL <Test Wildcard Alert> (stats.gauges.server1.data) is back to normal.
From those, I figured that incident-unique information is the combination of alert name (<Test Wildcard Alert>
) and metrics name ((stats.gauges.server1.data)
). To avoid storing data on the file system, I decided to generate a hash out of those two. Using this hash value incidents can be triggered and resolved in a stateless way.
Following tests were performed:
- Tested alerts with exact metric match (
"query": "stats.gauges.test"
) - Tested wildcard metrics alerts (
"query": "stats.gauges.*.data"
)