graphite-beacon icon indicating copy to clipboard operation
graphite-beacon copied to clipboard

PagerDuty incident resolution support

Open ivachkov opened this issue 8 years ago • 0 comments

PagerDuty incidents are generated with messages like:

[BEACON] CRITICAL <Test Wildcard Alert> (stats.gauges.server2.data) failed. Current value: 1.0

"Back to normal" messages look like:

[BEACON] NORMAL <Test Wildcard Alert> (stats.gauges.server1.data) is back to normal.

From those, I figured that incident-unique information is the combination of alert name (<Test Wildcard Alert>) and metrics name ((stats.gauges.server1.data)). To avoid storing data on the file system, I decided to generate a hash out of those two. Using this hash value incidents can be triggered and resolved in a stateless way.

Following tests were performed:

  • Tested alerts with exact metric match ("query": "stats.gauges.test")
  • Tested wildcard metrics alerts ("query": "stats.gauges.*.data")

ivachkov avatar Oct 11 '16 06:10 ivachkov