govuk-puppet

Increase default lookback time for Graphite alerts to 10m.

Open · sengi opened this issue 3 years ago · 1 comment

Our old single-threaded Carbon setup struggles under load and intermittently drops data points, causing Graphite-based Icinga alerts to flap in and out of "unknown" status.

While it would be good to fix the underlying problem, a few hours' investigation didn't reveal any clear root causes of the dropped Carbon data points. Increasing the default lookback window for the metric queries behind these Icinga alerts should significantly reduce the noise in the meantime.
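To illustrate why a longer window helps, here is a minimal sketch in Python. It is not the actual check plugin: `check_status`, the averaging rule, and the threshold are all hypothetical, and the only thing taken from Graphite is the shape of a datapoint (a value that may be null, plus a timestamp). The idea is that a check returns "unknown" only when every point in its window is missing, so a wider window is more likely to contain at least one point that Carbon did not drop.

```python
from typing import List, Optional, Tuple

# (value, unix timestamp); the value is None where the data point was dropped.
Datapoint = Tuple[Optional[float], int]


def check_status(
    datapoints: List[Datapoint],
    window_seconds: int,
    now: int,
    warning: float,
) -> str:
    """Hypothetical alert check: average the non-null points in the window."""
    recent = [
        value
        for value, timestamp in datapoints
        if timestamp > now - window_seconds and value is not None
    ]
    if not recent:
        # Nothing usable in the window, so the check cannot decide.
        return "UNKNOWN"
    average = sum(recent) / len(recent)
    return "WARNING" if average >= warning else "OK"


# One point per minute; the most recent five minutes were dropped.
series = [(40.0 if ts <= 900 else None, ts) for ts in range(660, 1201, 60)]

five_minute_result = check_status(series, 300, now=1200, warning=80.0)
ten_minute_result = check_status(series, 600, now=1200, warning=80.0)
```

With this made-up series, the 5-minute window sees only dropped points, while the 10-minute window still reaches back to real values.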

I had a quick look through all of the Graphite/Carbon-based alerts that we currently have, and I didn't find anything important-looking that wasn't already providing a specific value for the `from` argument. So this basically only affects unactionable alerts like "CPU usage high" anyway. And besides, 10 minutes is still quite a short window for an alert query. Many of the existing alerts are already specifying `15m`.
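For reference, this is roughly what the lookback looks like at the Graphite end. The sketch below builds a render API URL with a relative `from` value; the helper name `render_url`, the host, and the metric path are made up for illustration, and how the Icinga check maps its own argument onto this parameter is an assumption rather than something checked against the plugin.

```python
from urllib.parse import urlencode


def render_url(base_url: str, target: str, lookback: str = "10min") -> str:
    """Build a Graphite render URL covering the last `lookback` of data.

    `lookback` uses Graphite's relative time format, e.g. "10min" or "1h".
    """
    query = urlencode({"target": target, "from": f"-{lookback}", "format": "json"})
    return f"{base_url}/render?{query}"


# Hypothetical host and metric path.
default_url = render_url("https://graphite.example", "servers.web-1.cpu.user")
explicit_url = render_url("https://graphite.example", "servers.web-1.cpu.user", "15min")
```

An alert that passes its own lookback, like `explicit_url` here, is unaffected by a change to the default.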

sengi · Mar 11 '21 15:03

Lol it turns out the tests depend on the defaults. Oh dear.

sengi · Mar 11 '21 15:03