govuk-puppet
Increase default lookback time for Graphite alerts to 10m.
Our old single-threaded Carbon setup struggles under load and intermittently drops data points, causing Graphite-based Icinga alerts to flap in and out of "unknown" status.
While it would be good to fix the underlying problem, a few hours' investigation didn't reveal any clear root causes of the dropped Carbon data points. Increasing the default lookback window for the metric queries behind these Icinga alerts should significantly reduce the noise in the meantime.
I had a quick look through all of the Graphite/Carbon-based alerts that we currently have, and I didn't find anything important-looking that wasn't already providing a specific value for the "from" argument. So this basically only affects unactionable alerts like "CPU usage high" anyway. And besides, 10 minutes is still quite a short window for an alert query. Many of the existing alerts are already specifying 15m.
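To illustrate why a longer window should help, here is a rough Python sketch rather than the actual check code. It assumes the alert queries Graphite's render API with a relative "from" value and averages whatever non-null data points come back, reporting UNKNOWN when there are none. The base URL, metric name, thresholds, one-point-per-minute resolution and the averaging behaviour are all made up for the example.

```python
from urllib.parse import urlencode


def render_url(base_url, target, lookback="10minutes"):
    """Build a Graphite render API URL with a relative lookback window."""
    query = urlencode({"target": target, "from": f"-{lookback}", "format": "json"})
    return f"{base_url}/render?{query}"


def check_status(datapoints, warning, critical):
    """Classify a series of (value, timestamp) pairs.

    Assumption: the check averages the non-null values in the window and
    returns UNKNOWN when every data point in the window was dropped.
    """
    values = [value for value, _timestamp in datapoints if value is not None]
    if not values:
        return "UNKNOWN"
    average = sum(values) / len(values)
    if average >= critical:
        return "CRITICAL"
    if average >= warning:
        return "WARNING"
    return "OK"


# Hypothetical series at one point per minute: Carbon dropped the last five.
series = [(40.0, 0), (42.0, 60), (41.0, 120), (43.0, 180), (44.0, 240),
          (None, 300), (None, 360), (None, 420), (None, 480), (None, 540)]

five_minute_result = check_status(series[-5:], warning=80, critical=90)
ten_minute_result = check_status(series, warning=80, critical=90)
```

With this made-up data the 5-minute window contains only dropped points, while the 10-minute window still has real values to evaluate.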
Lol it turns out the tests depend on the defaults. Oh dear.