Fastly error rate alarms are spammy
On the 11th and 12th I got about 100 notifications from Fastly error rate alarms for the OSM community CDN. I'm not sure what was going on, but it looks like the alarm was flapping for some of that time.
I suspect there is a routing issue causing the failures. Needs investigation.
The errors appear to be due to first byte timeouts.
I have silenced the error.
Are we leaving the error permanently silenced?
I hope not! But no, I assume @Firefishy set a time-limited silence, as I didn't notice a commit changing the alerts.
3-day time limit.
Okay, then I'll leave this issue open as the alarm issue remains regardless of the routing issue.
What "alarm issue" is that exactly? Fastly were having intermittent problems reaching the origin server from one or more POPs so there were intermittent alarms.
What exactly would you have liked to be different?
The issue was that one event caused 100 alarms as different POPs caused different alarms to fire and resolve.
Was it one event or was it a series of events? Whatever, unless you know something I don't, there isn't a way to fix that.
As a trial in reducing the flappy alerts I have set keep_firing_for values for a number of different alerts: https://github.com/openstreetmap/chef/commit/26b1bdb9ddc8781526b9597ad79b0c566e4a6aaf
That should help cut down on noise
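For anyone wondering why keep_firing_for helps with flapping, here is a rough sketch in Python. It is only a simplified model of the behaviour as I understand it (an alert that is firing stays firing until its condition has been false for the configured duration), not the actual Prometheus implementation. It ignores the `for` clause, and the function name, sample data and 300 second value are made up for illustration and are not taken from the commit.

```python
def count_transitions(samples, keep_firing_for=0):
    """Count fire/resolve transitions for one alert.

    samples: list of (timestamp_seconds, condition_is_true), in time order,
             one entry per rule evaluation.
    keep_firing_for: seconds the alert stays firing after the condition
                     first stops being true (0 disables the behaviour).
    """
    firing = False
    false_since = None  # when the condition first went false while firing
    transitions = 0

    for ts, condition in samples:
        if condition:
            false_since = None  # condition is back, cancel any pending resolve
            if not firing:
                firing = True
                transitions += 1  # alert fires
        elif firing:
            if false_since is None:
                false_since = ts
            if ts - false_since >= keep_firing_for:
                firing = False
                false_since = None
                transitions += 1  # alert resolves

    return transitions


# Hypothetical flapping signal: evaluated every 60s, alternating bad/good.
flapping = [(i * 60, i % 2 == 0) for i in range(10)]

without_hold = count_transitions(flapping)
with_hold = count_transitions(flapping, keep_firing_for=300)
print(without_hold, with_hold)
```

In this toy case the alert fires and resolves on every evaluation without the hold, but fires only once with a 300 second hold, because the condition never stays false long enough to resolve it. The trade-off is that a real recovery is reported later by up to the hold duration.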