Reduce the Ops Car Alarms (Noisy Ignored Alerts)
There are many spurious "car alarms": ops alerts that are so noisy that they get ignored.
TO BE COMPLETED with examples
- Alertmanager may need additional tuning
- The team should know how to silence alerts.
- Reduce other alerts?
- Cronjob emails?
- Chef email alerts?
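
As a starting point for the silencing item above, here is a minimal sketch of creating a time-boxed silence through Alertmanager's v2 HTTP API (POST /api/v2/silences). The alert name, instance label, author, ticket reference and base URL are all hypothetical placeholders, and the helper names are invented for illustration; the same thing can be done from the Alertmanager web UI or with amtool.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_silence(matchers, hours, created_by, comment, now=None):
    """Build the JSON body for POST /api/v2/silences.

    `matchers` is a dict of label name -> exact value to match.
    """
    now = now or datetime.now(timezone.utc)
    ends = now + timedelta(hours=hours)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in matchers.items()
        ],
        "startsAt": now.isoformat(),
        "endsAt": ends.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


def post_silence(base_url, body):
    """Send the silence to Alertmanager and return the decoded JSON response."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/silences",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical labels and author; adjust to match the real alert.
    body = build_silence(
        {"alertname": "ExampleRaidDegraded", "instance": "db01"},
        hours=4,
        created_by="ops-example",
        comment="Disk replacement in progress, ticket logged",
    )
    print(json.dumps(body, indent=2))
    # To actually create it (assumed URL):
    # post_silence("http://alertmanager.example:9093", body)
```

Always setting a comment and an end time keeps silences auditable and stops them from quietly hiding a real problem later.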
I do tune them as best I can, but some are very hard to get right. Any noise is not for want of trying!
Not a criticism of the improvements at all. I am partially to blame, as some of my stuff was needlessly alerting until I finally started using Alertmanager silencing.
I do wish Alertmanager had Nagios's "acknowledge" feature, which silences an alert not for a fixed time but until the alarm clears; it then resets and will alert again if it retriggers.
It's great for things like hardware faults where you don't know how long they will take to fix: you log a ticket or whatever, acknowledge the alert, and as soon as it is fixed the alert rearms.
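
To make the "acknowledge" behaviour described above concrete, here is a small sketch of the semantics as a state tracker. It is purely illustrative: it is not how Alertmanager or Nagios implement anything, and the class and alert names are made up.

```python
class AckTracker:
    """Suppress notifications for acknowledged alerts until they clear."""

    def __init__(self):
        self._acked = set()

    def acknowledge(self, alert_id):
        """Mark an alert as acknowledged (e.g. after logging a ticket)."""
        self._acked.add(alert_id)

    def should_notify(self, alert_id, firing):
        """Return True if a notification should go out for this evaluation."""
        if not firing:
            # The alert has cleared: drop the acknowledgement so it rearms.
            self._acked.discard(alert_id)
            return False
        return alert_id not in self._acked


if __name__ == "__main__":
    tracker = AckTracker()
    print(tracker.should_notify("raid-db01", firing=True))   # notifies
    tracker.acknowledge("raid-db01")
    print(tracker.should_notify("raid-db01", firing=True))   # suppressed
    print(tracker.should_notify("raid-db01", firing=False))  # cleared, rearmed
    print(tracker.should_notify("raid-db01", firing=True))   # notifies again
```

The key difference from a fixed-duration silence is that the suppression ends when the alert clears, not when a timer expires.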
One thing I would like to get rid of is the old hwraid monitors and their alerts, given that we have Prometheus alerting on degraded arrays now. So I think cciss-vol-statusd and the like can go now?