Reduce the Ops Car Alarms (Noisy Ignored Alerts)

Open Firefishy opened this issue 2 years ago • 4 comments

There are many spurious car alarms (Ops Ignored Alerts)

TO BE COMPLETED with examples

  • Alertmanager may need additional tuning
  • The team should know how to silence alerts.
  • Reduce other alerts?
  • Cronjob emails?
  • Chef email alerts?
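
As a starting point for the silencing item above, here is a minimal Python sketch of creating a time-limited silence through Alertmanager's v2 HTTP API (`POST /api/v2/silences`). The helper names `build_silence` and `post_silence` are hypothetical, as are any label values or URLs passed to them; this is not taken from the actual ops configuration.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib import request


def build_silence(matchers, hours, created_by, comment, now=None):
    """Build the JSON body for Alertmanager's POST /api/v2/silences.

    `matchers` is a dict of label name -> exact label value.
    """
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in matchers.items()
        ],
        # Timezone-aware isoformat() yields RFC 3339 timestamps.
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


def post_silence(base_url, silence):
    """Send the silence to Alertmanager and return the parsed JSON reply."""
    req = request.Request(
        base_url.rstrip("/") + "/api/v2/silences",
        data=json.dumps(silence).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

For example, `post_silence("http://alertmanager.example:9093", build_silence({"alertname": "ExampleAlert"}, 4, "ops", "known issue, ticket logged"))` would silence a hypothetical `ExampleAlert` for four hours. The `amtool silence add` command offers the same operation from the command line.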

Firefishy avatar Mar 23 '23 18:03 Firefishy

I do tune them as best I can but there are some which it's very hard to get right - any noise is not for want of trying!

tomhughes avatar Mar 23 '23 18:03 tomhughes

Not a criticism of the improvements at all. I am partially to blame as some of my stuff has been needlessly alerting until I finally started using Alertmanager silencing.

Firefishy avatar Mar 23 '23 18:03 Firefishy

I do wish Alertmanager had Nagios's "acknowledge" feature, where it silences the alert not for a fixed time but until the alarm clears; it then resets and will alert again if it retriggers.

It's great for things like hardware faults where you don't know how long they will take to fix - you log a ticket or whatever and then acknowledge the alert and as soon as it is fixed the alert rearms.
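
To make those semantics concrete, here is a small Python model of "acknowledge until clear". It only illustrates the behaviour described above; it is not an Alertmanager feature or API, and the class and method names are hypothetical.

```python
class AckState:
    """Tracks Nagios-style acknowledgements for alerts, keyed by name."""

    def __init__(self):
        self.acked = set()

    def acknowledge(self, alert):
        """Suppress notifications for `alert` until it next clears."""
        self.acked.add(alert)

    def should_notify(self, alert, firing):
        """Return True if a notification should go out for this evaluation."""
        if not firing:
            # The alert cleared: drop the acknowledgement so it re-arms.
            self.acked.discard(alert)
            return False
        return alert not in self.acked
```

Unlike a fixed-duration silence, the acknowledgement here has no expiry time: it lasts exactly as long as the alert keeps firing, and the next occurrence notifies again.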

tomhughes avatar Mar 23 '23 19:03 tomhughes

One thing that I would like to get rid of is the old hwraid monitors and their alerts, given that we have Prometheus alerting on degraded arrays now. So I think cciss-vol-statusd and the like can go now?

tomhughes avatar Mar 23 '23 20:03 tomhughes