Expose counter metric for notifications that weren't sent due to silences
What did you do?
I ran into an edge case when monitoring an alertmanager. I want to see what proportion of my actual notifications are covered by silences. AFAIK, we need to introduce a new metric to solve this. If there's some PromQL magic that I missed that can be used to derive this value, please let me know.
We should expose a new counter, something like alertmanager_notifications_silenced, that counts notifications that were not sent because a silence was present.
We have a few similar metrics that don't exactly solve this.
alertmanager_silences_active - This is a gauge that tracks the current number of active alerts covered by silences. This is subtly different, as:
- Multiple notifications might be sent for a single alert (repeat_interval).
- Silences might be created or expired between scrapes. If you create and expire two different silences between the same two scrapes, you end up with more silenced notifications but no visible change in the value of this gauge. So, it doesn't capture all the information necessary.
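To make the second point concrete, here is a minimal, standard-library-only sketch in Go. The `event` type, the event names, and the scrape times are all hypothetical and exist only for this illustration; none of this is Alertmanager code. It replays a timeline in which two silences are created and expired between two scrapes, and compares what a gauge shows at scrape time with what a counter would have accumulated.

```go
package main

import "fmt"

// event is a hypothetical timeline entry used only for this illustration.
type event struct {
	at   int    // seconds since start
	kind string // "silence_created", "silence_expired", or "notification_silenced"
}

// simulate replays events (sorted by time) and returns the gauge value seen at
// each scrape, plus the value of a monotonically increasing counter after the
// last scrape.
func simulate(events []event, scrapes []int) (gaugeAtScrape []int, counter int) {
	active := 0
	i := 0
	for _, s := range scrapes {
		// Apply every event that happened at or before this scrape.
		for i < len(events) && events[i].at <= s {
			switch events[i].kind {
			case "silence_created":
				active++
			case "silence_expired":
				active--
			case "notification_silenced":
				counter++
			}
			i++
		}
		gaugeAtScrape = append(gaugeAtScrape, active)
	}
	return gaugeAtScrape, counter
}

func main() {
	events := []event{
		{at: 5, kind: "silence_created"},
		{at: 10, kind: "notification_silenced"},
		{at: 20, kind: "silence_expired"},
		{at: 30, kind: "silence_created"},
		{at: 35, kind: "notification_silenced"},
		{at: 50, kind: "silence_expired"},
	}
	gauge, counter := simulate(events, []int{0, 60})
	fmt.Println("gauge at scrapes:", gauge)
	fmt.Println("silenced notifications counted:", counter)
}
```

In this timeline the gauge reads 0 at both scrapes, while the counter has advanced by 2, which is the information the gauge cannot provide.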
alertmanager_notifications_total by integration - Doesn't seem to care about silences, and only tracks notifications by receiver.
What did you expect to see?
I wanted to count the number of actual notifications that were silenced over time, from the emitted metrics.
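As a rough sketch of how the proportion could be derived once such a counter exists, the Go snippet below computes the growth of two counters from raw samples and divides them. It assumes, hypothetically, that one series counts silenced notifications and the other counts notifications that were actually sent, with no overlap between the two. The function names and sample values are made up for illustration, and the reset handling is a simplification of what a PromQL increase-style query does.

```go
package main

import "fmt"

// increase returns the total growth of a counter across samples, treating any
// drop in value as a counter reset (for example, a process restart).
func increase(samples []float64) float64 {
	total := 0.0
	for i := 1; i < len(samples); i++ {
		delta := samples[i] - samples[i-1]
		if delta < 0 {
			// The counter was reset; the new value is all growth since the reset.
			delta = samples[i]
		}
		total += delta
	}
	return total
}

// silencedRatio returns the share of notifications that were silenced, given
// samples of a hypothetical "silenced" counter and a "sent" counter.
func silencedRatio(silenced, sent []float64) float64 {
	s := increase(silenced)
	n := increase(sent)
	if s+n == 0 {
		return 0 // nothing happened in the window
	}
	return s / (s + n)
}

func main() {
	// Both series reset between the third and fourth sample.
	silenced := []float64{0, 2, 5, 1, 3}
	sent := []float64{10, 14, 20, 4, 7}
	fmt.Printf("silenced share: %.2f\n", silencedRatio(silenced, sent))
}
```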
Environment
- System information:
n/a
- Alertmanager version:
main
- Prometheus version:
n/a
- Alertmanager configuration file:
n/a - any
- Prometheus configuration file:
n/a - any
- Logs:
n/a
Great shout! I think we should also introduce a metric for when an alert is deduped.
Deduping, silencing, and notifying all happen within the "Stage Pipeline", but right now we only have visibility into the "notify" stage. This seems incorrect to a certain extent.
A few things to consider from me are:
- Do we want extra labels on notifications_total, notifications_failed_total, notification_requests_total, notification_requests_failed_total as opposed to new metrics? Does that even make sense in this case? I think that for silencing it doesn't, but for something like deduping it does.
- Instead of alertmanager_notifications_silenced I'd suggest alertmanager_notifications_supressed_total, as the muting is not only done via silences but also via mute timings and inhibitions. A corollary to that is: do we care about that distinction? We have a number of "bugs" re mute timings and inhibition that are close to impossible to debug or reproduce due to the lack of visibility in this area.
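If the distinction does matter, one option is a single counter partitioned by a reason label. The sketch below is standard-library-only Go and is not Alertmanager's implementation: the `suppressedCounter` type stands in for a labelled Prometheus counter, and the `alert` type, the reason values, and the order of the checks are all assumptions made for illustration. It counts one increment per dropped alert; whether to count alerts or notifications is left open here.

```go
package main

import "fmt"

// Hypothetical values for a "reason" label.
const (
	reasonSilence    = "silence"
	reasonInhibition = "inhibition"
	reasonMuteTiming = "mute_timing"
)

// suppressedCounter is a stand-in for a labelled counter such as the proposed
// suppressed-notifications metric. It only ever increases.
type suppressedCounter struct {
	byReason map[string]int
}

func newSuppressedCounter() *suppressedCounter {
	return &suppressedCounter{byReason: make(map[string]int)}
}

func (c *suppressedCounter) inc(reason string) { c.byReason[reason]++ }

// alert is a hypothetical, simplified alert carrying precomputed mute flags.
type alert struct {
	name      string
	silenced  bool
	inhibited bool
	muted     bool // muted by a mute timing
}

// filter drops suppressed alerts and records why each one was dropped.
// An alert matching several reasons is counted once, under the first match;
// the order of the cases is arbitrary in this sketch.
func filter(alerts []alert, c *suppressedCounter) []alert {
	var out []alert
	for _, a := range alerts {
		switch {
		case a.inhibited:
			c.inc(reasonInhibition)
		case a.muted:
			c.inc(reasonMuteTiming)
		case a.silenced:
			c.inc(reasonSilence)
		default:
			out = append(out, a)
		}
	}
	return out
}

func main() {
	c := newSuppressedCounter()
	remaining := filter([]alert{
		{name: "DiskFull", silenced: true},
		{name: "HighLatency"},
		{name: "NodeDown", inhibited: true},
		{name: "BackupFailed", muted: true},
		{name: "CertExpiry", silenced: true},
	}, c)
	fmt.Println("alerts passed on to notify:", len(remaining))
	fmt.Println("suppressed by reason:", c.byReason)
}
```

With a per-reason breakdown like this, a mute timing or inhibition that unexpectedly swallows alerts would show up as growth under its own label value rather than being indistinguishable from a silence.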