alertmanager Fix repeated resolved alerts

Assumptions:

Prometheus/XXRuler failed to send some active alerts, but succeeded to send all resolved alerts: alert-a: T1(active:notify-ok), T2(resolved:notify-ok), T3(active:notify-failed), T4(resolved:notify-ok).
Or maybe the alertmanager is behind a proxy with some retry policy.

Reproduce:

A repeated alert-a[resolved] is received during aggrGroup.flush.
We can not delete alert-a[resolved] since the alert has been changed.
We receive a new alert-b[firing] during group_interval, then both alert-a[resolved] and alert-b[firing] will be sent out.
Users may receive alert-a[resolved] multiple times.

There is another case that may trigger this issue, if the time is right, the repeated alert-a[resolved] is received right after aggrGroup gets deleted.

Jul 15 '22 13:07 damnever

I think this might be on purpose and written for simplicity. Alertmanager has at least once promise, it's better to send one extra alert than miss one.

Jul 18 '22 08:07 roidelapluie

@roidelapluie Yes, alertmanager has at least once promise. I think that is because alertmanager will retry until the notification success, then remove those resolved alerts.

But in this case, we are sending repeated(orphan) resolved alerts, and there are race conditions in the code, I'd consider this as a BUG ~~rather than a "feature"~~.

Jul 18 '22 08:07 damnever

#2474 could be the case.

Jul 19 '22 15:07 damnever

@roidelapluie @simonpasquier @w0rm Please take a look in detail.

Jul 26 '22 02:07 damnever

After reviewing the Alertmanager code and history, I kinda agree that there's room for improvement. As far as I can tell while https://github.com/prometheus/alertmanager/pull/2040 attempted to fix a race between the aggregation group and the alert store (e.g. a resolved alert could be removed before being notified) by disabling garbage-collection in the alerts store, it introduced a side effect that increased the likelihood of receiving several notifications with the same resolved alert.

Any change in the current logic would need careful attention though IMHO. Ideally a unit test highlighting the current issue in dispatch.aggrGroup would be a good starting point.

Oct 03 '22 14:10 simonpasquier

@simonpasquier I have added test cases for this issue, the main branch will fail the test:

gotest -v -race -run TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts
=== RUN   TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts
=== RUN   TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts/repeated_alerts_after_aggrGroup_deleted
    dispatch_test.go:743: unexpected repeated resolved alerts
=== RUN   TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts/repeated_alerts_after_aggrGroup_flush
    dispatch_test.go:773: unexpected repeated resolved alerts
--- FAIL: TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts (12.01s)
    --- FAIL: TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts/repeated_alerts_after_aggrGroup_deleted (0.00s)
    --- FAIL: TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts/repeated_alerts_after_aggrGroup_flush (12.00s)
FAIL
exit status 1
FAIL    github.com/prometheus/alertmanager/dispatch     13.180s

Oct 08 '22 07:10 damnever

Ping someone..

Nov 03 '22 09:11 damnever

@damnever Hi! Apologize for asking, but could you please share some debug log here? I'm supposably having the same problem.

Apr 14 '23 06:04 bo-er

@bo-er I do not have any logs to reproduce the original issue since we are currently using this patch in our production environment. However, I can provide some logs that were generated by this patch:

... msg="ignore resolved alert since there is no corresponding record in the store"

Apr 14 '23 06:04 damnever

@simonpasquier, could you please take another look at this? We have been running this in our production environment for over a year. I am confident that it has not introduced any side effects. It has resolved the issues I previously mentioned.

Dec 06 '23 11:12 damnever

alertmanager alertmanager copied to clipboard

Fix repeated resolved alerts

alertmanager
alertmanager copied to clipboard