alertmanager
alertmanager copied to clipboard
Fix repeated resolved alerts
Assumptions:
- Prometheus/XXRuler failed to send some active alerts, but succeeded to send all resolved alerts: alert-a: T1(active:notify-ok), T2(resolved:notify-ok), T3(active:notify-failed), T4(resolved:notify-ok).
- Or maybe the alertmanager is behind a proxy with some retry policy.
Reproduce:
- A repeated
alert-a[resolved]
is received during aggrGroup.flush. - We can not delete
alert-a[resolved]
since the alert has been changed. - We receive a new
alert-b[firing]
duringgroup_interval
, then bothalert-a[resolved]
andalert-b[firing]
will be sent out. - Users may receive
alert-a[resolved]
multiple times.
There is another case that may trigger this issue, if the time is right, the repeated alert-a[resolved]
is received right after aggrGroup gets deleted.
I think this might be on purpose and written for simplicity. Alertmanager has at least once promise, it's better to send one extra alert than miss one.
@roidelapluie Yes, alertmanager has at least once promise. I think that is because alertmanager will retry until the notification success, then remove those resolved alerts.
But in this case, we are sending repeated(orphan) resolved alerts, and there are race conditions in the code, I'd consider this as a BUG ~~rather than a "feature"~~.
#2474 could be the case.
@roidelapluie @simonpasquier @w0rm Please take a look in detail.
After reviewing the Alertmanager code and history, I kinda agree that there's room for improvement. As far as I can tell while https://github.com/prometheus/alertmanager/pull/2040 attempted to fix a race between the aggregation group and the alert store (e.g. a resolved alert could be removed before being notified) by disabling garbage-collection in the alerts store, it introduced a side effect that increased the likelihood of receiving several notifications with the same resolved alert.
Any change in the current logic would need careful attention though IMHO. Ideally a unit test highlighting the current issue in dispatch.aggrGroup
would be a good starting point.
@simonpasquier I have added test cases for this issue, the main branch will fail the test:
gotest -v -race -run TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts
=== RUN TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts
=== RUN TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts/repeated_alerts_after_aggrGroup_deleted
dispatch_test.go:743: unexpected repeated resolved alerts
=== RUN TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts/repeated_alerts_after_aggrGroup_flush
dispatch_test.go:773: unexpected repeated resolved alerts
--- FAIL: TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts (12.01s)
--- FAIL: TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts/repeated_alerts_after_aggrGroup_deleted (0.00s)
--- FAIL: TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts/repeated_alerts_after_aggrGroup_flush (12.00s)
FAIL
exit status 1
FAIL github.com/prometheus/alertmanager/dispatch 13.180s
Ping someone..
@damnever Hi! Apologize for asking, but could you please share some debug log here? I'm supposably having the same problem.
@bo-er I do not have any logs to reproduce the original issue since we are currently using this patch in our production environment. However, I can provide some logs that were generated by this patch:
... msg="ignore resolved alert since there is no corresponding record in the store"
@simonpasquier, could you please take another look at this? We have been running this in our production environment for over a year. I am confident that it has not introduced any side effects. It has resolved the issues I previously mentioned.