
Fix repeated resolved alerts

Open · damnever opened this issue 2 years ago • 13 comments

Assumptions:

  1. Prometheus/XXRuler failed to send some active alerts but succeeded in sending all resolved alerts: alert-a: T1(active:notify-ok), T2(resolved:notify-ok), T3(active:notify-failed), T4(resolved:notify-ok).
  2. Or the Alertmanager is behind a proxy with some retry policy, so requests may be delivered more than once.

Reproduce:

  1. A repeated alert-a[resolved] is received during aggrGroup.flush.
  2. We cannot delete alert-a[resolved], since the alert has changed during the flush.
  3. We receive a new alert-b[firing] during group_interval, so both alert-a[resolved] and alert-b[firing] will be sent out in the next flush.
  4. Users may receive alert-a[resolved] multiple times.

There is another case that may trigger this issue: if the timing is right, the repeated alert-a[resolved] is received right after the aggrGroup gets deleted. (A simplified sketch of the flush race is shown below.)
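To make the race concrete, here is a minimal, self-contained Go sketch. The types and the flush logic are simplified placeholders, not the real dispatch package API: the group snapshots its alerts, notifies, and deletes a resolved alert only if it has not changed since the snapshot, so a repeated alert-a[resolved] arriving mid-flush keeps the alert alive and it is sent again with the next flush.

```go
// Simplified sketch of the aggrGroup flush race; not the real Alertmanager code.
package main

import "fmt"

type alert struct {
	name     string
	resolved bool
	version  int // bumped whenever the alert is (re-)received
}

type aggrGroup struct {
	alerts map[string]*alert
}

// flush notifies the current snapshot, then deletes resolved alerts that were
// not modified while the notification was in flight.
func (g *aggrGroup) flush(notify func([]*alert)) {
	snapshot := map[string]int{}
	var batch []*alert
	for _, a := range g.alerts {
		batch = append(batch, a)
		snapshot[a.name] = a.version
	}
	notify(batch)
	for name, v := range snapshot {
		if a, ok := g.alerts[name]; ok && a.resolved && a.version == v {
			delete(g.alerts, name)
		}
		// If the alert changed during the flush, it stays in the group and will
		// be included in the next flush as well.
	}
}

func main() {
	g := &aggrGroup{alerts: map[string]*alert{
		"alert-a": {name: "alert-a", resolved: true, version: 1},
	}}

	// Flush 1: alert-a[resolved] is notified, but a repeated alert-a[resolved]
	// arrives during the flush, so the delete-if-unchanged check fails.
	g.flush(func(batch []*alert) {
		fmt.Println("flush 1:", describe(batch))
		g.alerts["alert-a"].version++ // repeated resolved alert received mid-flush
	})

	// A new alert-b[firing] arrives within group_interval.
	g.alerts["alert-b"] = &alert{name: "alert-b", resolved: false, version: 1}

	// Flush 2: both alert-b[firing] and the stale alert-a[resolved] go out again.
	g.flush(func(batch []*alert) {
		fmt.Println("flush 2:", describe(batch))
	})
}

func describe(batch []*alert) []string {
	var out []string
	for _, a := range batch {
		state := "firing"
		if a.resolved {
			state = "resolved"
		}
		out = append(out, a.name+"["+state+"]")
	}
	return out
}
```

Running this prints alert-a[resolved] in both flushes, which matches step 4 above.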

damnever · Jul 15 '22

I think this might be on purpose and written for simplicity. Alertmanager makes an at-least-once delivery promise; it's better to send one extra alert than to miss one.

roidelapluie · Jul 18 '22

@roidelapluie Yes, Alertmanager makes an at-least-once promise. I think that is because Alertmanager retries until the notification succeeds and only then removes the resolved alerts.

But in this case, we are sending repeated (orphan) resolved alerts, and there are race conditions in the code, so I'd consider this a BUG ~~rather than a "feature"~~.

damnever · Jul 18 '22

#2474 could be the same case.

damnever · Jul 19 '22

@roidelapluie @simonpasquier @w0rm Please take a detailed look.

damnever · Jul 26 '22

After reviewing the Alertmanager code and history, I kinda agree that there's room for improvement. As far as I can tell, while https://github.com/prometheus/alertmanager/pull/2040 attempted to fix a race between the aggregation group and the alert store (e.g. a resolved alert could be removed before being notified) by disabling garbage-collection in the alerts store, it introduced a side effect that increased the likelihood of receiving several notifications for the same resolved alert.

Any change to the current logic would need careful attention though, IMHO. Ideally, a unit test highlighting the current issue in dispatch.aggrGroup would be a good starting point.

simonpasquier · Oct 03 '22

@simonpasquier I have added test cases for this issue; the main branch fails them:

```
gotest -v -race -run TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts
=== RUN   TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts
=== RUN   TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts/repeated_alerts_after_aggrGroup_deleted
    dispatch_test.go:743: unexpected repeated resolved alerts
=== RUN   TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts/repeated_alerts_after_aggrGroup_flush
    dispatch_test.go:773: unexpected repeated resolved alerts
--- FAIL: TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts (12.01s)
    --- FAIL: TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts/repeated_alerts_after_aggrGroup_deleted (0.00s)
    --- FAIL: TestDispatcherReceiveAndNotifyRepeatedResolvedAlerts/repeated_alerts_after_aggrGroup_flush (12.00s)
FAIL
exit status 1
FAIL    github.com/prometheus/alertmanager/dispatch     13.180s
```

damnever · Oct 08 '22

Ping someone..

damnever · Nov 03 '22

@damnever Hi! Apologies for asking, but could you please share some debug logs here? I seem to be having the same problem.

bo-er · Apr 14 '23

@bo-er I do not have any logs that reproduce the original issue, since we are already running this patch in our production environment. However, I can provide a log line that was generated by the patch:

```
... msg="ignore resolved alert since there is no corresponding record in the store"
```
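The check that emits this log line can be sketched roughly as below. The types and names are simplified placeholders, not the actual patch; the assumption is that a resolved alert with no existing record in the store is dropped before it can be dispatched again.

```go
// Hypothetical sketch of the guard behind the log line above; not the real patch.
package main

import (
	"errors"
	"log"
)

type alert struct {
	fingerprint string
	resolved    bool
}

type store struct {
	byFingerprint map[string]*alert
}

var errNotFound = errors.New("alert not found")

func (s *store) get(fp string) (*alert, error) {
	if a, ok := s.byFingerprint[fp]; ok {
		return a, nil
	}
	return nil, errNotFound
}

// put drops resolved alerts that have no corresponding record in the store,
// which suppresses the repeated (orphan) resolved notifications.
func (s *store) put(a *alert) {
	if a.resolved {
		if _, err := s.get(a.fingerprint); errors.Is(err, errNotFound) {
			log.Println(`msg="ignore resolved alert since there is no corresponding record in the store"`)
			return
		}
	}
	s.byFingerprint[a.fingerprint] = a
}

func main() {
	s := &store{byFingerprint: map[string]*alert{}}
	s.put(&alert{fingerprint: "alert-a", resolved: true}) // orphan resolved alert: ignored
}
```

The trade-off, presumably, is that a resolved alert whose firing counterpart was never recorded is dropped as well, on the grounds that a resolution for an alert that was never notified as firing carries no information for the receiver.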

damnever · Apr 14 '23

@simonpasquier, could you please take another look at this? We have been running this in our production environment for over a year. I am confident that it has not introduced any side effects. It has resolved the issues I previously mentioned.

damnever · Dec 06 '23