feat(dispatch): add start delay
This change adds a new cmd flag --dispatch.start-delay which corresponds to the --rules.alert.resend-delay flag in Prometheus.
This flag controls the minimum amount of time that Prometheus waits before resending an alert to Alertmanager.
By adding this value to the start time of Alertmanager, we delay the aggregation groups' first flush, until we are confident all alerts are resent by Prometheus instances.
This should help avoid race conditions in inhibitions after a (re)start.
Other improvements:
- remove hasFlushed flag from aggrGroup
- remove mutex locking from aggrGroup
there's no guarantee that
group_waitis the right duration to wait after a restart.
That's what I was thinking as well: some people may even have a group_wait: 1d for low-prio grouped alerts. Then you would never get any alerts if you restarted Alertmanager once a day, right?
I'm dropping the WaitOnStartup as we never used it internally and based on the comments it can be tricky if user uses a long group_wait value.
We are now failing this test which is vague and I remember debugging before but not documenting: https://github.com/prometheus/alertmanager/blob/aa879e10d1a254a51e8c6791c7fcec2b9fa530db/test/with_api_v2/acceptance/send_test.go#L422-L466
We are now failing this test which is vague and I remember debugging before but not documenting:
https://github.com/prometheus/alertmanager/blob/aa879e10d1a254a51e8c6791c7fcec2b9fa530db/test/with_api_v2/acceptance/send_test.go#L422-L466
Maybe checking the hasFlushed condition would help this test too? After all that should exactly prevent from notifying twice?
Maybe checking the hasFlushed condition would help this test too? After all that should exactly prevent from notifying twice?
So the problem is not duplicate notification but earlier than expected notification:
interval [2,2.5]
---
- &{map[] 0001-01-01T00:00:00.000Z <nil> [] 0001-01-01T00:00:01.000Z <nil> <nil> { map[alertname:test1]}}[-9.223372036854776e+09:]
[ ✓ ]
interval [9,9.5]
---
- &{map[] 0001-01-01T00:00:00.000Z <nil> [] 0001-01-01T00:00:01.000Z <nil> <nil> { map[alertname:test1]}}[-9.223372036854776e+09:]
- &{map[] 0001-01-01T00:00:00.000Z <nil> [] 0001-01-01T00:00:04.000Z <nil> <nil> { map[alertname:test2]}}[-9.223372036854776e+09:]
[ ✗ ]
received:
@ 2.00549375
- &{map[] 0001-01-01T00:00:00.000Z <nil> [] 2025-11-07T15:47:48.705+01:00 <nil> <nil> { map[alertname:test1]}}[1.002707:]
@ 4.009307375
- &{map[] 0001-01-01T00:00:00.000Z <nil> [] 2025-11-07T15:47:48.705+01:00 <nil> <nil> { map[alertname:test1]}}[1.002707:]
- &{map[] 0001-01-01T00:00:00.000Z <nil> [] 2025-11-07T15:47:51.706+01:00 <nil> <nil> { map[alertname:test2]}}[4.003107:]
hasFlashed is dropped now since Dispatcher controls that with status checks after creating an AG.
This also means that AGs are now lock-free.
It seems I broke acceptance tests again, I'll check and fix them.
Should we add this one to the v0.30.0 project? Look pretty far along to me
Should we add this one to the v0.30.0 project? Look pretty far along to me
I think it can make it for that release, all related comments are resolved. I'll wait for a final review by @Spaceman1701 and @ultrotter