alertmanager feat(dispatch): add start delay

This change adds a new cmd flag --dispatch.start-delay which corresponds to the --rules.alert.resend-delay flag in Prometheus. This flag controls the minimum amount of time that Prometheus waits before resending an alert to Alertmanager.

By adding this value to the start time of Alertmanager, we delay the aggregation groups' first flush, until we are confident all alerts are resent by Prometheus instances.

This should help avoid race conditions in inhibitions after a (re)start.

Other improvements:

remove hasFlushed flag from aggrGroup
remove mutex locking from aggrGroup

Nov 06 '25 14:11 siavashs

there's no guarantee that group_wait is the right duration to wait after a restart.

That's what I was thinking as well: some people may even have a group_wait: 1d for low-prio grouped alerts. Then you would never get any alerts if you restarted Alertmanager once a day, right?

Nov 07 '25 08:11 juliusv

I'm dropping the WaitOnStartup as we never used it internally and based on the comments it can be tricky if user uses a long group_wait value.

Nov 07 '25 13:11 siavashs

We are now failing this test which is vague and I remember debugging before but not documenting: https://github.com/prometheus/alertmanager/blob/aa879e10d1a254a51e8c6791c7fcec2b9fa530db/test/with_api_v2/acceptance/send_test.go#L422-L466

Nov 07 '25 13:11 siavashs

We are now failing this test which is vague and I remember debugging before but not documenting:

https://github.com/prometheus/alertmanager/blob/aa879e10d1a254a51e8c6791c7fcec2b9fa530db/test/with_api_v2/acceptance/send_test.go#L422-L466

Maybe checking the hasFlushed condition would help this test too? After all that should exactly prevent from notifying twice?

Nov 07 '25 13:11 ultrotter

Maybe checking the hasFlushed condition would help this test too? After all that should exactly prevent from notifying twice?

So the problem is not duplicate notification but earlier than expected notification:

        interval [2,2.5]
        ---
        - &{map[] 0001-01-01T00:00:00.000Z <nil> [] 0001-01-01T00:00:01.000Z <nil> <nil> { map[alertname:test1]}}[-9.223372036854776e+09:]
          [ ✓ ]
        interval [9,9.5]
        ---
        - &{map[] 0001-01-01T00:00:00.000Z <nil> [] 0001-01-01T00:00:01.000Z <nil> <nil> { map[alertname:test1]}}[-9.223372036854776e+09:]
        - &{map[] 0001-01-01T00:00:00.000Z <nil> [] 0001-01-01T00:00:04.000Z <nil> <nil> { map[alertname:test2]}}[-9.223372036854776e+09:]
          [ ✗ ]

        received:
        @ 2.00549375
        - &{map[] 0001-01-01T00:00:00.000Z <nil> [] 2025-11-07T15:47:48.705+01:00 <nil> <nil> { map[alertname:test1]}}[1.002707:]
        @ 4.009307375
        - &{map[] 0001-01-01T00:00:00.000Z <nil> [] 2025-11-07T15:47:48.705+01:00 <nil> <nil> { map[alertname:test1]}}[1.002707:]
        - &{map[] 0001-01-01T00:00:00.000Z <nil> [] 2025-11-07T15:47:51.706+01:00 <nil> <nil> { map[alertname:test2]}}[4.003107:]

Nov 07 '25 14:11 siavashs

hasFlashed is dropped now since Dispatcher controls that with status checks after creating an AG. This also means that AGs are now lock-free.

Nov 10 '25 14:11 siavashs

It seems I broke acceptance tests again, I'll check and fix them.

Nov 13 '25 09:11 siavashs

Should we add this one to the v0.30.0 project? Look pretty far along to me

Nov 14 '25 16:11 TheMeier

Should we add this one to the v0.30.0 project? Look pretty far along to me

I think it can make it for that release, all related comments are resolved. I'll wait for a final review by @Spaceman1701 and @ultrotter

Nov 14 '25 18:11 siavashs