alertmanager icon indicating copy to clipboard operation
alertmanager copied to clipboard

feat(dispatch): add start delay

Open siavashs opened this issue 2 months ago • 9 comments

This change adds a new cmd flag --dispatch.start-delay which corresponds to the --rules.alert.resend-delay flag in Prometheus. This flag controls the minimum amount of time that Prometheus waits before resending an alert to Alertmanager.

By adding this value to the start time of Alertmanager, we delay the aggregation groups' first flush, until we are confident all alerts are resent by Prometheus instances.

This should help avoid race conditions in inhibitions after a (re)start.

Other improvements:

  • remove hasFlushed flag from aggrGroup
  • remove mutex locking from aggrGroup

siavashs avatar Nov 06 '25 14:11 siavashs

there's no guarantee that group_wait is the right duration to wait after a restart.

That's what I was thinking as well: some people may even have a group_wait: 1d for low-prio grouped alerts. Then you would never get any alerts if you restarted Alertmanager once a day, right?

juliusv avatar Nov 07 '25 08:11 juliusv

I'm dropping the WaitOnStartup as we never used it internally and based on the comments it can be tricky if user uses a long group_wait value.

siavashs avatar Nov 07 '25 13:11 siavashs

We are now failing this test which is vague and I remember debugging before but not documenting: https://github.com/prometheus/alertmanager/blob/aa879e10d1a254a51e8c6791c7fcec2b9fa530db/test/with_api_v2/acceptance/send_test.go#L422-L466

siavashs avatar Nov 07 '25 13:11 siavashs

We are now failing this test which is vague and I remember debugging before but not documenting:

https://github.com/prometheus/alertmanager/blob/aa879e10d1a254a51e8c6791c7fcec2b9fa530db/test/with_api_v2/acceptance/send_test.go#L422-L466

Maybe checking the hasFlushed condition would help this test too? After all that should exactly prevent from notifying twice?

ultrotter avatar Nov 07 '25 13:11 ultrotter

Maybe checking the hasFlushed condition would help this test too? After all that should exactly prevent from notifying twice?

So the problem is not duplicate notification but earlier than expected notification:

        interval [2,2.5]
        ---
        - &{map[] 0001-01-01T00:00:00.000Z <nil> [] 0001-01-01T00:00:01.000Z <nil> <nil> { map[alertname:test1]}}[-9.223372036854776e+09:]
          [ ✓ ]
        interval [9,9.5]
        ---
        - &{map[] 0001-01-01T00:00:00.000Z <nil> [] 0001-01-01T00:00:01.000Z <nil> <nil> { map[alertname:test1]}}[-9.223372036854776e+09:]
        - &{map[] 0001-01-01T00:00:00.000Z <nil> [] 0001-01-01T00:00:04.000Z <nil> <nil> { map[alertname:test2]}}[-9.223372036854776e+09:]
          [ ✗ ]

        received:
        @ 2.00549375
        - &{map[] 0001-01-01T00:00:00.000Z <nil> [] 2025-11-07T15:47:48.705+01:00 <nil> <nil> { map[alertname:test1]}}[1.002707:]
        @ 4.009307375
        - &{map[] 0001-01-01T00:00:00.000Z <nil> [] 2025-11-07T15:47:48.705+01:00 <nil> <nil> { map[alertname:test1]}}[1.002707:]
        - &{map[] 0001-01-01T00:00:00.000Z <nil> [] 2025-11-07T15:47:51.706+01:00 <nil> <nil> { map[alertname:test2]}}[4.003107:]

siavashs avatar Nov 07 '25 14:11 siavashs

hasFlashed is dropped now since Dispatcher controls that with status checks after creating an AG. This also means that AGs are now lock-free.

siavashs avatar Nov 10 '25 14:11 siavashs

It seems I broke acceptance tests again, I'll check and fix them.

siavashs avatar Nov 13 '25 09:11 siavashs

Should we add this one to the v0.30.0 project? Look pretty far along to me

TheMeier avatar Nov 14 '25 16:11 TheMeier

Should we add this one to the v0.30.0 project? Look pretty far along to me

I think it can make it for that release, all related comments are resolved. I'll wait for a final review by @Spaceman1701 and @ultrotter

siavashs avatar Nov 14 '25 18:11 siavashs