alertmanager Fixes early notifications after config reload (#2492)

Add an acceptance test that triggers a config reload and verifies that no early notification is occurring.

One of the challenges is which time to use to check for a previous notification. The nflog captures about the time all notifications were sent. That conflicts with the ag.next timer that get's reset before the ag is being flushed. Delays and retries can make these two point in time be different enough for the integration tests to fail.

I considered the following ways to fix it:

1.) Modify the nflog.proto to capture the flush time in addition to the successful notification time. 2.) In addition to hasFlushed capture the last flush time and pass it to the context like we do for Now. 3.) Set hashFlushed and reset the timer after the flush has been done.

I started with #3 as it seemeded to have the fewest downsides with things like drift.

needsUpdate is based on: https://github.com/prometheus/alertmanager/pull/3074#issuecomment-1508869884

May 05 '24 09:05 zecke

I was trying to find an answer of why we set hasFlushed before we actually started flushing (leave alone finished it).

The flag was introduced in (#1301) to mitigate duplicate notifications by avoiding to execute ag.run() more frequently.

With this change the problem is solved inside the DedupStage. A ag.next.Reset(0) will flush the group early but as it happens before the groupInterval should not lead to extra notifications. The only visible impact should be jitter. If this matters we can replace the timer reset with a write into a channel.

May 05 '24 09:05 zecke

Any chance we can make progress here?

Sep 11 '24 12:09 zecke