Emails are resent on config reload if group contents have changed
What did you do?
Write a route with a group match, of the form:
group_by:
- alertname
group_wait: 10s
group_interval: 1h
repeat_interval: 4h
Perform these operations, in order:
amtool alert add alertname=test tag=1
sleep 11
# observe the notification being sent
amtool alert add alertname=test tag=2
killall -HUP alertmanager
What did you expect to see?
Notifications should be resent one hour later, when group_interval expires.
What did you see instead? Under which circumstances?
Notifications for tag=1 and tag=2 are resent immediately when SIGHUP is received.
This mostly shows up with the email receiver, because the other receivers can deduplicate on group_key, which hides the problem.
Environment
- Alertmanager version: 0.21
So, here's what's going on:
Grouping is implemented in the dispatcher, while deduplication is implemented in the notification pipeline. The notification pipeline is stateful, via nflog. The dispatcher only keeps its state in memory.
Config reloading is implemented by completely stopping the dispatcher and constructing a new one from config. This also clears all aggregation group state.
When the config is reloaded, the new dispatcher recreates all the aggregation groups from the in-memory alert state. This results in dispatching all the events into the notification pipeline again.
If the set of alerts is unchanged since the last notification, the DedupStage in the notification pipeline will suppress them, based on the contents of nflog, respecting repeat_interval as expected. However, if there has been any change to the set of alerts in the group, DedupStage no longer treats the notification as a duplicate, and it is sent to the receiver.
If the receiver is anything other than email, the group key and timestamps will not have changed so the duplicate notifications can be suppressed. The email receiver is unable to do this (because it's email...) so duplicate emails are sent.
Half of this story is taken up by explaining why we don't see this happen all the time, but the grouping functionality doesn't really work across config reloads, and probably also behaves badly across cluster gossip. This isn't noticed in most configurations because the other features are effective at hiding the problem.
I'm thinking that the correct solution here would be for aggregation grouping to be implemented in the notification pipeline based on nflog. However, a crude and immediate solution is likely achievable by having the dispatcher query nflog when creating an aggregation group, and adjusting its timings if this group key was successfully sent in the last group_interval; this would solve the config reload case, but I'm not sure it's the right way to go in the presence of cluster gossip.
Opinions?
What about group_wait being larger than the time between config reloads? Moving aggregation down into the notification pipeline seems feasible, but so does propagating the state from the old dispatcher to the new one.