
change the deduplication id for the sns receiver

Open qinxx108 opened this issue 2 years ago • 4 comments

Change the deduplication id to use groupKey + now from the context. now is generated in https://github.com/prometheus/alertmanager/blob/main/dispatch/dispatch.go#L442 and should be different for each flush.

This change should fix the following cases:

  1. Users set repeat_interval to less than 5m, but SNS deduplicates the messages even though the users want to receive them at an interval shorter than 5m.
  2. Users are unable to receive the resolved message when an alert is resolved within 5m.

qinxx108 avatar Apr 26 '22 00:04 qinxx108

I suspect that this will break Alertmanager HA functionality, as Now will not be unique across the cluster.

roidelapluie avatar May 03 '22 09:05 roidelapluie

> I suspect that this will break Alertmanager HA functionality, as Now will not be unique across the cluster.

I'm not sure this would be the case. The dedup stage (where we look at the notification log) happens before the retry stage (where we execute the Notify function), so by the time we get to use now, we've already determined that we need to notify.

https://github.com/prometheus/alertmanager/blob/a38c5b8f1d780ce042a53a217af8c56316ed3071/notify/notify.go#L359-L365

In principle, the change seems safe. WDYT @roidelapluie?

@qinxx108 is there any chance you can provide us with a test account? I feel that to review/approve this change we'd need to test it against SNS.

gotjosh avatar May 03 '22 09:05 gotjosh

@roidelapluie @gotjosh Following up on our previous discussion, I wonder if we have had a chance to test this out? Thanks a lot for the help!

qinxx108 avatar Jun 20 '22 16:06 qinxx108

I've tested this by creating a receiver that has both a webhook and SNS config with a repeat interval of 1m.

For the webhook, I get a new webhook every minute, and for SNS I get a new message_id and sequence number for each send. This was not the case when I tried it out without this change.

ts=2022-06-30T16:34:08.710Z caller=sns.go:94 level=debug integration=sns msg="SNS message successfully published" message_id=ea06867b-379c-5469-8c6c-dd4ce55e260a sequencenumber=10000000000000019000
ts=2022-06-30T16:35:38.416Z caller=sns.go:94 level=debug integration=sns msg="SNS message successfully published" message_id=06b00bb4-0eb2-5913-acc2-5083ddae2675 sequencenumber=10000000000000020000

gotjosh avatar Jun 30 '22 16:06 gotjosh