alertmanager fix: alerts duplication in HA cluster mode

#4152

This fixes duplicate notifications in HA cluster mode and stops the exact same state from being broadcast several times.

!cluster.OversizedMessage(b) is not necessary here:

On current release version (0.27.0)

To sum up: the send func already performs the same check at the step where it chooses how to send the data, TCP or UDP
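The size check mentioned above can be pictured with a minimal sketch. This is not the Alertmanager source: `maxGossipPacketSize`, `oversized` and `chooseTransport` are hypothetical names, the threshold value is an assumption, and the mapping of large payloads to TCP and small ones to UDP is inferred from the description.

```go
package main

import "fmt"

// maxGossipPacketSize is an assumed value used only for illustration;
// the real limit is defined in Alertmanager's cluster package.
const maxGossipPacketSize = 1400

// oversized reports whether a payload is too large for the gossip path.
// The "half of the packet size" rule is an assumption of this sketch.
func oversized(b []byte) bool {
	return len(b) > maxGossipPacketSize/2
}

// chooseTransport picks the delivery path from the payload size alone:
// large payloads go over TCP, small ones over UDP gossip.
func chooseTransport(b []byte) string {
	if oversized(b) {
		return "tcp"
	}
	return "udp"
}

func main() {
	small := make([]byte, 100)
	large := make([]byte, 4096)
	fmt.Println(chooseTransport(small), chooseTransport(large))
}
```

If the sender already makes this decision, repeating the size test on the merge side only decides whether a message is passed on at all, which is the point under discussion.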

To test this fix, a test stand was set up:

Several vmalert instances send alerts to 2 Alertmanager HA clusters, one with the patch and one without it
Each Alertmanager cluster knows only about its own nodes
Both Alertmanager clusters send to a single self-developed receiver
The receiver counts duplicate alerts (delta < 1 minute) and other stats

Test results with incoming alerts that have large bodies:

{
  "main": {
    "alerts_last_minute": 2,
    "alerts_total": 10864,
    "duplicates_total": 2887
  },
  "patched": {
    "alerts_last_minute": 2,
    "alerts_total": 7579,
    "duplicates_total": 0
  }
}
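To put the counters above in proportion, here is a small sketch that parses the same JSON and computes the share of duplicates per cluster. The `stats` type and the helper names are hypothetical and not part of the receiver.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// stats mirrors the per-cluster fields of the report above.
type stats struct {
	AlertsLastMinute int `json:"alerts_last_minute"`
	AlertsTotal      int `json:"alerts_total"`
	DuplicatesTotal  int `json:"duplicates_total"`
}

// report is the receiver output quoted in the PR description.
const report = `{
  "main":    {"alerts_last_minute": 2, "alerts_total": 10864, "duplicates_total": 2887},
  "patched": {"alerts_last_minute": 2, "alerts_total": 7579,  "duplicates_total": 0}
}`

// parseReport decodes the report into a map keyed by cluster name.
func parseReport(raw string) (map[string]stats, error) {
	var out map[string]stats
	err := json.Unmarshal([]byte(raw), &out)
	return out, err
}

// duplicateShare returns duplicates as a fraction of all received alerts.
func duplicateShare(s stats) float64 {
	if s.AlertsTotal == 0 {
		return 0
	}
	return float64(s.DuplicatesTotal) / float64(s.AlertsTotal)
}

func main() {
	clusters, err := parseReport(report)
	if err != nil {
		panic(err)
	}
	for _, name := range []string{"main", "patched"} {
		fmt.Printf("%s: %.1f%% duplicates\n", name, 100*duplicateShare(clusters[name]))
	}
}
```

For these numbers, roughly 26.6% of the notifications from the unpatched cluster were duplicates, against none from the patched one.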

Dec 05 '24 10:12 nip-was-here

I stumbled across this fix because we were sometimes duplicating notifications in Slack. Are there any plans to review this change?

Dec 17 '24 16:12 MarcWort

Hi, thanks for the contribution. Do you think you could add a unit test that demonstrates the old bug and guards against future regressions of this issue?

Jan 16 '25 16:01 jan--f

Hi, @jan--f

It's a bit complicated: to find this, I ran 3 Alertmanager nodes, one of them under a debugger.

It appears only in cluster mode, when a large notification log is exchanged between nodes.

Jan 16 '25 16:01 nip-was-here

!cluster.OversizedMessage(b) is not necessary here

IIUC the original code avoids gossiping oversized messages since they have already been distributed. So won't this change create a duplicate gossip message for oversized messages?

Jan 16 '25 16:01 jan--f

a duplicate gossip message for oversized messages

Currently, large messages are not synced between nodes (this is the cause of the duplicates at the receiver), and the other logs are flooded: each one is broadcast n times, where n is the number of logs in one announcement between nodes.

Jan 16 '25 16:01 nip-was-here

With my change, every log is announced onward exactly once.
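The difference between broadcasting n times and broadcasting once can be shown with a toy model. This is not the patch and not the Alertmanager merge code: `entry`, `mergePerEntry`, `mergeOnce` and `countBroadcasts` are hypothetical names, and the model ignores timestamps, message size and the network.

```go
package main

import "fmt"

// entry is a hypothetical stand-in for one notification-log entry.
type entry struct {
	key string
}

// mergePerEntry models the flood: the whole received payload is
// re-broadcast once for every entry that was new to this node.
func mergePerEntry(received []entry, known map[string]bool, broadcast func([]entry)) {
	for _, e := range received {
		if !known[e.key] {
			known[e.key] = true
			broadcast(received)
		}
	}
}

// mergeOnce merges all entries first and re-broadcasts the payload
// a single time if at least one entry was new.
func mergeOnce(received []entry, known map[string]bool, broadcast func([]entry)) {
	changed := false
	for _, e := range received {
		if !known[e.key] {
			known[e.key] = true
			changed = true
		}
	}
	if changed {
		broadcast(received)
	}
}

// countBroadcasts runs a merge strategy against an empty local state
// and returns how many times it called broadcast.
func countBroadcasts(merge func([]entry, map[string]bool, func([]entry)), received []entry) int {
	n := 0
	merge(received, map[string]bool{}, func([]entry) { n++ })
	return n
}

func main() {
	payload := []entry{{"a"}, {"b"}, {"c"}}
	fmt.Println("per entry:", countBroadcasts(mergePerEntry, payload))
	fmt.Println("once:", countBroadcasts(mergeOnce, payload))
}
```

With three new entries in one announcement, the per-entry strategy broadcasts three times and the single-broadcast strategy once.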

Jan 16 '25 16:01 nip-was-here