fix: alerts duplication in HA cluster mode
#4152
There is fix for duplicates in HA cluster mode & reduce for broadcasting several times exactly same state.
!cluster.OversizedMessage(b) is not necessary here:
On current release version (0.27.0)
- Using broadcast
- Setup broadcast func
- Call of SetBroadcast
- Setup Channel Broadcast
- Inside of AddState
- Broadcast func for Channel
- Choose way for send data in func
- Small state broadcasting
- Big state broadcasting
To sum up: at step Choose way for send data in func there's same check with choosing way of sending data - TCP or UDP
For testing this fix, stand was created:
- Several vmalert instances send alerts to 2 HA clusters of Alertmanager - with patch & without it
- Each Alertmanager cluster know only about his own nodes
- Both clusters of Alertmanager send to one self-developed receiver
- Receiver calculates duplicates of alerts with delta < 1 minute and other stats
Results of tests with big by body incoming alerts:
{
"main": {
"alerts_last_minute": 2,
"alerts_total": 10864,
"duplicates_total": 2887
},
"patched": {
"alerts_last_minute": 2,
"alerts_total": 7579,
"duplicates_total": 0
}
}
I stumbled across this fix because we were sometimes duplicating notifications in Slack. Are there any plans to review this change?
Hi, thanks for the contribution. Do you think you could add a unit test that demonstrates the old bug and guards against future regressions of this issue?
Hi, @jan--f
It's a bit complicated - during finding this, I run 3 nodes of Alertmanager, one of them in debugger
It's appears only on cluster mode with big notification log size between nodes
!cluster.OversizedMessage(b) is not necessary here
Iiuc the original code avoids gossiping oversized messages since they have already been distributed. So won't this change create a dupplicate gossip message for oversized messages?
a dupplicate gossip message for oversized messages
Currently, there's no sync between nodes with big messages (reason of duplication on receiver) and flood by broadcasting n times other logs (n - how much logs there's in one announce between nodes)
In my solution, all logs will be announced further once