Alert duplication in HA mode
What did you do?
Stack used: multiple vmalert (VictoriaMetrics) and several Alertmanager instances in HA cluster mode.
When there is a large number of alerts, duplicate notifications for some alerts begin to appear at the Alertmanager receivers.
What did you expect to see?
Stable deduplication in HA cluster mode, with only one notification received per unique alert across all vmalert instances.
What did you see instead? Under which circumstances?
Some alerts were received multiple times.
Environment
- System information:
The issue occurs both on VMs with Debian 12 amd64 and within K8s.
- Alertmanager version:
First observed on 0.22.2, also reproduced on 0.25.0.
- Alertmanager configuration file:
Part of the systemd service template in Ansible:
--web.listen-address=":{{ alertmanager_web_listen_port }}" \
--storage.path="{{ alertmanager_bin_dir }}/data" \
--cluster.advertise-address="{{ ansible_default_ipv4.address }}:{{ alertmanager_cluster_listen_port }}" \
--cluster.listen-address="{{ ansible_default_ipv4.address }}:{{ alertmanager_cluster_listen_port }}" \
{% for node in alertmanager_cluster_nodes | sort %}
--cluster.peer="{{ node.split(':')[0] }}:{{ alertmanager_cluster_listen_port }}" \
{% endfor %}
--config.file="{{ alertmanager_conf_dir }}/alertmanager.yml" \
--web.config.file="{{ alertmanager_conf_dir }}/webconfig.yml" \
--log.level="{{ alertmanager_log_level }}"
Hi, I know this issue is quite old. Have you continued to see it with newer versions of Alertmanager? If so, providing more details about your configuration file and the alerts may help us understand what is happening.
Alertmanager nodes do not communicate alerts between themselves, but they should deduplicate notifications before sending them. We'd expect only one notification per unique alert group.
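For context, here's a minimal, self-contained model of how that deduplication is supposed to work (a simplified sketch, not the actual Alertmanager source; the needsUpdate name, the key format, and the hashing are loose stand-ins): each node records what it has already notified per receiver and group in the notification log, that log is gossiped between peers, and a node skips sending when the log already shows an equivalent recent notification. Gaps in gossiping that log are exactly what would show up as duplicates.

```go
package main

import (
	"fmt"
	"time"
)

// Simplified model of HA deduplication (not the actual Alertmanager source):
// peers gossip a notification log of what has already been sent per
// (receiver, group), and each node consults it before sending. If the log
// already records an equivalent notification recently enough, the node skips
// sending; missing or late log entries are what turn into duplicates.
type logEntry struct {
	firingHash uint64    // stand-in for a hash of the currently firing alerts
	sentAt     time.Time // when some peer last notified for this group
}

type notificationLog map[string]logEntry // key: receiver + group key

func needsUpdate(nl notificationLog, key string, firingHash uint64, now time.Time, repeatInterval time.Duration) bool {
	prev, ok := nl[key]
	if !ok {
		return true // nothing sent for this group yet
	}
	if prev.firingHash != firingHash {
		return true // the set of firing alerts changed
	}
	return now.Sub(prev.sentAt) >= repeatInterval // otherwise only re-send after repeat_interval
}

func main() {
	nl := notificationLog{
		"webhook/grp1": {firingHash: 42, sentAt: time.Now().Add(-time.Minute)},
	}
	// Another peer notified for the same firing alerts a minute ago,
	// so this node should stay quiet.
	fmt.Println(needsUpdate(nl, "webhook/grp1", 42, time.Now(), 4*time.Hour)) // false
}
```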
Hi @Spaceman1701, yep, it's caused by the source code, as addressed in the fix in #4153.
It doesn't depend on the configuration files or the content of the alerts, only on the size of the notification log state exchanged between the cluster nodes in HA mode.
To be clear about my PR and the status of the code:
- This line of code from #1415 disables the re-gossip part of the deduplication mechanics in HA mode on each node whenever the notification log state received from another node (b) is bigger than 700 bytes (so TCP is not used between nodes in that case). It has been like this since it was merged 7 years ago, and the same check is still present in the latest release: v0.29.0 nflog.go (see the sketch just below this list).
- In addition, the whole state byte array is redundantly broadcast again every time one of its entries is merged: v0.29.0 nflog.go
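A minimal, self-contained sketch of that behavior as I read it (a simplified model, not the actual Alertmanager source; the 700-byte figure corresponds to MaxGossipPacketSize (1400) / 2 in the cluster package):

```go
package main

import "fmt"

// Simplified model of the re-gossip gate described above; not the real
// nflog.go, just the shape of the logic being discussed.
const maxGossipPacketSize = 1400

func oversizedMessage(b []byte) bool {
	return len(b) > maxGossipPacketSize/2 // 700 bytes
}

type state map[string][]byte // group key -> serialized entry (simplified)

// merge returns true if the entry was new for this node.
func (s state) merge(key string, entry []byte) bool {
	if _, ok := s[key]; ok {
		return false // already known, nothing new to re-gossip
	}
	s[key] = entry
	return true
}

// mergeRemoteState models the Merge path: every merged entry triggers a
// re-broadcast of the *whole* received blob b, but only if b is small
// enough; oversized blobs are never re-gossiped at all.
func mergeRemoteState(local state, b []byte, entries map[string][]byte, broadcast func([]byte)) {
	for key, e := range entries {
		if merged := local.merge(key, e); merged && !oversizedMessage(b) {
			broadcast(b) // note: whole blob, once per merged entry
		}
	}
}

func main() {
	broadcasts := 0
	count := func([]byte) { broadcasts++ }

	small := make([]byte, 200) // under the 700-byte threshold
	mergeRemoteState(state{}, small, map[string][]byte{"grp1": {1}, "grp2": {2}}, count)
	fmt.Println("re-broadcasts of a small state:", broadcasts) // 2: whole blob, once per merged entry

	broadcasts = 0
	big := make([]byte, 800) // over the threshold: never re-gossiped
	mergeRemoteState(state{}, big, map[string][]byte{"grp1": {1}, "grp2": {2}}, count)
	fmt.Println("re-broadcasts of an oversized state:", broadcasts) // 0
}
```

With a state blob over 700 bytes the entries are still merged locally, but nothing is re-gossiped; under 700 bytes the whole blob is re-broadcast once per merged entry.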
In #4153 I suggest doing a single broadcast to reduce network traffic, and not relying on the size of the notification log state array at all, because later in the code flow there is already a decision on how to send it to the other nodes, by UDP or TCP, based on that same size threshold (700 bytes).
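Roughly, the send-side decision I mean looks like this (again a simplified model, not the real cluster package; in the actual code the TCP branch corresponds to SendReliable and the threshold to MaxGossipPacketSize/2):

```go
package main

import "fmt"

const maxGossipPacketSize = 1400

func oversizedMessage(b []byte) bool {
	return len(b) > maxGossipPacketSize/2
}

// broadcastToPeers models the broadcast path: small payloads ride the regular
// memberlist UDP gossip, oversized payloads are handed off to be sent to
// peers over TCP instead. The transport choice is made here, on send.
func broadcastToPeers(b []byte, gossipUDP, sendReliableTCP func([]byte)) {
	if oversizedMessage(b) {
		sendReliableTCP(b)
	} else {
		gossipUDP(b)
	}
}

func main() {
	via := func(name string) func([]byte) {
		return func(b []byte) { fmt.Printf("%d bytes sent via %s\n", len(b), name) }
	}
	broadcastToPeers(make([]byte, 100), via("UDP gossip"), via("TCP (SendReliable)"))
	broadcastToPeers(make([]byte, 900), via("UDP gossip"), via("TCP (SendReliable)"))
}
```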
When there are several alert producers (for example, 3 VictoriaMetrics vmalert binaries) and 3 Alertmanager nodes in HA cluster mode, there can be 9 copies of an alert arriving at fairly close times (due to clock differences between the vmalert hosts), and each of them triggers the sending of notification log state arrays. If the Alertmanager nodes don't broadcast all of the duplicated states between themselves for the deduplication mechanics, the receiver gets several notifications from different nodes instead of one, whenever that state ends up in a binary array that is too big and doesn't pass the 700-byte check.
Interesting, thanks for the detailed response. I've looked over #4153, and I think I understand what it's looking to fix.
There are still a few bits I don't understand. I think the reason the comment says oversized messages are "sent to all nodes already" is that Log unconditionally passes messages to broadcast on every call:
https://github.com/prometheus/alertmanager/blob/6e4e2287be7d989b67948bb0775b6166f07af0db/nflog/nflog.go#L403-L408
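To illustrate what I mean by "unconditionally passes to broadcast", here's a minimal model of that path (my paraphrase of how I read it, not the actual source):

```go
package main

import "fmt"

// Simplified model (not the actual nflog source) of the originating node's
// path: the new entry is merged into local state and its serialized form is
// handed to broadcast with no size check, so single-entry messages from the
// node that wrote them always go out, however large the aggregate state is.
type entry struct {
	groupKey string
	payload  []byte
}

type nflogModel struct {
	state     map[string]entry
	broadcast func([]byte)
}

func (l *nflogModel) Log(groupKey string, payload []byte) {
	e := entry{groupKey: groupKey, payload: payload}
	l.state[groupKey] = e  // merge into local state
	l.broadcast(e.payload) // always broadcast the single serialized entry
}

func main() {
	sent := 0
	l := &nflogModel{state: map[string]entry{}, broadcast: func([]byte) { sent++ }}
	l.Log("grp1", make([]byte, 2000)) // even a large single entry is broadcast
	fmt.Println("broadcasts from Log:", sent) // 1
}
```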
However, I think you're right that this can cause some weirdness in the case where the oversized message needs to be merged and the merge result is different from what was in the message. I think this can only happen if Alertmanager is already partitioned, but that might be wrong.
These single-message broadcasts are also unlikely to be oversized; the oversized case is probably mostly push/pull messages where the full state is serialized. I need to understand the semantics of memberlist a little better before I can say for sure whether that's the right behavior...
Are you able to consistently reproduce this? It might be worth playing with some of the gossip parameters to see if we can figure out what kind of messaging pattern is causing the problem.
It was consistent before my fix, yep
A large number of incoming alerts from several identical vmalert instances (run for fault tolerance) arriving at each of the Alertmanager nodes led to random notifications being duplicated at the receiver.
When I investigated this problem, I captured .pcaps of ingress and egress traffic on all Alertmanager nodes in HA mode, analyzed them with the scapy Python library up to OSI L7 together with the JSON alert bodies and the notifications delivered to receivers, and then ran a 3-node Alertmanager HA cluster with one node under a debugger. All of that led me to these lines of code: if the Alertmanager nodes receive a large number of alerts, some of them are not broadcast further and therefore not deduplicated, because of the redundant !cluster.OversizedMessage(b) check.
After I patched the source code and compiled it, I tested the same case over a long period; one of the test results is included in the PR.
Test scheme:
3x vmalert -> 2x Alertmanager HA clusters with 3 nodes each (one cluster without the patch, the other with it) -> an additional receiver for all notifications, added by modifying the config in each Alertmanager cluster (the receiver is self-developed in Python and does simple JSON analysis; it is mentioned, with stats, in PR #4153)
So there were 18 (3*2*3) incoming messages from vmalert, 9 per Alertmanager HA cluster.
And when 9 * (a large number of simultaneously firing alerts) hit the clusters, the unpatched one produced 2-3 identical notifications instead of just 1.
I think I'll have to try to replicate this locally before I can understand more. Upon review of the code, it really does seem like oversized messages are being broadcast correctly, unless SendReliable isn't working properly. I'd also expect this problem to happen basically all the time if it were broken: a message that's oversized is always oversized, so the speed at which it's received shouldn't matter.
Is it possible that the burst of messages is causing the cluster to briefly enter a split-brain state? One way to test this is by increasing --cluster.peer-timeout to a much higher value (e.g. 120s) and measuring the number of duplicated notifications. If this reduces the duplication, it probably means Alertmanager is just getting behind in gossiping notification log messages.