Alertmanagers in HA mode sometimes go out of sync
I am running two Prometheus servers that send alerts to two Alertmanagers running in HA mode. The Alertmanagers sometimes go out of sync (peering is lost) and alerts are sent twice.
Sometimes only one peer is shown on the /status page of Alertmanager.
- System information:
Linux 4.14.106-97.85.amzn2.x86_64 x86_64
- Alertmanager version:
alertmanager, version 0.21.0 (branch: HEAD, revision: 4c6c03ebfe21009c546e4d1e9b92c371d67c021d)
build user: root@dee35927357f
build date: 20200617-08:54:02
go version: go1.14.4
- Alertmanager Args:
--config.file=/etc/alertmanager/alertmanager.yml
--storage.path=/alertmanager
--cluster.peer=prod-prometheus01.prod:9094
--cluster.peer=prod-prometheus02.prod:9094
- Prometheus 1/2 Config:
global:
  scrape_interval: 60s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  scrape_timeout: 30s      # Default is 10 seconds
  external_labels:
    prometheus: PROD-STG
    replica: prod-prometheus01/02
alerting:
  alert_relabel_configs:
    - source_labels: [replica]
      regex: (.+?)\d+
      target_label: replica
  alertmanagers:
    - static_configs:
        - targets:
            - prod-prometheus01.prod:9093
            - prod-prometheus02.prod:9093
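For reference, a brief worked sketch of what that alert_relabel_configs rule does (relabel regexes are fully anchored and the replacement defaults to $1):

alerting:
  alert_relabel_configs:
    # replica="prod-prometheus01" -> replica="prod-prometheus"
    # replica="prod-prometheus02" -> replica="prod-prometheus"
    # Both Prometheus replicas then emit identical label sets, which is what
    # lets the Alertmanager pair deduplicate the notifications.
    - source_labels: [replica]
      regex: (.+?)\d+
      target_label: replica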
- Alertmanager Log 1:
level=error ts=2020-10-01T16:55:22.073Z caller=api.go:660 component=api version=v2 path=/silence/08efc9d9-a6bf-4d7a-85f1-686f4a720264 method=GET msg="Failed to find silence" err=null id=08efc9d9-a6bf-4d7a-85f1-686f4a720264
level=error ts=2020-10-01T19:38:44.905Z caller=api.go:780 component=api version=v1 msg="API error" err="server_error: context canceled"
level=error ts=2020-10-02T16:47:53.317Z caller=api.go:780 component=api version=v1 msg="API error" err="server_error: context canceled"
level=error ts=2020-10-02T16:47:53.323Z caller=api.go:780 component=api version=v1 msg="API error" err="server_error: context canceled"
level=error ts=2020-10-05T06:06:41.295Z caller=api.go:780 component=api version=v1 msg="API error" err="server_error: context canceled"
- Alertmanager Log 2:
level=error ts=2020-10-01T16:56:02.180Z caller=api.go:660 component=api version=v2 path=/silence/c855ced3-24e8-481b-9f65-30b3a8b1631e method=GET msg="Failed to find silence" err=null id=c855ced3-24e8-481b-9f65-30b3a8b1631e
level=error ts=2020-10-03T05:26:30.922Z caller=api.go:660 component=api version=v2 path=/silence/4013daf9-9f63-4ac8-955a-be33ab00fef3 method=GET msg="Failed to find silence" err=null id=4013daf9-9f63-4ac8-955a-be33ab00fef3
level=error ts=2020-10-03T05:28:41.603Z caller=api.go:660 component=api version=v2 path=/silence/dffc0882-ea51-45fc-95ca-e6c45dec2a83 method=GET msg="Failed to find silence" err=null id=dffc0882-ea51-45fc-95ca-e6c45dec2a83
level=error ts=2020-10-05T06:06:41.295Z caller=api.go:780 component=api version=v1 msg="API error" err="server_error: context canceled"
Can you try running Alertmanager with --log.level=debug? It should provide more details about the clustering state. You can also look at the cluster metrics such as:
- alertmanager_cluster_health_score
- rate(alertmanager_cluster_peers_left_total[5m])
- rate(alertmanager_cluster_reconnections_total[5m])
- rate(alertmanager_cluster_reconnections_failed_total[5m])
- histogram_quantile(0.9, rate(alertmanager_cluster_pings_seconds_bucket[5m]))
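For reference, a minimal sketch of a Prometheus alerting rule on the first of those metrics (the group and alert names are placeholders, not from this setup); alertmanager_cluster_health_score is 0 when the cluster is healthy and grows as it degrades:

groups:
  - name: alertmanager-cluster              # placeholder group name
    rules:
      - alert: AlertmanagerClusterUnhealthy # placeholder alert name
        expr: alertmanager_cluster_health_score > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Alertmanager {{ $labels.instance }} has reported a non-zero cluster health score for 10 minutes"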
Logs
Alertmanager 1
{"log":"level=debug ts=2020-10-12T03:18:09.355Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/12 03:18:09 [DEBUG] memberlist: Stream connection from=172.25.24.112:47398\\n\"\n","stream":"stderr","time":"2020-10-12T03:18:09.355533427Z"}
{"log":"level=debug ts=2020-10-12T03:18:09.362Z caller=delegate.go:230 component=cluster received=NotifyJoin node=01EM47SSB0N6HVDH0R8V9KQQSM addr=172.17.0.4:9094\n","stream":"stderr","time":"2020-10-12T03:18:09.36244108Z"}
{"log":"level=debug ts=2020-10-12T03:18:09.362Z caller=cluster.go:470 component=cluster msg=\"peer rejoined\" peer=01EM47SSB0N6HVDH0R8V9KQQSM\n","stream":"stderr","time":"2020-10-12T03:18:09.362457331Z"}
{"log":"level=debug ts=2020-10-12T03:18:09.362Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/12 03:18:09 [WARN] memberlist: Refuting a suspect message (from: 01EM4897KH71PBVEPYRHBJZVBW)\\n\"\n","stream":"stderr","time":"2020-10-12T03:18:09.362519835Z"}
{"log":"level=debug ts=2020-10-12T03:18:09.450Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/12 03:18:09 [WARN] memberlist: Got ping for unexpected node '01EM47SSB0N6HVDH0R8V9KQQSM' from=172.17.0.4:9094\\n\"\n","stream":"stderr","time":"2020-10-12T03:18:09.450473166Z"}
{"log":"level=debug ts=2020-10-12T03:18:09.860Z caller=dispatch.go:138 component=dispatcher msg=\"Received alert\" alert=InstanceDown[4a91632][active]\n","stream":"stderr","time":"2020-10-12T03:18:09.860807971Z"}
{"log":"level=debug ts=2020-10-12T03:18:09.863Z caller=dispatch.go:138 component=dispatcher msg=\"Received alert\" alert=InstanceDown[4a91632][active]\n","stream":"stderr","time":"2020-10-12T03:18:09.863606785Z"}
{"log":"level=debug ts=2020-10-12T03:18:09.950Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/12 03:18:09 [DEBUG] memberlist: Failed ping: 01EM47SSB0N6HVDH0R8V9KQQSM (timeout reached)\\n\"\n","stream":"stderr","time":"2020-10-12T03:18:09.950447986Z"}
{"log":"level=debug ts=2020-10-12T03:18:09.950Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/12 03:18:09 [DEBUG] memberlist: Stream connection from=172.17.0.4:52498\\n\"\n","stream":"stderr","time":"2020-10-12T03:18:09.950728175Z"}
{"log":"level=debug ts=2020-10-12T03:18:09.950Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/12 03:18:09 [WARN] memberlist: Got ping for unexpected node 01EM47SSB0N6HVDH0R8V9KQQSM from=172.17.0.4:52498\\n\"\n","stream":"stderr","time":"2020-10-12T03:18:09.950782835Z"}
{"log":"level=debug ts=2020-10-12T03:18:09.950Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/12 03:18:09 [ERR] memberlist: Failed fallback ping: EOF\\n\"\n","stream":"stderr","time":"2020-10-12T03:18:09.950833513Z"}
{"log":"level=debug ts=2020-10-12T03:20:58.450Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/12 03:20:58 [INFO] memberlist: Suspect 01EM47SSB0N6HVDH0R8V9KQQSM has failed, no acks received\\n\"\n","stream":"stderr","time":"2020-10-12T03:20:58.450692075Z"}
{"log":"level=debug ts=2020-10-12T03:20:58.450Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/12 03:20:58 [WARN] memberlist: Got ping for unexpected node '01EM47SSB0N6HVDH0R8V9KQQSM' from=172.17.0.4:9094\\n\"\n","stream":"stderr","time":"2020-10-12T03:20:58.450777291Z"}
{"log":"level=debug ts=2020-10-12T03:20:58.950Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/12 03:20:58 [DEBUG] memberlist: Failed ping: 01EM47SSB0N6HVDH0R8V9KQQSM (timeout reached)\\n\"\n","stream":"stderr","time":"2020-10-12T03:20:58.950799929Z"}
{"log":"level=debug ts=2020-10-12T03:20:58.951Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/12 03:20:58 [DEBUG] memberlist: Stream connection from=172.17.0.4:42002\\n\"\n","stream":"stderr","time":"2020-10-12T03:20:58.951135054Z"}
{"log":"level=debug ts=2020-10-12T03:20:58.951Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/12 03:20:58 [WARN] memberlist: Got ping for unexpected node 01EM47SSB0N6HVDH0R8V9KQQSM from=172.17.0.4:42002\\n\"\n","stream":"stderr","time":"2020-10-12T03:20:58.951253336Z"}
{"log":"level=debug ts=2020-10-12T03:20:58.951Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/12 03:20:58 [ERR] memberlist: Failed fallback ping: EOF\\n\"\n","stream":"stderr","time":"2020-10-12T03:20:58.951346577Z"}
Alertmanager 2
{"log":"level=debug ts=2020-10-12T03:06:09.356Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/12 03:06:09 [DEBUG] memberlist: Initiating push/pull sync with: 172.25.9.134:9094\\n\"\n","stream":"stderr","time":"2020-10-12T03:06:09.358369617Z"}
{"log":"level=debug ts=2020-10-12T03:06:09.365Z caller=delegate.go:230 component=cluster received=NotifyJoin node=01EM4897KH71PBVEPYRHBJZVBW addr=172.17.0.4:9094\n","stream":"stderr","time":"2020-10-12T03:06:09.36515147Z"}
{"log":"level=debug ts=2020-10-12T03:06:09.365Z caller=cluster.go:470 component=cluster msg=\"peer rejoined\" peer=01EM4897KH71PBVEPYRHBJZVBW\n","stream":"stderr","time":"2020-10-12T03:06:09.365170926Z"}
{"log":"level=debug ts=2020-10-12T03:06:09.369Z caller=cluster.go:441 component=cluster msg=refresh result=success addr=172.25.9.134:9094\n","stream":"stderr","time":"2020-10-12T03:06:09.369281828Z"}
{"log":"level=debug ts=2020-10-12T03:06:09.371Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/12 03:06:09 [DEBUG] memberlist: Stream connection from=172.17.0.1:49618\\n\"\n","stream":"stderr","time":"2020-10-12T03:06:09.37118996Z"}
{"log":"level=debug ts=2020-10-12T03:06:09.371Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/12 03:06:09 [DEBUG] memberlist: Initiating push/pull sync with: 172.25.24.112:9094\\n\"\n","stream":"stderr","time":"2020-10-12T03:06:09.37131582Z"}
{"log":"level=debug ts=2020-10-12T03:06:09.381Z caller=cluster.go:441 component=cluster msg=refresh result=success addr=172.25.24.112:9094\n","stream":"stderr","time":"2020-10-12T03:06:09.381528838Z"}
{"log":"level=debug ts=2020-10-12T03:06:09.863Z caller=dispatch.go:138 component=dispatcher msg=\"Received alert\" alert=InstanceDown[3bd5384][active]\n","stream":"stderr","time":"2020-10-12T03:06:09.863937717Z"}
{"log":"level=debug ts=2020-10-12T03:06:10.321Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/12 03:06:10 [WARN] memberlist: Got ping for unexpected node '01EM4897KH71PBVEPYRHBJZVBW' from=172.17.0.4:9094\\n\"\n","stream":"stderr","time":"2020-10-12T03:06:10.321799816Z"}
{"log":"level=debug ts=2020-10-12T03:06:10.821Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/12 03:06:10 [DEBUG] memberlist: Failed ping: 01EM4897KH71PBVEPYRHBJZVBW (timeout reached)\\n\"\n","stream":"stderr","time":"2020-10-12T03:06:10.82188661Z"}
{"log":"level=debug ts=2020-10-12T03:06:10.821Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/12 03:06:10 [DEBUG] memberlist: Stream connection from=172.17.0.4:46186\\n\"\n","stream":"stderr","time":"2020-10-12T03:06:10.822035699Z"}
{"log":"level=debug ts=2020-10-12T03:06:10.822Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/12 03:06:10 [WARN] memberlist: Got ping for unexpected node 01EM4897KH71PBVEPYRHBJZVBW from=172.17.0.4:46186\\n\"\n","stream":"stderr","time":"2020-10-12T03:06:10.822187248Z"}
{"log":"level=debug ts=2020-10-12T03:20:55.321Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/12 03:20:55 [WARN] memberlist: Got ping for unexpected node '01EM4897KH71PBVEPYRHBJZVBW' from=172.17.0.4:9094\\n\"\n","stream":"stderr","time":"2020-10-12T03:20:55.321819879Z"}
{"log":"level=debug ts=2020-10-12T03:20:55.821Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/12 03:20:55 [DEBUG] memberlist: Failed ping: 01EM4897KH71PBVEPYRHBJZVBW (timeout reached)\\n\"\n","stream":"stderr","time":"2020-10-12T03:20:55.821787099Z"}
{"log":"level=debug ts=2020-10-12T03:20:55.821Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/12 03:20:55 [DEBUG] memberlist: Stream connection from=172.17.0.4:36124\\n\"\n","stream":"stderr","time":"2020-10-12T03:20:55.822010152Z"}
{"log":"level=debug ts=2020-10-12T03:20:55.822Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/12 03:20:55 [WARN] memberlist: Got ping for unexpected node 01EM4897KH71PBVEPYRHBJZVBW from=172.17.0.4:36124\\n\"\n","stream":"stderr","time":"2020-10-12T03:20:55.822136644Z"}
{"log":"level=debug ts=2020-10-12T03:20:55.822Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/12 03:20:55 [ERR] memberlist: Failed fallback ping: EOF\\n\"\n","stream":"stderr","time":"2020-10-12T03:20:55.822236033Z"}
Alertmanager's HTTP API is actually running on 9093, and 9094 is used for cluster peering (gossip). Docker is given the host network.
sh-4.2# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5a63c43ea310 cdn-registry-1.docker.io/prom/alertmanager:v0.21.0 "/bin/alertmanager -…" 3 days ago Up 3 days 0.0.0.0:9093-9094->9093-9094/tcp alertmanager
I suspect that the Alertmanager configuration needs IPs instead of CNAMEs.
Alertmanager Args:
--cluster.peer=prod-prometheus01.prod:9094
--cluster.peer=prod-prometheus02.prod:9094
Are you sure that you've exposed the UDP port too as mentioned in the README.md?
I have not mentioned --cluster.listen-address in the arguments for the container, so it'll use the default value of "0.0.0.0:9094".
I have exposed the Docker container's ports 9093 and 9094 with host networking.
Alertmanager Args:
--config.file=/etc/alertmanager/alertmanager.yml
--storage.path=/alertmanager
--cluster.peer=prod-prometheus01.prod:9094
--cluster.peer=prod-prometheus02.prod:9094
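For reference, the README expects the cluster port to be reachable over both TCP and UDP; a minimal docker-compose sketch of the port publishing (the service name and compose layout are illustrative, not from this setup):

version: "3.7"
services:
  alertmanager:
    image: prom/alertmanager:v0.21.0
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--cluster.peer=prod-prometheus01.prod:9094'
      - '--cluster.peer=prod-prometheus02.prod:9094'
    ports:
      - '9093:9093'        # HTTP API / web UI
      - '9094:9094'        # cluster gossip over TCP
      - '9094:9094/udp'    # cluster gossip over UDP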
Also, if you check the peer IPs, it's using the Docker IP instead of the host IP.
Alertmanager resolves the IP addresses of prod-prometheus01.prod and prod-prometheus02.prod, which is why you see IP addresses in the UI.
I had not enabled the UDP port on 9094. I have now enabled both TCP and UDP on port 9094.
cdn-registry-1.docker.io/prom/alertmanager:v0.21.0 0.0.0.0:9093-9094->9093-9094/tcp, 0.0.0.0:9094->9094/udp alertmanager
But the logs still show [ERR] memberlist: Failed fallback ping: EOF.
alertmanager_cluster_health_score is 7, the same as before.
rate(alertmanager_cluster_peers_left_total[5m]), rate(alertmanager_cluster_reconnections_total[5m]) and rate(alertmanager_cluster_reconnections_failed_total[5m]) are 0, and histogram_quantile(0.9, rate(alertmanager_cluster_pings_seconds_bucket[5m])) returns "No data found", the same as before.
Alertmanager 1 logs, IP: 172.25.9.134, Container ID: 8f2b987e14d9
{"log":"level=debug ts=2020-10-27T16:21:46.769Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/27 16:21:46 [DEBUG] memberlist: Initiating push/pull sync with: 172.25.9.134:9094\\n\"\n","stream":"stderr","time":"2020-10-27T16:21:46.769950851Z"}
{"log":"level=debug ts=2020-10-27T16:21:46.770Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/27 16:21:46 [DEBUG] memberlist: Stream connection from=172.17.0.1:32886\\n\"\n","stream":"stderr","time":"2020-10-27T16:21:46.770142591Z"}
{"log":"level=debug ts=2020-10-27T16:21:46.781Z caller=cluster.go:411 component=cluster msg=reconnect result=success peer= addr=172.25.9.134:9094\n","stream":"stderr","time":"2020-10-27T16:21:46.781497879Z"}
{"log":"level=debug ts=2020-10-27T16:21:46.782Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/27 16:21:46 [DEBUG] memberlist: Initiating push/pull sync with: 172.25.24.112:9094\\n\"\n","stream":"stderr","time":"2020-10-27T16:21:46.782169604Z"}
{"log":"level=debug ts=2020-10-27T16:21:46.797Z caller=cluster.go:411 component=cluster msg=reconnect result=success peer= addr=172.25.24.112:9094\n","stream":"stderr","time":"2020-10-27T16:21:46.797555617Z"}
{"log":"level=debug ts=2020-10-27T16:21:47.723Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/27 16:21:47 [INFO] memberlist: Suspect 01ENNA4WVGRQ1ZGBQH65VW7DN2 has failed, no acks received\\n\"\n","stream":"stderr","time":"2020-10-27T16:21:47.72390972Z"}
{"log":"level=debug ts=2020-10-27T16:21:48.723Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/27 16:21:48 [WARN] memberlist: Got ping for unexpected node '01ENNA4WVGRQ1ZGBQH65VW7DN2' from=172.17.0.4:9094\\n\"\n","stream":"stderr","time":"2020-10-27T16:21:48.723984556Z"}
{"log":"level=debug ts=2020-10-27T16:21:48.948Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/27 16:21:48 [DEBUG] memberlist: Stream connection from=172.25.24.112:40818\\n\"\n","stream":"stderr","time":"2020-10-27T16:21:48.948394258Z"}
{"log":"level=debug ts=2020-10-27T16:21:49.223Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/27 16:21:49 [DEBUG] memberlist: Failed ping: 01ENNA4WVGRQ1ZGBQH65VW7DN2 (timeout reached)\\n\"\n","stream":"stderr","time":"2020-10-27T16:21:49.224017699Z"}
{"log":"level=debug ts=2020-10-27T16:21:49.224Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/27 16:21:49 [DEBUG] memberlist: Stream connection from=172.17.0.4:40536\\n\"\n","stream":"stderr","time":"2020-10-27T16:21:49.22441859Z"}
{"log":"level=debug ts=2020-10-27T16:21:49.224Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/27 16:21:49 [WARN] memberlist: Got ping for unexpected node 01ENNA4WVGRQ1ZGBQH65VW7DN2 from=172.17.0.4:40536\\n\"\n","stream":"stderr","time":"2020-10-27T16:21:49.224439704Z"}
{"log":"level=debug ts=2020-10-27T16:21:49.224Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/27 16:21:49 [ERR] memberlist: Failed fallback ping: EOF\\n\"\n","stream":"stderr","time":"2020-10-27T16:21:49.2244861Z"}
{"log":"level=debug ts=2020-10-27T16:21:51.155Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/27 16:21:51 [DEBUG] memberlist: Initiating push/pull sync with: 01ENNA4WVGRQ1ZGBQH65VW7DN2 172.17.0.4:9094\\n\"\n","stream":"stderr","time":"2020-10-27T16:21:51.155570631Z"}
{"log":"level=debug ts=2020-10-27T16:21:51.155Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/27 16:21:51 [DEBUG] memberlist: Stream connection from=172.17.0.4:40644\\n\"\n","stream":"stderr","time":"2020-10-27T16:21:51.155890742Z"}
{"log":"level=debug ts=2020-10-27T16:21:52.718Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/27 16:21:52 [DEBUG] memberlist: Stream connection from=172.25.24.112:32892\\n\"\n","stream":"stderr","time":"2020-10-27T16:21:52.719021779Z"}
Alertmanager 2 logs, IP: 172.25.24.112, Container ID: e690fcef574e
{"log":"level=debug ts=2020-10-27T16:21:46.784Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/27 16:21:46 [DEBUG] memberlist: Stream connection from=172.25.9.134:58500\\n\"\n","stream":"stderr","time":"2020-10-27T16:21:46.784142677Z"}
{"log":"level=debug ts=2020-10-27T16:21:47.689Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/27 16:21:47 [INFO] memberlist: Suspect 01ENNABQMKCGFPC85421Q0HXTY has failed, no acks received\\n\"\n","stream":"stderr","time":"2020-10-27T16:21:47.689328365Z"}
{"log":"level=debug ts=2020-10-27T16:21:48.689Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/27 16:21:48 [WARN] memberlist: Got ping for unexpected node '01ENNABQMKCGFPC85421Q0HXTY' from=172.17.0.4:9094\\n\"\n","stream":"stderr","time":"2020-10-27T16:21:48.689442026Z"}
{"log":"level=debug ts=2020-10-27T16:21:49.189Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/27 16:21:49 [DEBUG] memberlist: Failed ping: 01ENNABQMKCGFPC85421Q0HXTY (timeout reached)\\n\"\n","stream":"stderr","time":"2020-10-27T16:21:49.189485438Z"}
{"log":"level=debug ts=2020-10-27T16:21:49.189Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/27 16:21:49 [DEBUG] memberlist: Stream connection from=172.17.0.4:40082\\n\"\n","stream":"stderr","time":"2020-10-27T16:21:49.189821219Z"}
{"log":"level=debug ts=2020-10-27T16:21:49.189Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/27 16:21:49 [WARN] memberlist: Got ping for unexpected node 01ENNABQMKCGFPC85421Q0HXTY from=172.17.0.4:40082\\n\"\n","stream":"stderr","time":"2020-10-27T16:21:49.189961178Z"}
{"log":"level=debug ts=2020-10-27T16:21:49.189Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/27 16:21:49 [ERR] memberlist: Failed fallback ping: EOF\\n\"\n","stream":"stderr","time":"2020-10-27T16:21:49.189974438Z"}
{"log":"level=debug ts=2020-10-27T16:21:52.720Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/27 16:21:52 [DEBUG] memberlist: Initiating push/pull sync with: 172.25.9.134:9094\\n\"\n","stream":"stderr","time":"2020-10-27T16:21:52.720270957Z"}
{"log":"level=debug ts=2020-10-27T16:21:52.742Z caller=cluster.go:411 component=cluster msg=reconnect result=success peer= addr=172.25.9.134:9094\n","stream":"stderr","time":"2020-10-27T16:21:52.743107389Z"}
{"log":"level=debug ts=2020-10-27T16:21:52.743Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/27 16:21:52 [DEBUG] memberlist: Stream connection from=172.17.0.1:44018\\n\"\n","stream":"stderr","time":"2020-10-27T16:21:52.75201922Z"}
{"log":"level=debug ts=2020-10-27T16:21:52.743Z caller=cluster.go:306 component=cluster memberlist=\"2020/10/27 16:21:52 [DEBUG] memberlist: Initiating push/pull sync with: 172.25.24.112:9094\\n\"\n","stream":"stderr","time":"2020-10-27T16:21:52.752144866Z"}
{"log":"level=debug ts=2020-10-27T16:21:52.753Z caller=cluster.go:411 component=cluster msg=reconnect result=success peer= addr=172.25.24.112:9094\n","stream":"stderr","time":"2020-10-27T16:21:52.753374079Z"}
Docker inspect of one of the Alertmanagers:
"Args": [
"--config.file=/etc/alertmanager/alertmanager.yml",
"--storage.path=/alertmanager",
"--cluster.peer=prod-prometheus01.prod:9094",
"--cluster.peer=prod-prometheus02.prod:9094",
"--web.external-url=XXXXXXXXXXXXXXXXXX",
"--log.level=debug"
],
...
"NetworkMode": "bridge",
"PortBindings": {
"9093/tcp": [
{
"HostIp": "0.0.0.0",
"HostPort": "9093"
}
],
"9094/tcp": [
{
"HostIp": "0.0.0.0",
"HostPort": "9094"
}
],
"9094/udp": [
{
"HostIp": "0.0.0.0",
"HostPort": "9094"
}
]
},
After opening UDP on port 9094 along with TCP, and allowing UDP access for port 9094 in the AWS Security Groups, we are still facing a clustering issue on one of the nodes.
I suspect something is going wrong because of Docker's "NetworkMode": "bridge" and the container IPs, since I have given hostnames as cluster.peer.
CC: @simonpasquier
Hmm, have you tried looking at the ping latency (alertmanager_cluster_pings_seconds_bucket)? Maybe you need to tweak some of the --cluster.* flags.
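For reference, a sketch of the --cluster.* flags that are usually worth tuning in this situation (the values below are illustrative only, not recommendations):

command:
  - '--cluster.probe-timeout=1s'    # how long to wait for a ping ack before suspecting a peer
  - '--cluster.probe-interval=2s'   # how often peers are probed
  - '--cluster.peer-timeout=15s'    # wait time between peers when sending notifications (HA stagger)
  - '--cluster.tcp-timeout=10s'     # timeout for TCP (stream) connections between peers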
solved? @shubhamc183
I have a similar problem.
Yesterday I temporarily worked around it by setting --cluster.probe-timeout=1s and --cluster.probe-interval=2s, but in the evenings the errors returned.
I use Docker and run all nodes with the same parameters.
command:
  - '--cluster.probe-timeout=1s'
  - '--cluster.probe-interval=2s'
  - '--config.file=/etc/alertmanager/config.yml'
  - '--storage.path=/alertmanager'
  - '--cluster.advertise-address=:9094'
  - '--cluster.peer=10.xx.xx.16:9094'
  - '--cluster.peer=10.xx.xx.17:9094'
  - '--cluster.peer=10.xx.xx.18:9094'
  - '--cluster.listen-address=0.0.0.0:9094'
ports:
  - 9093:9093
  - 9094:9094
  - 9094:9094/udp
# netstat -lntup | grep 9094
tcp 0 0 0.0.0.0:9094 0.0.0.0:* LISTEN 1413016/docker-prox
tcp6 0 0 :::9094 :::* LISTEN 1413023/docker-prox
udp 0 0 0.0.0.0:9094 0.0.0.0:* 1413037/docker-prox
udp6 0 0 :::9094 :::* 1413043/docker-prox
# netstat -lntup | grep 9093
tcp 0 0 0.0.0.0:9093 0.0.0.0:* LISTEN 1413058/docker-prox
tcp6 0 0 :::9093 :::* LISTEN 1413064/docker-prox
Alertmanager version=0.21.0, branch=HEAD, revision=4c6c03ebfe21009c546e4d1e9b92c371d67c021d
I am running Alertmanager HA on 2 different VMs, and after I set cluster.advertise-address to the VM's IP address with port 9094, the gossip has been quiet.
I can confirm that setting the cluster.advertise-address flag made our Alertmanager health metrics happy as well. Before hard-coding the host IP, it was advertising the container IP addresses for the HA cluster.
args:
  - --config.file=/etc/alertmanager-config/omc-alertmanager.yaml
  - --storage.path=/alertmanager
  - --cluster.listen-address=0.0.0.0:9094
  - --cluster.advertise-address=$(POD_IP):9094
  - --cluster.peer=omc-alertmanager-0.omc-alertmanager-service.omc:9094
  - --cluster.peer=omc-alertmanager-1.omc-alertmanager-service.omc:9094
  - --cluster.peer-timeout=180s
  - --log.level=info
env:
  - name: POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
We saw cluster flapping when deploying on k8s; in practice, configuring cluster.advertise-address like this resolved it.
With this configuration the HA cluster indeed no longer flaps, but alerts are still sent by the two pods in turn, and they are not repeated strictly every 5 minutes. The Alertmanager configuration is: group_wait: 30s, group_interval: 5m, repeat_interval: 5m.
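For context, those three settings live under the top-level route block of alertmanager.yml; a minimal sketch (the receiver name is a placeholder):

route:
  receiver: default   # placeholder receiver name
  group_wait: 30s     # how long to wait before sending the first notification for a new group
  group_interval: 5m  # how long to wait before notifying about new alerts added to an existing group
  repeat_interval: 5m # how long to wait before re-sending a notification that was already sent
receivers:
  - name: default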
@Gaozizhong It seems you are trying to configure it in a K8s environment, aren't you? Then --cluster.peer should point to the headless service, which will return the internal IPs of the Alertmanager pods, so you don't have to configure it twice.
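For reference, a minimal sketch of such a headless Service, assuming the StatefulSet's pods carry an app: omc-alertmanager label (that selector is an assumption, not from this thread); clusterIP: None is what makes the Service headless so its DNS name resolves to the individual pod IPs:

apiVersion: v1
kind: Service
metadata:
  name: omc-alertmanager-service
  namespace: omc
spec:
  clusterIP: None             # headless: DNS returns the pod IPs directly
  selector:
    app: omc-alertmanager     # assumed label on the Alertmanager pods
  ports:
    - name: cluster
      port: 9094
      targetPort: 9094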
@duj4 Is this how the headless service you mentioned should be configured?
args:
  - --config.file=/etc/alertmanager-config/omc-alertmanager.yaml
  - --storage.path=/alertmanager
  - --cluster.listen-address=0.0.0.0:9094
  - --cluster.advertise-address=$(HOSTNAME).omc-alertmanager-service.omc:9094
  - --cluster.peer=omc-alertmanager-0.omc-alertmanager-service.omc:9094
  - --cluster.peer=omc-alertmanager-1.omc-alertmanager-service.omc:9094
  - --cluster.peer-timeout=180s
  - --log.level=info
env:
  - name: HOSTNAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
Testing shows that using POD_IP also works. Is there any duplicated configuration here? If so, which parts are redundant?