alertmanager reliably crashes on every boot
What did you do?
Every time the system boots, alertmanager errors out (but it works when started manually later).
Environment
- System information:
Debian bookworm, Linux 6.1.0-27-amd64 x86_64
- Alertmanager version:
alertmanager, version 0.25.0 (branch: debian/sid, revision: 0.25.0-1+b4)
build user: [email protected]
build date: 20230409-09:50:43
go version: go1.19.8
platform: linux/amd64
- Prometheus version:
prometheus, version 2.42.0+ds (branch: debian/sid, revision: 2.42.0+ds-5+b5)
build user: [email protected]
build date: 20230518-08:49:35
go version: go1.19.8
platform: linux/amd64
- Logs:
Nov 19 23:43:43 lcg-lrz-monitor systemd[1]: Started prometheus-alertmanager.service - Alertmanager for prometheus.
Nov 19 23:43:43 lcg-lrz-monitor prometheus-alertmanager[616]: ts=2024-11-19T22:43:43.564Z caller=cluster.go:178 level=warn component=cluster err="couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided"
Nov 19 23:43:43 lcg-lrz-monitor prometheus-alertmanager[616]: ts=2024-11-19T22:43:43.569Z caller=main.go:273 level=error msg="unable to initialize gossip mesh" err="create memberlist: Failed to get final advertise address: No private IP address found, and explicit IP not provided"
Nov 19 23:43:43 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Main process exited, code=exited, status=1/FAILURE
Nov 19 23:43:43 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Failed with result 'exit-code'.
Nov 19 23:43:43 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Scheduled restart job, restart counter is at 1.
Nov 19 23:43:43 lcg-lrz-monitor systemd[1]: Stopped prometheus-alertmanager.service - Alertmanager for prometheus.
Nov 19 23:43:43 lcg-lrz-monitor systemd[1]: Started prometheus-alertmanager.service - Alertmanager for prometheus.
Nov 19 23:43:43 lcg-lrz-monitor prometheus-alertmanager[1141]: ts=2024-11-19T22:43:43.845Z caller=cluster.go:178 level=warn component=cluster err="couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided"
Nov 19 23:43:43 lcg-lrz-monitor prometheus-alertmanager[1141]: ts=2024-11-19T22:43:43.846Z caller=main.go:273 level=error msg="unable to initialize gossip mesh" err="create memberlist: Failed to get final advertise address: No private IP address found, and explicit IP not provided"
Nov 19 23:43:43 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Main process exited, code=exited, status=1/FAILURE
Nov 19 23:43:43 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Failed with result 'exit-code'.
Nov 19 23:43:44 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Scheduled restart job, restart counter is at 2.
Nov 19 23:43:44 lcg-lrz-monitor systemd[1]: Stopped prometheus-alertmanager.service - Alertmanager for prometheus.
Nov 19 23:43:44 lcg-lrz-monitor systemd[1]: Started prometheus-alertmanager.service - Alertmanager for prometheus.
Nov 19 23:43:44 lcg-lrz-monitor prometheus-alertmanager[1393]: ts=2024-11-19T22:43:44.249Z caller=cluster.go:178 level=warn component=cluster err="couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided"
Nov 19 23:43:44 lcg-lrz-monitor prometheus-alertmanager[1393]: ts=2024-11-19T22:43:44.250Z caller=main.go:273 level=error msg="unable to initialize gossip mesh" err="create memberlist: Failed to get final advertise address: No private IP address found, and explicit IP not provided"
Nov 19 23:43:44 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Main process exited, code=exited, status=1/FAILURE
Nov 19 23:43:44 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Failed with result 'exit-code'.
Nov 19 23:43:44 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Scheduled restart job, restart counter is at 3.
Nov 19 23:43:44 lcg-lrz-monitor systemd[1]: Stopped prometheus-alertmanager.service - Alertmanager for prometheus.
Nov 19 23:43:44 lcg-lrz-monitor systemd[1]: Started prometheus-alertmanager.service - Alertmanager for prometheus.
Nov 19 23:43:44 lcg-lrz-monitor prometheus-alertmanager[1914]: ts=2024-11-19T22:43:44.665Z caller=cluster.go:178 level=warn component=cluster err="couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided"
Nov 19 23:43:44 lcg-lrz-monitor prometheus-alertmanager[1914]: ts=2024-11-19T22:43:44.666Z caller=main.go:273 level=error msg="unable to initialize gossip mesh" err="create memberlist: Failed to get final advertise address: No private IP address found, and explicit IP not provided"
Nov 19 23:43:44 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Main process exited, code=exited, status=1/FAILURE
Nov 19 23:43:44 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Failed with result 'exit-code'.
Nov 19 23:43:45 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Scheduled restart job, restart counter is at 4.
Nov 19 23:43:45 lcg-lrz-monitor systemd[1]: Stopped prometheus-alertmanager.service - Alertmanager for prometheus.
Nov 19 23:43:45 lcg-lrz-monitor systemd[1]: Started prometheus-alertmanager.service - Alertmanager for prometheus.
Nov 19 23:43:45 lcg-lrz-monitor prometheus-alertmanager[2017]: ts=2024-11-19T22:43:45.093Z caller=cluster.go:178 level=warn component=cluster err="couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided"
Nov 19 23:43:45 lcg-lrz-monitor prometheus-alertmanager[2017]: ts=2024-11-19T22:43:45.094Z caller=main.go:273 level=error msg="unable to initialize gossip mesh" err="create memberlist: Failed to get final advertise address: No private IP address found, and explicit IP not provided"
Nov 19 23:43:45 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Main process exited, code=exited, status=1/FAILURE
Nov 19 23:43:45 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Failed with result 'exit-code'.
Nov 19 23:43:45 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Scheduled restart job, restart counter is at 5.
Nov 19 23:43:45 lcg-lrz-monitor systemd[1]: Stopped prometheus-alertmanager.service - Alertmanager for prometheus.
Nov 19 23:43:45 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Start request repeated too quickly.
Nov 19 23:43:45 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Failed with result 'exit-code'.
Nov 19 23:43:45 lcg-lrz-monitor systemd[1]: Failed to start prometheus-alertmanager.service - Alertmanager for prometheus.
Nov 19 23:43:45 lcg-lrz-monitor systemd[1]: prometheus-alertmanager.service: Triggering OnFailure= dependencies.
Nov 19 23:43:45 lcg-lrz-monitor systemd[1]: Starting [email protected] - send systemd unit status via email to `root`...
Nov 19 23:43:47 lcg-lrz-monitor systemd[1]: [email protected]: Deactivated successfully.
Nov 19 23:43:47 lcg-lrz-monitor systemd[1]: Finished [email protected] - send systemd unit status via email to `root`.
Nov 19 23:44:11 lcg-lrz-monitor prometheus[621]: ts=2024-11-19T22:44:11.659Z caller=notifier.go:532 level=error component=notifier alertmanager=http://localhost:9093/api/v2/alerts count=2 msg="Error sending alert" err="Post \"http://localhost:9093/api/v2/alerts\": dial tcp [::1]:9093: connect: connection refused"
Nov 19 23:44:11 lcg-lrz-monitor prometheus[621]: ts=2024-11-19T22:44:11.660Z caller=notifier.go:532 level=error component=notifier alertmanager=http://localhost:9093/api/v2/alerts count=2 msg="Error sending alert" err="Post \"http://localhost:9093/api/v2/alerts\": dial tcp [::1]:9093: connect: connection refused"
Nov 19 23:44:53 lcg-lrz-monitor prometheus[621]: ts=2024-11-19T22:44:53.930Z caller=notifier.go:532 level=error component=notifier alertmanager=http://localhost:9093/api/v2/alerts count=9 msg="Error sending alert" err="Post \"http://localhost:9093/api/v2/alerts\": dial tcp [::1]:9093: connect: connection refused"
Nov 19 23:45:21 lcg-lrz-monitor prometheus[621]: ts=2024-11-19T22:45:21.648Z caller=notifier.go:532 level=error component=notifier alertmanager=http://localhost:9093/api/v2/alerts count=2 msg="Error sending alert" err="Post \"http://localhost:9093/api/v2/alerts\": dial tcp [::1]:9093: connect: connection refused"
Nov 19 23:45:21 lcg-lrz-monitor prometheus[621]: ts=2024-11-19T22:45:21.650Z caller=notifier.go:532 level=error component=notifier alertmanager=http://localhost:9093/api/v2/alerts count=2 msg="Error sending alert" err="Post \"http://localhost:9093/api/v2/alerts\": dial tcp [::1]:9093: connect: connection refused"
Nov 19 23:46:31 lcg-lrz-monitor prometheus[621]: ts=2024-11-19T22:46:31.648Z caller=notifier.go:532 level=error component=notifier alertmanager=http://localhost:9093/api/v2/alerts count=2 msg="Error sending alert" err="Post \"http://localhost:9093/api/v2/alerts\": dial tcp [::1]:9093: connect: connection refused"
Nov 19 23:46:31 lcg-lrz-monitor prometheus[621]: ts=2024-11-19T22:46:31.650Z caller=notifier.go:532 level=error component=notifier alertmanager=http://localhost:9093/api/v2/alerts count=2 msg="Error sending alert" err="Post \"http://localhost:9093/api/v2/alerts\": dial tcp [::1]:9093: connect: connection refused"
Nov 19 23:46:53 lcg-lrz-monitor prometheus[621]: ts=2024-11-19T22:46:53.930Z caller=notifier.go:532 level=error component=notifier alertmanager=http://localhost:9093/api/v2/alerts count=9 msg="Error sending alert" err="Post \"http://localhost:9093/api/v2/alerts\": dial tcp [::1]:9093: connect: connection refused"
So it seems there are a number of errors involved here:
couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided
create memberlist: Failed to get final advertise address: No private IP address found, and explicit IP not provided
I'm not really sure what it means by "private IP" (or why it should need one); any normal UNIX daemon typically binds to the wildcard address if no specific bind addresses are given.
Also, the service is pulled in by multi-user.target, and by that time all networking (including the statically configured global IPs) has long been up.
Error sending alert" err="Post \"http://localhost:9093/api/v2/alerts\": dial tcp [::1]:9093: connect: connection refused
These are also a bit strange, IMO... at least if they'd cause the daemon to exit. I mean, it should rather be clear that prometheus may not yet be running.
Anyway, if I start the daemon a bit later, it works just fine.
Cheers, Chris,
Hi!
Alertmanager is crashing because it cannot get the information it needs to initialize the cluster for high availability mode. The error means it cannot find a private IP address for the system, which it advertises to other Alertmanagers in the same cluster.
If you do not need high availability mode, you can disable it with the following argument:
--cluster.listen-address=""
Error sending alert" err="Post "http://localhost:9093/api/v2/alerts": dial tcp [::1]:9093: connect: connection refused
Your Prometheus can't send alerts to Alertmanager because it's crash looping.
But:
- Why does it even try an HA mode if I haven't configured any other instances? Or does it try to auto-detect them?
- More important, and as I've said before: at the time when prometheus-alertmanager is started during boot, all network interfaces have already been up for quite a while, and it does work when I start it manually later (by which time no further interfaces or addresses have been added).
But:
- Why does it even try an HA mode if I haven't configured any other instances? Or does it try to auto-detect them?
That's just the default behavior. I'm not sure it makes sense either, but that's how it has been for as long as I can remember it. Someone else might know if there is a reason for this.
- More important, and as I've said before: at the time when prometheus-alertmanager is started during boot, all network interfaces have already been up for quite a while, and it does work when I start it manually later (by which time no further interfaces or addresses have been added).
Does the interface have an IP address at this time? Is it possible it takes a while for an IP address to be assigned via DHCP, meaning the host has an "up" network interface that takes a while to become connected/established? That could explain why it doesn't work immediately on startup but later when you try it manually. Does it also work with systemd if you start it some time later after startup?
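If it does turn out that the address only appears shortly after multi-user.target is reached, one way to test that theory from the systemd side is an override that orders the unit after network-online.target. A sketch (not part of the packaged unit):

# Created via: systemctl edit prometheus-alertmanager
# -> /etc/systemd/system/prometheus-alertmanager.service.d/override.conf
[Unit]
Wants=network-online.target
After=network-online.target

Note that whether network-online.target actually waits for address assignment depends on the network stack in use (e.g. whether systemd-networkd-wait-online or NetworkManager-wait-online is enabled).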
I ran into the same problem, and not only with Alertmanager; other applications show similar errors, all related to Memberlist. Below is the advice an LLM gave me, which I have verified does work. I went with option two.
(The original comment was in Chinese; translating it with any LLM should be fine.)
🔍 1. Where the error is triggered
In the code path around main.go:278, Alertmanager fails while initializing Memberlist (used for gossip-protocol communication between cluster nodes):
// Source location: cmd/alertmanager/main.go
func main() {
    // ...
    ml, err := createMemberlist(clusterBindAddr, clusterAdvertiseAddr, peerName, cfg.LogLevel == "debug")
    if err != nil {
        level.Error(logger).Log("msg", "unable to initialize gossip mesh", "err", err) // line 278
        return
    }
}
- The createMemberlist() function is responsible for creating the Memberlist instance and depends on clusterAdvertiseAddr (the cluster advertise address).
- If clusterAdvertiseAddr is empty, Memberlist tries to auto-detect the node's private IP; if detection fails, it throws the "No private IP address found" error.
⚙️ 2. The underlying IP detection mechanism
Memberlist's automatic IP detection lives in its underlying library (not in Alertmanager code); the core flow is:
- Enumerate all network interfaces: call net.Interfaces() to get all of the host's network interfaces.
- Filter valid interfaces: skip loopback interfaces, interfaces that are down, and interfaces without an IPv4 address.
- Select private IPs (RFC 1918): keep only addresses within the private ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16).
- Pick the first valid IP: if several private IPs are found, the first one is used by default.
Failure scenario:
In container environments, auto-detection fails if the node only has a public IP (e.g. a cloud server's elastic IP) or a custom subnet (e.g. Docker's 172.17.0.0/16 being misjudged as non-private).
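As a rough illustration of that flow (a simplified sketch in Go, not the actual memberlist library code), such a private-IP scan could look roughly like this:

package main

import (
    "fmt"
    "net"
)

// privateBlocks approximates the RFC 1918 ranges used by the scan.
var privateBlocks = []string{"10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"}

func firstPrivateIP() (string, error) {
    ifaces, err := net.Interfaces()
    if err != nil {
        return "", err
    }
    for _, iface := range ifaces {
        // Skip loopback interfaces and interfaces that are down.
        if iface.Flags&net.FlagLoopback != 0 || iface.Flags&net.FlagUp == 0 {
            continue
        }
        addrs, err := iface.Addrs()
        if err != nil {
            continue
        }
        for _, addr := range addrs {
            ipNet, ok := addr.(*net.IPNet)
            if !ok || ipNet.IP.To4() == nil {
                continue // only consider IPv4 addresses here
            }
            for _, block := range privateBlocks {
                _, cidr, _ := net.ParseCIDR(block)
                if cidr.Contains(ipNet.IP) {
                    return ipNet.IP.String(), nil // first match wins
                }
            }
        }
    }
    return "", fmt.Errorf("no private IP address found")
}

func main() {
    ip, err := firstPrivateIP()
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    fmt.Println("advertise address candidate:", ip)
}

A host whose only global address is public (or whose addresses are assigned after the scan runs) falls through to the error branch, which matches the message in the logs above.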
🐳 3. Issues specific to container environments
In Kubernetes/Docker, the following situations can cause detection to fail:
- Interference from multiple NICs: a container may be bound to several virtual NICs (e.g. veth devices created by a CNI plugin), some of which have no valid IP.
- Non-standard private subnets: custom networks (e.g. 172.16.0.0/24) are ignored if they are not covered by RFC 1918.
- IP not assigned in time: the network is not ready when the container starts, so no IP has been injected yet.
⚡ 4. Solutions
Alertmanager needs the cluster communication address declared explicitly via --cluster.advertise-address. In Kubernetes this works as follows:
Option 1: inject the Pod IP via the Downward API
# Deployment configuration snippet
env:
  - name: POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
args:
  - "--cluster.advertise-address=$(POD_IP):9094"  # use the Pod IP as the advertise address
- Principle: the status.podIP field exposed by Kubernetes provides the private IP dynamically, bypassing auto-detection.
Option 2: change the container environment's subnet
When the error occurred, my Pod subnet was 172.40.0.0/16; after I rebuilt the Kubernetes cluster with a Pod subnet of 172.29.0.0/16, everything worked fine.
Hi @calestyo,
I agree with the analysis given by grobinson-grafana, and we have not heard back since. Therefore, it seems fair to close this.
Kind regards, Solomon Jacobs