ha_cluster_exporter
Corosync/Pacemaker metrics issue on one node in cluster
Hi, we have several three-node clusters running Corosync/Pacemaker. ha_cluster_exporter is installed on all of them and works just fine, except on one node in one cluster, where opening the metrics URL returns an error like this:
An error has occurred while serving metrics:
collected metric "ha_cluster_corosync_member_votes" { label:<name:"local" value:"false" > label:<name:"node" value:"NR" > label:<name:"node_id" value:"32566" > gauge:<value:3 > } was collected before with the same name and label values
I checked all nodes in the cluster, and all of them have different IDs:
- node1
ha_cluster_corosync_member_votes{local="false",node="xxxx",node_id="2"} 1
ha_cluster_corosync_member_votes{local="false",node="NR",node_id="32636"} 3
ha_cluster_corosync_member_votes{local="true",node="node1.infra.env",node_id="1"} 1
- node2
ha_cluster_corosync_member_votes{local="false",node="xxxx",node_id="1"} 1
ha_cluster_corosync_member_votes{local="false",node="xxxx",node_id="3"} 1
ha_cluster_corosync_member_votes{local="false",node="NR",node_id="32652"} 2
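The error above is raised when two samples in one scrape share the same metric name and the same label values; the sample value itself does not matter. A minimal sketch for checking saved metrics output for such collisions is shown here. It assumes the text of a `/metrics` response from a node that still serves metrics has been saved as a string; `series_key` and `find_duplicate_series` are hypothetical helper names, not part of ha_cluster_exporter.

```python
import re
from collections import Counter

# One exposition line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces: name="value" (value may contain escapes).
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def series_key(line):
    """Return (metric name, sorted label pairs) for a sample line, else None."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # blank line or HELP/TYPE comment
    match = SAMPLE_RE.match(line)
    if match is None:
        return None
    labels = tuple(sorted(LABEL_RE.findall(match.group(2) or "")))
    return (match.group(1), labels)


def find_duplicate_series(text):
    """List every (name, labels) identity that occurs more than once."""
    keys = [k for k in map(series_key, text.splitlines()) if k is not None]
    return [key for key, count in Counter(keys).items() if count > 1]


if __name__ == "__main__":
    node1 = '''
ha_cluster_corosync_member_votes{local="false",node="xxxx",node_id="2"} 1
ha_cluster_corosync_member_votes{local="false",node="NR",node_id="32636"} 3
ha_cluster_corosync_member_votes{local="true",node="node1.infra.env",node_id="1"} 1
'''
    print(find_duplicate_series(node1))
```

On the failing node the endpoint returns the error instead of the metrics, so this check only helps on output captured from nodes that still respond.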
Logs are very similar on all of the nodes:
level=info msg="Starting ha_cluster_exporter (version=1.3.0+git.1653405719.2a65dfc, branch=HEAD, revision=2a65dfc015e614e53f34effbd0847cc20317b952)"
level=info msg="Build context (go=go1.16.15, user=runner@fv-az341-182, date=20220524-15:44:13)"
level=warn msg="Reading config file failed" err="Config File \"ha_cluster_exporter\" Not Found in \"[/ /root/.config /etc /usr/etc]\""
level=info msg="Default config values will be used"
level=warn msg="Registration failure" err="could not initialize 'sbd' collector: '/usr/sbin/sbd' does not exist"
level=warn msg="Registration failure" err="could not initialize 'drbd' collector: '/sbin/drbdsetup' does not exist"
level=info msg="pacemaker collector registered."
level=info msg="corosync collector registered."
level=info msg="Serving metrics on :9664/metrics"
level=warn msg="Reading web config file failed" err="stat /etc/ha_cluster_exporter.web.yaml: no such file or directory"
level=info msg="Default web config or commandline values will be used"
level=info msg="TLS is disabled." http2=false
All nodes have the same configuration (OS, HDD, RAM, CPU) and are built and provisioned using Puppet configuration management.
Service file is very simple:
[Unit]
Description=Prometheus ha_cluster_exporter
Wants=network-online.target
After=network-online.target
[Service]
User=root
Group=root
ExecStart=/usr/local/bin/ha_cluster_exporter
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=always
[Install]
WantedBy=multi-user.target
For the love of God, I cannot find what the issue could be here. Did we misconfigure something, or did we miss a step? There is nothing special set; we install the exporter and run it.
The OS is Debian 11 and the exporter version is 1.3.3 (the same issue occurs with older versions too).
This is a bug.
Thanks for your bug report. This is definitely not supposed to happen.
Could you please report the output of corosync-quorumtool -p on both nodes?
Yes, here is the output from all three nodes:
root@node1:~# corosync-quorumtool -p
Quorum information
------------------
Date: Tue Mar 5 14:11:15 2024
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 1
Ring ID: 1.149
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate
Membership information
----------------------
Nodeid Votes Qdevice Name
1 1 NR node1.infra.env (local)
2 1 NR XXXX:YYYY:ZZZZ:QQQQ::62%32695
3 1 NR XXXX:YYYY:ZZZZ:QQQQ::63%32695
root@node2:~# corosync-quorumtool -p
Quorum information
------------------
Date: Tue Mar 5 14:11:56 2024
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 2
Ring ID: 1.149
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate
Membership information
----------------------
Nodeid Votes Qdevice Name
1 1 NR XXXX:YYYY:ZZZZ:QQQQ::61%32620
2 1 NR node2.infra.env (local)
3 1 NR XXXX:YYYY:ZZZZ:QQQQ::63%32620
root@node3:~# corosync-quorumtool -p
Quorum information
------------------
Date: Tue Mar 5 14:12:31 2024
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 3
Ring ID: 1.149
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate
Membership information
----------------------
Nodeid Votes Qdevice Name
1 1 NR XXXX:YYYY:ZZZZ:QQQQ::61%32728
2 1 NR XXXX:YYYY:ZZZZ:QQQQ::62%32728
3 1 NR node3.infra.env (local)
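For reproducing the problem outside the exporter, the membership table above can be split into its four columns with a short script. This is only a sketch under the assumption that each data row has the form `Nodeid Votes Qdevice Name`, as in the output shown; `parse_membership` is a hypothetical helper and is not how ha_cluster_exporter parses this output. Note that the remote names here are addresses with a `%` suffix, and that the failing metric carries `node="NR"`, which equals the Qdevice column value, so the way rows are split into columns may be worth checking. That is a guess, not a confirmed diagnosis.

```python
def parse_membership(output):
    """Parse the 'Membership information' table into a list of dicts."""
    members = []
    in_table = False
    for line in output.splitlines():
        # Split into at most four fields so the name keeps any inner spaces.
        fields = line.split(None, 3)
        if fields[:2] == ["Nodeid", "Votes"]:
            in_table = True  # header row; data rows follow
            continue
        if not in_table or len(fields) < 4:
            continue
        if not (fields[0].isdigit() and fields[1].isdigit()):
            continue  # not a data row (separator, prompt, other section)
        name = fields[3].strip()
        local = name.endswith("(local)")
        if local:
            name = name[: -len("(local)")].rstrip()
        members.append({
            "node_id": fields[0],
            "votes": int(fields[1]),
            "qdevice": fields[2],
            "name": name,
            "local": local,
        })
    return members


if __name__ == "__main__":
    sample = """Membership information
----------------------
Nodeid Votes Qdevice Name
1 1 NR node1.infra.env (local)
2 1 NR XXXX:YYYY:ZZZZ:QQQQ::62%32695
3 1 NR XXXX:YYYY:ZZZZ:QQQQ::63%32695
"""
    for member in parse_membership(sample):
        print(member)
```

Each row should yield exactly one entry with a distinct `node_id`; if a parser produces more than one entry per row, or reuses the same ID, the duplicate-metric error reported above would follow.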
Thanks, I will look into it sometime over the coming weeks and let you know.
Thanks a lot for the effort.