awesome-prometheus-alerts
Changed kernel info breaks queries
This issue can be observed with at least the following alerts, but it might also affect other alerts using node_uname_info:
- HostOutOfMemory
- HostUnusualDiskReadRate
- HostUnusualDiskWriteRate
- HostHighCpuLoad
- HostCpuHighIowait
- HostPhysicalComponentTooHot
This happens under the following conditions:
- the server is rebooted with a new(er) kernel and the version/release information changes
- the server has multiple monitored partitions, e.g. rootfs and /srv (for HostUnusualDiskReadRate and HostUnusualDiskWriteRate)
execution: found duplicate series for the match group {instance="monitor.localdomain:9100"} on the right hand-side of the
operation:
[{__name__="node_uname_info", domainname="(none)", group="infra", instance="monitor.localdomain:9100",
job="node", machine="x86_64", nodename="monitor", release="5.10.0-27-amd64", sysname="Linux",
version="#1 SMP Debian 5.10.205-2 (2023-12-31)"},
{__name__="node_uname_info", domainname="(none)", group="infra", instance="monitor.localdomain:9100",
job="node", machine="x86_64", nodename="monitor", release="5.10.0-26-amd64", sysname="Linux",
version="#1 SMP Debian 5.10.197-1 (2023-09-29)"}];
many-to-many matching not allowed: matching labels must be unique on one side
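To check whether an instance currently exposes more than one node_uname_info series (as in the error above, where the two series differ only in release and version), a query along these lines might help. It is only a sketch: it reports duplicates present at the evaluated instant, so it has to be run while the error is occurring.

```promql
# Instances that have more than one node_uname_info series at this instant.
# Any result here means a join using on(instance) against node_uname_info
# will fail with "many-to-many matching not allowed".
count by (instance) (node_uname_info) > 1
```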
I believe the reason is this part of the query: on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
This join is probably unnecessary; adding nodename should instead be handled by relabeling in Prometheus or by using a regexp.
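If the join is kept, one possible workaround is to collapse the right-hand side to a single series per instance before matching. The sketch below assumes the duplicate series differ only in labels other than instance and nodename (release and version in the error above); the left-hand side is a placeholder modelled on a memory alert and is not necessarily the exact expression used in the rule.

```promql
# Placeholder left-hand side (assumed, for illustration only).
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10)
  * on(instance) group_left(nodename)
  # Aggregate away release, version, etc. so that at most one series
  # per (instance, nodename) remains. node_uname_info always has the
  # value 1, so max() leaves the result of the multiplication unchanged.
  max by (instance, nodename) (node_uname_info{nodename=~".+"})
```

This does not help if nodename itself differs between the duplicate series; in that case two series per instance would still remain on the right-hand side.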