awesome-prometheus-alerts

Changed kernel info breaks queries

Open • roock opened this issue 1 year ago • 1 comment

This issue can be observed with at least the following alerts, but it might also affect other alerts using node_uname_info:

  • HostOutOfMemory
  • HostUnusualDiskReadRate
  • HostUnusualDiskWriteRate
  • HostHighCpuLoad
  • HostCpuHighIowait
  • HostPhysicalComponentTooHot

This is happening in the following conditions:

  • the server is rebooted with a new(er) kernel and the version/release information changes
  • the server has multiple monitored partitions, e.g. rootfs and /srv, for HostUnusualDiskReadRate and HostUnusualDiskWriteRate

execution: found duplicate series for the match group {instance="monitor.localdomain:9100"} on the right hand-side of the
operation:
[{__name__="node_uname_info", domainname="(none)", group="infra", instance="monitor.localdomain:9100",
job="node", machine="x86_64", nodename="monitor", release="5.10.0-27-amd64", sysname="Linux",
version="#1 SMP Debian 5.10.205-2 (2023-12-31)"},
{__name__="node_uname_info", domainname="(none)", group="infra", instance="monitor.localdomain:9100",
job="node", machine="x86_64", nodename="monitor", release="5.10.0-26-amd64", sysname="Linux",
version="#1 SMP Debian 5.10.197-1 (2023-09-29)"}];
many-to-many matching not allowed: matching labels must be unique on one side
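
A quick way to check which targets are affected is to count the node_uname_info series per instance. This is only a sketch, assuming the metric name shown in the error above; any instance it returns has more than one uname series at evaluation time, so a join on instance against it will fail.

```promql
# Instances that currently expose more than one node_uname_info series.
count by (instance) (node_uname_info) > 1
```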

roock, Jan 09 '24 11:01

I believe the reason is this part of the query: on(instance) group_left (nodename) node_uname_info{nodename=~".+"}

It is probably unnecessary and should be handled by relabeling in Prometheus or using a regexp.
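
If the nodename label should stay on the alert, one possible workaround is to collapse the right-hand side to a single series per instance before joining. The following is a sketch, not the project's fix: the memory condition is an illustrative left-hand side rather than the rule's exact text, and it assumes nodename itself is the same before and after the reboot.

```promql
# Illustrative left-hand side; substitute the alert's own condition.
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10)
  * on(instance) group_left(nodename)
    # Aggregating drops release/version, so both kernel series merge into one.
    max by (instance, nodename) (node_uname_info{nodename=~".+"})
```

If nodename differs between the duplicate series, the match is still many-to-many and this variant does not help; dropping the join entirely, as suggested above, avoids the problem in that case.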

guruevi, Feb 25 '24 01:02