[kube-prometheus-stack] Evaluating rule failed
Describe the bug: a clear and concise description of what the bug is.
The recording rule node_namespace_pod_container:container_memory_rss
fails to evaluate every hour or so because of "multiple matches for labels: grouping labels must ensure unique matches".
I saw this issue: https://github.com/prometheus-operator/prometheus-operator/issues/1319. However, the honorLabels: true option has already been added to the templates.
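For reference, a minimal sketch of where that option is exposed in the chart values, assuming the kubelet.serviceMonitor.honorLabels key of kube-prometheus-stack (verify against the values.yaml of your chart version):

```yaml
# Sketch (assumption: kube-prometheus-stack exposes honorLabels for the
# kubelet ServiceMonitor under kubelet.serviceMonitor; check your chart version)
kubelet:
  serviceMonitor:
    honorLabels: true
```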
What's your helm version?
3.6.3
What's your kubectl version?
1.22.0
Which chart?
kube-prometheus-stack
What's the chart version?
32.2.1
What happened?
We are seeing the following errors from Alertmanager.
Prometheus monitoring/prometheus-prometheus-operator-kube-p-prometheus-1 has failed to evaluate 4 rules in the last 5m.
It appears to happen every few hours, for a few minutes at a time (screenshot of the alert omitted).
Looking at the Prometheus logs, we see:
ts=2022-06-26T13:48:58.529Z caller=manager.go:609 level=warn component="rule manager" group=k8s.rules msg="Evaluating rule failed" rule="record: node_namespace_pod_container:container_memory_rss\nexpr: container_memory_rss{image!=\"\",job=\"kubelet\",metrics_path=\"/metrics/cadvisor\"}\n * on(namespace, pod) group_left(node) topk by(namespace, pod) (1, max by(namespace,\n pod, node) (kube_pod_info{node!=\"\"}))\n" err="multiple matches for labels: grouping labels must ensure unique matches"
I am not sure whether something specific in our cluster triggers these errors, but it looks like node_namespace_pod_container:container_memory_rss
is the rule that causes them. This is the query:
container_memory_rss{image!="",job="kubelet",metrics_path="/metrics/cadvisor"} * on(namespace, pod) group_left(node) topk by(namespace, pod) (1, max by(namespace, pod, node) (kube_pod_info{node!=""}))
I would have suggested a fix, but I don't understand PromQL well enough to figure out what causes this error.
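For what it's worth, the error means the right-hand side of the group_left join (the kube_pod_info part) matched more than one series for some (namespace, pod) pair at evaluation time. A rough diagnostic sketch (not part of the chart; run it ad hoc around the time the rule fails) that may show which pods are affected:

```promql
# Pods for which kube_pod_info returns more than one series (e.g. a pod
# rescheduled to another node, or duplicate kube-state-metrics scrapes).
count by (namespace, pod) (kube_pod_info{node!=""}) > 1

# The exact "one" side of the failing join; any result > 1 reproduces the
# "grouping labels must ensure unique matches" condition.
count by (namespace, pod) (
  topk by (namespace, pod) (1,
    max by (namespace, pod, node) (kube_pod_info{node!=""})
  )
) > 1
```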
What you expected to happen?
No errors
How to reproduce it?
Not sure
Enter the changed values of values.yaml?
prometheusOperator:
  kubeletService:
    namespace: monitoring

kubelet:
  namespace: monitoring

nodeExporter:
  enabled: true

kubeProxy:
  enabled: false

kubeApiServer:
  enabled: true

kubeStateMetrics:
  enabled: true

alertmanager:
  alertmanagerSpec:
    externalUrl: <URL>
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: default
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 2Gi

prometheus:
  prometheusSpec:
    image:
      tag: v2.33.3
    logLevel: debug
Enter the command that you execute and failing/misfunctioning.
Deployed via terraform
Anything else we need to know?
No response
Also opened an issue in kubernetes-mixin.
We are running on AKS with a substantial number of spot machines. I suspect it originates from unready nodes, but I'm not sure.
Remove the prometheus_replica label injection as a metric label. When trying to group two metrics by their labels, it finds more than one match because of that.
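A minimal sketch of how that suggestion could be applied through the chart values, assuming the replicaExternalLabelNameClear key exposed under kube-prometheus-stack's prometheusSpec (verify against the values.yaml of your chart version):

```yaml
# Sketch (assumption: the prometheus_replica label comes from the operator's
# default replica external label; this chart key stops it from being injected)
prometheus:
  prometheusSpec:
    replicaExternalLabelNameClear: true
```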
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
This issue is being automatically closed due to inactivity.