
[kube-prometheus-stack] Evaluating rule failed

Open usternik opened this issue 2 years ago • 4 comments

Describe the bug: a clear and concise description of what the bug is.

node_namespace_pod_container:container_memory_rss fails to evaluate every hour or so with "multiple matches for labels: grouping labels must ensure unique matches".

I saw this issue https://github.com/prometheus-operator/prometheus-operator/issues/1319, but the honorLabels: true option was already added to the templates.

What's your helm version?

3.6.3

What's your kubectl version?

1.22.0

Which chart?

kube-prometheus-stack

What's the chart version?

32.2.1

What happened?

We are seeing the following error from Alertmanager: "Prometheus monitoring/prometheus-prometheus-operator-kube-p-prometheus-1 has failed to evaluate 4 rules in the last 5m."

It appears to be happening every few hours, for a few minutes at a time (a screenshot is attached in the original issue).

Looking at the Prometheus logs, we see:
ts=2022-06-26T13:48:58.529Z caller=manager.go:609 level=warn component="rule manager" group=k8s.rules msg="Evaluating rule failed" rule="record: node_namespace_pod_container:container_memory_rss\nexpr: container_memory_rss{image!=\"\",job=\"kubelet\",metrics_path=\"/metrics/cadvisor\"}\n * on(namespace, pod) group_left(node) topk by(namespace, pod) (1, max by(namespace,\n pod, node) (kube_pod_info{node!=\"\"}))\n" err="multiple matches for labels: grouping labels must ensure unique matches"

I am not sure whether something specific to our cluster triggers these errors, but it looks like the node_namespace_pod_container:container_memory_rss rule is the one that fails. This is the query: container_memory_rss{image!="",job="kubelet",metrics_path="/metrics/cadvisor"} * on(namespace, pod) group_left(node) topk by(namespace, pod) (1, max by(namespace, pod, node) (kube_pod_info{node!=""}))

I would have suggested a fix, but I don't understand PromQL well enough to figure out what causes this error.
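For anyone debugging the same error: the join fails whenever the right-hand side of the rule returns more than one series for the same (namespace, pod) pair, which is exactly what the on(namespace, pod) group_left(node) match requires to be unique. A diagnostic query along these lines (a sketch, not part of the original report) surfaces the offending pairs at evaluation time:

  # Any result here means the group_left(node) match in the recording rule is not unique.
  count by (namespace, pod) (
    topk by (namespace, pod) (1,
      max by (namespace, pod, node) (kube_pod_info{node!=""})
    )
  ) > 1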

What you expected to happen?

No errors

How to reproduce it?

Not sure

Enter the changed values of values.yaml?

prometheusOperator:
  kubeletService:
    namespace: monitoring

kubelet:
  namespace: monitoring

nodeExporter:
  enabled: true

kubeProxy:
  enabled: false

kubeApiServer:
  enabled: true

kubeStateMetrics:
  enabled: true

alertmanager:
  alertmanagerSpec:
    externalUrl: <URL>
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: default
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 2Gi

prometheus:
  prometheusSpec:
    image:
      tag: v2.33.3
    logLevel: debug

Enter the command that you execute and failing/misfunctioning.

Deployed via Terraform.

Anything else we need to know?

No response

usternik avatar Jun 26 '22 15:06 usternik

Opened also issue in kubernetes-mixin.

usternik avatar Jun 28 '22 08:06 usternik

We are running on AKS with a substantial number of spot machines. I suspect it originates from unready nodes, but I'm not sure.

usternik avatar Jun 29 '22 16:06 usternik
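One way to test that suspicion (a diagnostic sketch, not from the thread): check whether kube_pod_info ever reports the same pod on more than one node, which can happen briefly when a pod is rescheduled off a reclaimed spot node:

  # Pods that kube_pod_info currently attributes to more than one node.
  count by (namespace, pod) (
    max by (namespace, pod, node) (kube_pod_info{node!=""})
  ) > 1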

Remove the prometheus_replica label injection as a metric label. When the rule tries to join the two metrics on their labels, it finds more than one match because of it.

druanoor avatar Jun 30 '22 17:06 druanoor
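To check whether duplicated scrapes are what breaks the one-to-one match (for example, series that differ only in a prometheus_replica or instance label), a query like this (a sketch, assuming the default kube-state-metrics metric name) counts identical (namespace, pod, node) tuples:

  # More than one kube_pod_info series for the same pod on the same node
  # points to duplicate scraping or extra injected labels such as prometheus_replica.
  count by (namespace, pod, node) (kube_pod_info{node!=""}) > 1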

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale[bot] avatar Jul 31 '22 08:07 stale[bot]

This issue is being automatically closed due to inactivity.

stale[bot] avatar Sep 20 '22 18:09 stale[bot]