
Ruler: MimirRulerTooManyFailedQueries alert due to user error

Open rekup opened this issue 1 year ago • 6 comments

Describe the bug

We use Mimir and the rules from the mimir-mixin. Recently we onboarded a customer who sends Kubernetes metrics to our Mimir cluster. Due to a configuration error on the customer's Kubernetes cluster, the kubelet metrics were scraped multiple times (multiple ServiceMonitors for the kubelet). In the kube-prometheus stack there are the following recording rules:

  spec:
    groups:
    - name: kubelet.rules
      rules:
      - expr: histogram_quantile(0.99, sum(rate(kubelet_pleg_relist_duration_seconds_bucket{job="kubelet",
          metrics_path="/metrics"}[5m])) by (cluster, instance, le) * on (cluster,
          instance) group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"})
        labels:
          quantile: "0.99"
        record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile
      - expr: histogram_quantile(0.9, sum(rate(kubelet_pleg_relist_duration_seconds_bucket{job="kubelet",
          metrics_path="/metrics"}[5m])) by (cluster, instance, le) * on (cluster,
          instance) group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"})
        labels:
          quantile: "0.9"
        record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile
      - expr: histogram_quantile(0.5, sum(rate(kubelet_pleg_relist_duration_seconds_bucket{job="kubelet",
          metrics_path="/metrics"}[5m])) by (cluster, instance, le) * on (cluster,
          instance) group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"})
        labels:
          quantile: "0.5"
        record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile
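
Each of these expressions joins the bucket rate onto kubelet_node_name with * on (cluster, instance) group_left(node), which requires exactly one kubelet_node_name series per (cluster, instance) on the right-hand side. With two ServiceMonitors scraping the same kubelet that is no longer true. The duplication can be spotted from the Mimir side with a query along these lines (a sketch using the same labels as the rules above, not part of the mixin):

  count by (cluster, instance) (kubelet_node_name{job="kubelet", metrics_path="/metrics"}) > 1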

These rules fail with a many-to-many matching not allowed error if the kubelet is scraped by multiple jobs. This is obviously a user error, and in the Mimir logs we can observe the corresponding error messages:

ts=2024-03-20T06:27:56.041993291Z caller=group.go:480 level=warn name=KubeletPodStartUpLatencyHigh index=5 component=ruler insight=true user=tenant1 file=/var/lib/mimir/ruler/tenant1/agent%2Fmonitoring%2Fkube-prometheus-stack-kubernetes-system-kubelet%2Ff4851e1e-c337-4c08-8dc2-4c47642212f9 group=kubernetes-system-kubelet msg="Evaluating rule failed" rule="alert: KubeletPodStartUpLatencyHigh\nexpr: histogram_quantile(0.99, sum by (cluster, instance, le) (rate(kubelet_pod_worker_duration_seconds_bucket{job=\"kubelet\",metrics_path=\"/metrics\"}[5m])))\n  * on (cluster, instance) group_left (node) kubelet_node_name{job=\"kubelet\",metrics_path=\"/metrics\"}\n  > 60\nfor: 15m\nlabels:\n  severity: warning\nannotations:\n  description: Kubelet Pod startup 99th percentile latency is {{ $value }} seconds\n    on node {{ $labels.node }}.\n  runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletpodstartuplatencyhigh\n  summary: Kubelet Pod startup latency is too high.\n" err="found duplicate series for the match group {instance=\"192.168.16.181:10250\"} on the right hand-side of the operation: [{__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"192.168.16.181:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"master1.k8s-test.tenant.org\", prometheus=\"monitoring/kube-prometheus-stack-prometheus\", service=\"kube-prometheus-stack-kubelet\"}, {__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"192.168.16.181:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"master1.k8s-test.tenant.org\", prometheus=\"monitoring/kube-prometheus-stack-prometheus\", service=\"kube-prometheus-kube-prome-kubelet\"}];many-to-many matching not allowed: matching labels must be unique on one side"
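
The real fix is on the customer side (remove the duplicate ServiceMonitor), but for illustration the join itself could be made tolerant of duplicate scrapes by deduplicating the right-hand side first, e.g. (a sketch, not the upstream kube-prometheus rule):

  histogram_quantile(0.99, sum by (cluster, instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket{job="kubelet", metrics_path="/metrics"}[5m]))
    * on (cluster, instance) group_left(node)
      max by (cluster, instance, node) (kubelet_node_name{job="kubelet", metrics_path="/metrics"}))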

As soon as Mimir evaluates these rules, the MimirRulerTooManyFailedQueries alert is triggered. However, according to the runbook for this alert, such user errors should not trigger it:

Each rule evaluation may fail due to many reasons, eg. due to invalid PromQL expression, or query hits limits on number of chunks. These are “user errors”, and this alert ignores them.

(https://grafana.com/docs/mimir/latest/manage/mimir-runbooks/#mimirrulertoomanyfailedqueries)

To Reproduce

Steps to reproduce the behavior:

  1. Start Mimir (Mimir, version 2.11.0 (branch: release-2.11, revision: c8939ea55))
  2. Create multiple scrape jobs for the kubelet
  3. Create the recording rules specified above
  4. Check the result of the following alert expression (as per the mimir-mixin):
100 * (sum by (cluster, team, instance) (rate(cortex_ruler_queries_failed_total{job="mimir"}[5m])) / sum by (cluster, team, instance) (rate(cortex_ruler_queries_total{job="mimir"}[5m]))) > 1
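
The two counters behind that ratio can also be graphed on their own to confirm which side is moving (same selectors as the alert expression above):

  sum by (cluster, instance) (rate(cortex_ruler_queries_failed_total{job="mimir"}[5m]))
  sum by (cluster, instance) (rate(cortex_ruler_queries_total{job="mimir"}[5m]))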

Expected behavior

I would expect that user errors (such as a rule with many-to-many matching) do not increase the cortex_ruler_queries_failed_total counter.

Environment

  • Infrastructure: bare-metal
  • Deployment tool: ansible
  • mimir mixin (main, https://github.com/grafana/mimir/blob/main/operations/mimir-mixin-compiled/alerts.yaml#L434)

Additional Context

I saw this Cortex issue, which might be relevant.

rekup avatar Mar 20 '24 06:03 rekup

cc: @krajorama

We also got MimirRulerTooManyFailedQueries due to a bad rule uploaded by a user.

rishabhkumar92 avatar Mar 25 '24 05:03 rishabhkumar92

Reproduced with mimir-distributed 5.2.2 (Mimir 2.11).

Update: this repro uses the built-in querier in the ruler, not the remote ruler-querier functionality!

I've started the chart with metamonitoring enabled to get some metrics and created a recording rule for cortex_ingester_active_series{} * on (container) cortex_build_info, which results in this error:

execution: found duplicate series for the match group {container="ingester"} on the right hand-side of the operation:
 [{__name__="cortex_build_info", __replica__="replica-0", branch="HEAD", cluster="krajo", container="ingester", 
endpoint="http-metrics", goversion="go1.21.4", instance="10.1.23.175:8080", job="dev/ingester", namespace="dev", 
pod="krajo-mimir-ingester-zone-a-0", revision="c8939ea", service="krajo-mimir-ingester-zone-a", version="2.11.0"}, 
{__name__="cortex_build_info", __replica__="replica-0", branch="HEAD", cluster="krajo", container="ingester", 
endpoint="http-metrics", goversion="go1.21.4", instance="10.1.23.145:8080", job="dev/ingester", namespace="dev", 
pod="krajo-mimir-ingester-zone-b-0", revision="c8939ea", service="krajo-mimir-ingester-zone-b", version="2.11.0"}];
many-to-many matching not allowed: matching labels must be unique on one side

I see cortex_ruler_queries_failed_total increase.
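
For reference, the rule used for the repro is roughly the following (a sketch; the group name krajogroup matches the rule_group label above, while the record name is a placeholder):

  groups:
  - name: krajogroup
    rules:
    - record: repro:ingester_active_series:by_build_info
      expr: cortex_ingester_active_series{} * on (container) cortex_build_info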

krajorama avatar Apr 02 '24 07:04 krajorama

I've upgraded to 5.3.0-weekly.283, which has a build from 26 March (https://github.com/grafana/mimir/tree/r283 , https://github.com/grafana/mimir/commit/7728f420184b09770910248966db8801a1c2cabc). It doesn't have the issue: cortex_ruler_queries_failed_total dropped to 0, though the logs still contain the error message.

At the same time I see cortex_prometheus_rule_evaluation_failures_total showing the errors, with these labels:

__replica__="replica-0",
cluster="krajo",
container="ruler",
endpoint="http-metrics",
instance="10.1.23.183:8080",
job="dev/ruler",
namespace="dev",
pod="krajo-mimir-ruler-85dc995ff6-9h5jj",
rule_group="/data/metamonitoring/krajons;krajogroup",
service="krajo-mimir-ruler",
user="metamonitoring"

I'm pretty sure this was actually fixed by me in https://github.com/grafana/mimir/pull/7567. However, that PR just missed the cut-off for the 2.12 release by a couple of days.

krajorama avatar Apr 02 '24 07:04 krajorama

Could not reproduce with the remote ruler on the latest weekly (r284-6db12671).

At first I thought I had, but the ruler dashboard actually uses cortex_prometheus_rule_evaluation_failures_total, which is what started increasing; cortex_ruler_queries_failed_total remained at 0.

krajorama avatar Apr 03 '24 06:04 krajorama

Tested with v2.12.0-rc.4. Could not reproduce, so I think the remote ruler case is fixed in 2.12, most likely by #7472.

Summary: this should be fixed for the remote ruler case in 2.12, and will be fixed for the normal (built-in querier) ruler case in 2.13.

krajorama avatar Apr 03 '24 07:04 krajorama

2.12 has been released. We should be good to close this, right?

56quarters avatar May 14 '24 15:05 56quarters