
[kube-prometheus-stack] Kube prometheus default alerts issue

Open vijaymailb opened this issue 1 year ago • 0 comments

### Describe the bug

We are using kube-prometheus-stack version 61.3.2 with the default Prometheus rules for all of its components. Since we intend to use https://github.com/cloudflare/pint to lint the rules and identify missing metrics, we ran it against the default rules and found that many of them have linting issues.

### What's your helm version?

61.3.2

### What's your kubectl version?

1.28.11

### Which chart?

https://github.com/prometheus-community/helm-charts/edit/kube-prometheus-stack-61.3.2/

### What's the chart version?

61.3.2

### What happened?

The following default Prometheus rules have issues.

For example:

```
pint_problem{filename="/etc/prometheus/rules/prometheus-prometheus-stack-prometheus-rulefiles-0/core-stack-prometheus-stack-alertmanager.rules-8989ea89-f65c-452c-9fd8-f8269af0e2f7.yaml",kind="alerting",name="AlertmanagerClusterCrashlooping",owner="",problem="Template is using `job` label but the query removes it.",reporter="alerts/template",severity="bug"} 1
expression:
  - alert: AlertmanagerClusterCrashlooping
    annotations:
      description: '{{ $value | humanizePercentage }} of Alertmanager instances within
        the {{$labels.job}} cluster have restarted at least 5 times in the last 10m.'
      runbook_url: https://runbooks.prometheus-operator.dev/runbooks/alertmanager/alertmanagerclustercrashlooping
      summary: Half or more of the Alertmanager instances within the same cluster
        are crashlooping.
    expr: |-
      (
        count by (namespace,service,cluster) (
          changes(process_start_time_seconds{job="prometheus-stack-alertmanager",namespace="namespace1"}[10m]) > 4
        )
      /
        count by (namespace,service,cluster) (
          up{job="prometheus-stack-alertmanager",namespace="namespace1"}
        )
      )
      >= 0.5
pint_problem{filename="/etc/prometheus/rules/prometheus-prometheus-stack-prometheus-rulefiles-0/core-stack-prometheus-stack-etcd-test-f4c8-4208-a6b8-57da78332911.yaml",kind="alerting",name="etcdHighNumberOfLeaderChanges",owner="",problem="Template is using `job` label but `absent()` is not passing it.",reporter="alerts/template",severity="bug"} 1
pint_problem{filename="/etc/prometheus/rules/prometheus-prometheus-stack-prometheus-rulefiles-0/core-stack-prometheus-stack-alertmanager.rules-8989ea89-f65c-452c-9fd8-f8269af0e2f7.yaml",kind="alerting",name="AlertmanagerClusterCrashlooping",owner="",problem="`prom` Prometheus server at http://localhost:9090 has `process_start_time_seconds` metric with `job` label but there are no series matching `{job=\"prometheus-stack-alertmanager\"}` in the last 1w.",reporter="promql/series",severity="bug"} 1
pint_problem{filename="/etc/prometheus/rules/prometheus-prometheus-stack-prometheus-rulefiles-0/core-stack-prometheus-stack-alertmanager.rules-8989ea89-f65c-452c-9fd8-f8269af0e2f7.yaml",kind="alerting",name="AlertmanagerClusterDown",owner="",problem="Template is using `job` label but the query removes it.",reporter="alerts/template",severity="bug"} 1
pint_problem{filename="/etc/prometheus/rules/prometheus-prometheus-stack-prometheus-rulefiles-0/core-stack-prometheus-stack-alertmanager.rules-8989ea89-f65c-452c-9fd8-f8269af0e2f7.yaml",kind="alerting",name="AlertmanagerClusterFailedToSendAlerts",owner="",problem="Template is using `job` label but the query removes it.",reporter="alerts/template",severity="bug"} 1
pint_problem{filename="/etc/prometheus/rules/prometheus-prometheus-stack-prometheus-rulefiles-0/core-stack-prometheus-stack-alertmanager.rules-8989ea89-f65c-452c-9fd8-f8269af0e2f7.yaml",kind="alerting",name="AlertmanagerClusterFailedToSendAlerts",owner="",problem="Unnecessary wildcard regexp, simply use `alertmanager_notifications_failed_total{job=\"prometheus-stack-alertmanager\", namespace=\"core-stack\", integration=\"\"}` if you want to match on all time series for `alertmanager_notifications_failed_total` without the `integration` label.",reporter="promql/regexp",severity="bug"} 1
pint_problem{filename="/etc/prometheus/rules/prometheus-prometheus-stack-prometheus-rulefiles-0/core-stack-prometheus-stack-alertmanager.rules-8989ea89-f65c-452c-9fd8-f8269af0e2f7.yaml",kind="alerting",name="AlertmanagerClusterFailedToSendAlerts",owner="",problem="Unnecessary wildcard regexp, simply use `alertmanager_notifications_failed_total{job=\"prometheus-stack-alertmanager\", namespace=\"core-stack\"}` if you want to match on all `integration` values.",reporter="promql/regexp",severity="bug"} 1
pint_problem{filename="/etc/prometheus/rules/prometheus-prometheus-stack-prometheus-rulefiles-0/core-stack-prometheus-stack-alertmanager.rules-8989ea89-f65c-452c-9fd8-f8269af0e2f7.yaml",kind="alerting",name="AlertmanagerClusterFailedToSendAlerts",owner="",problem="Unnecessary wildcard regexp, simply use `alertmanager_notifications_total{job=\"prometheus-stack-alertmanager\", namespace=\"core-stack\", integration=\"\"}` if you want to match on all time series for `alertmanager_notifications_total` without the `integration` label.",reporter="promql/regexp",severity="bug"} 1
pint_problem{filename="/etc/prometheus/rules/prometheus-prometheus-stack-prometheus-rulefiles-0/core-stack-prometheus-stack-alertmanager.rules-8989ea89-f65c-452c-9fd8-f8269af0e2f7.yaml",kind="alerting",name="AlertmanagerClusterFailedToSendAlerts",owner="",problem="Unnecessary wildcard regexp, simply use `alertmanager_notifications_total{job=\"prometheus-stack-alertmanager\", namespace=\"core-stack\"}` if you want to match on all `integration` values.",reporter="promql/regexp",severity="bug"} 1
pint_problem{filename="/etc/prometheus/rules/prometheus-prometheus-stack-prometheus-rulefiles-0/core-stack-prometheus-stack-alertmanager.rules-8989ea89-f65c-452c-9fd8-f8269af0e2f7.yaml",kind="alerting",name="AlertmanagerConfigInconsistent",owner="",problem="Template is using `job` label but the query removes it.",reporter="alerts/template",severity="bug"} 1
pint_problem{filename="/etc/prometheus/rules/prometheus-prometheus-stack-prometheus-rulefiles-0/core-stack-prometheus-stack-node-exporter.rules-4a1ae679-a8b5-4bbf-9bfd-ed4f42728e9f.yaml",kind="recording",name="instance:node_load1_per_cpu:ratio",owner="",problem="This query will never return anything on `prom` Prometheus server at http://localhost:9090 because results from the right and the left hand side have different labels: `[container, endpoint, instance, job, namespace, node, pod, service]` != `[container, endpoint, instance, job, namespace, node, pod, receiver_opsgenie_admins, receiver_slack_cluster, service]`. Failing query: `node_load1{job=\"node-exporter\"} / instance:node_num_cpu:sum{job=\"node-exporter\"}`.",reporter="promql/vector_matching",severity="bug"} 1
pint_problem{filename="/etc/prometheus/rules/prometheus-prometheus-stack-prometheus-rulefiles-0/core-stack-prometheus-stack-kube-prometheus-node-recording.rules-7235730b-029a-4598-9d86-9729c424a8e2.yaml",kind="recording",name="cluster:node_cpu:ratio",owner="",problem="This query will never return anything on `prom` Prometheus server at http://localhost:9090 because results from the right and the left hand side have different labels: `[receiver_opsgenie_admins, receiver_slack_cluster]` != `[]`. Failing query: `cluster:node_cpu:sum_rate5m / count(sum by (instance, cpu) (node_cpu_seconds_total))`.",reporter="promql/vector_matching",severity="bug"} 1
```
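For the `alerts/template` findings, one possible adjustment (a sketch only, not necessarily how upstream would fix it) is to keep the `job` label in the aggregations so that `{{$labels.job}}` in the annotation still resolves, e.g. for `AlertmanagerClusterCrashlooping`:

```
    expr: |-
      (
        # "job" added to both by() clauses so the $labels.job annotation template keeps working
        count by (namespace,service,cluster,job) (
          changes(process_start_time_seconds{job="prometheus-stack-alertmanager",namespace="namespace1"}[10m]) > 4
        )
      /
        count by (namespace,service,cluster,job) (
          up{job="prometheus-stack-alertmanager",namespace="namespace1"}
        )
      )
      >= 0.5
```

Alternatively, the annotation could reference a label the query keeps (for example `{{$labels.service}}`), which avoids changing the grouping.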

Almost all alerts under `https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/templates/prometheus/rules-1.14/kubernetes-apps.yaml` are affected as well, for example:
```
pint_problem{filename="/etc/prometheus/rules/prometheus-prometheus-stack-prometheus-rulefiles-0/core-stack-prometheus-stack-kubernetes-apps-e679ee64-11ae-433d-820f-b5221857004e.yaml",kind="alerting",name="KubeStatefulSetUpdateNotRolledOut",owner="",problem="Unnecessary wildcard regexp, simply use `kube_statefulset_replicas{job=\"kube-state-metrics\"}` if you want to match on all `namespace` values.",reporter="promql/regexp",severity="bug"} 1
```
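The `promql/regexp` findings appear to come from the chart rendering the namespace selector as a wildcard regexp when no target namespace is configured. A simplified before/after sketch of what pint suggests (the selectors are taken from the report above):

```
# as currently rendered (simplified): wildcard regexp matching every namespace
kube_statefulset_replicas{job="kube-state-metrics", namespace=~".*"}

# equivalent selector without the unnecessary regexp, as suggested by pint
kube_statefulset_replicas{job="kube-state-metrics"}
```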

All of the rules listed above fail pint's checks.

### What you expected to happen?

The default Prometheus rules need to be adjusted in order to get rid of these linting errors.

### How to reproduce it?

Run pint as a sidecar to Prometheus against the generated rule files; it will report the linting problems shown above.
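For reference, a minimal sketch of one way to wire this up through the chart's `prometheus.prometheusSpec.containers` (the image tag and the exact pint arguments are assumptions, please check the pint documentation):

```
prometheus:
  prometheusSpec:
    containers:
      - name: pint
        # image/tag assumed; pin a released version in practice
        image: ghcr.io/cloudflare/pint:latest
        # "pint watch" keeps re-linting the rule files and exposes pint_problem metrics;
        # the argument form here is an assumption, verify against the pint docs
        args:
          - watch
          - glob
          - /etc/prometheus/rules
        volumeMounts:
          # volume name taken from the rule file paths in the report above
          - name: prometheus-prometheus-stack-prometheus-rulefiles-0
            mountPath: /etc/prometheus/rules/prometheus-prometheus-stack-prometheus-rulefiles-0
            readOnly: true
```

pint additionally needs its own configuration (e.g. a `.pint.hcl` pointing at the Prometheus API on `http://localhost:9090`, which is where the `prom` server references above come from) for the `promql/series` and `promql/vector_matching` checks; that part is omitted here.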

### Enter the changed values of values.yaml?

_No response_

### Enter the command that you execute and failing/misfunctioning.

There is no failing command as such; nothing is wrong with the Helm chart installation itself. The default Prometheus rules need to be adjusted.

### Anything else we need to know?

_No response_

vijaymailb · Sep 23 '24 11:09