[kube-prometheus-stack] CrashLoopBackOff on startup (matching labels must be unique on one side)
Describe the bug a clear and concise description of what the bug is.
Prometheus crash-loops on startup; rule evaluation fails with:
matching labels must be unique on one side
What's your helm version?
version.BuildInfo{Version:"v3.9.3", GitCommit:"414ff28d4029ae8c8b05d62aa06c7fe3dee2bc58", GitTreeState:"clean", GoVersion:"go1.17.13"}
What's your kubectl version?
Client Version: v1.24.3 Kustomize Version: v4.5.4 Server Version: v1.21.5
Which chart?
kube-prometheus-stack
What's the chart version?
42.3.0
What happened?
ts=2023-04-24T12:26:22.131Z caller=main.go:1221 level=info msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml totalDuration=445.781959ms db_storage=26.996µs remote_storage=6.487µs web_handler=1.266µs query_engine=2.241µs scrape=22.247212ms scrape_sd=82.817005ms notify=1.645616ms notify_sd=3.038914ms rules=277.574397ms tracing=1.287696ms
ts=2023-04-24T12:26:22.131Z caller=main.go:965 level=info msg="Server is ready to receive web requests."
ts=2023-04-24T12:26:22.131Z caller=manager.go:943 level=info component="rule manager" msg="Starting rule manager..."
ts=2023-04-24T12:26:44.111Z caller=manager.go:638 level=warn component="rule manager" file=/etc/prometheus/rules/prometheus-kube-prometheus-kube-prome-prometheus-rulefiles-0/monitoring-kube-prometheus-kube-prome-kubernetes-system-kubelet-0b034194-7f11-4bac-af6e-474aa7b075c2.yaml group=kubernetes-system-kubelet name=KubeletPodStartUpLatencyHigh index=5 msg="Evaluating rule failed" rule="alert: KubeletPodStartUpLatencyHigh\nexpr: histogram_quantile(0.99, sum by (cluster, instance, le) (rate(kubelet_pod_worker_duration_seconds_bucket{job=\"kubelet\",metrics_path=\"/metrics\"}[5m])))\n * on (cluster, instance) group_left (node) kubelet_node_name{job=\"kubelet\",metrics_path=\"/metrics\"}\n > 60\nfor: 15m\nlabels:\n severity: warning\nannotations:\n description: Kubelet Pod startup 99th percentile latency is {{ $value }} seconds\n on node {{ $labels.node }}.\n runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletpodstartuplatencyhigh\n summary: Kubelet Pod startup latency is too high.\n" err="found duplicate series for the match group {instance=\"10.15.26.12:10250\"} on the right hand-side of the operation: [{__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"10.15.26.12:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"nwa-prod-controlplaner-1\", service=\"kubelet\"}, {__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"10.15.26.12:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"nwa-prod-controlplaner-1\", service=\"kube-prometheus-kube-prome-kubelet\"}];many-to-many matching not allowed: matching labels must be unique on one side"
ts=2023-04-24T12:26:48.738Z caller=manager.go:638 level=warn component="rule manager" file=/etc/prometheus/rules/prometheus-kube-prometheus-kube-prome-prometheus-rulefiles-0/monitoring-kube-prometheus-kube-prome-kubelet.rules-b558bbcb-8faa-4fdd-a05a-80418d0f5777.yaml group=kubelet.rules name=node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile index=0 msg="Evaluating rule failed" rule="record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile\nexpr: histogram_quantile(0.99, sum by (cluster, instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m]))\n * on (cluster, instance) group_left (node) kubelet_node_name{job=\"kubelet\",metrics_path=\"/metrics\"})\nlabels:\n quantile: \"0.99\"\n" err="found duplicate series for the match group {instance=\"10.15.26.10:10250\"} on the right hand-side of the operation: [{__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"10.15.26.10:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"nwa-prod-controlplane-0\", service=\"kubelet\"}, {__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"10.15.26.10:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"nwa-prod-controlplane-0\", service=\"kube-prometheus-kube-prome-kubelet\"}];many-to-many matching not allowed: matching labels must be unique on one side"
ts=2023-04-24T12:26:49.175Z caller=manager.go:638 level=warn component="rule manager" file=/etc/prometheus/rules/prometheus-kube-prometheus-kube-prome-prometheus-rulefiles-0/monitoring-kube-prometheus-kube-prome-kubelet.rules-b558bbcb-8faa-4fdd-a05a-80418d0f5777.yaml group=kubelet.rules name=node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile index=1 msg="Evaluating rule failed" rule="record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile\nexpr: histogram_quantile(0.9, sum by (cluster, instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m]))\n * on (cluster, instance) group_left (node) kubelet_node_name{job=\"kubelet\",metrics_path=\"/metrics\"})\nlabels:\n quantile: \"0.9\"\n" err="found duplicate series for the match group {instance=\"10.15.26.10:10250\"} on the right hand-side of the operation: [{__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"10.15.26.10:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"nwa-prod-controlplane-0\", service=\"kubelet\"}, {__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"10.15.26.10:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"nwa-prod-controlplane-0\", service=\"kube-prometheus-kube-prome-kubelet\"}];many-to-many matching not allowed: matching labels must be unique on one side"
ts=2023-04-24T12:26:49.537Z caller=manager.go:638 level=warn component="rule manager" file=/etc/prometheus/rules/prometheus-kube-prometheus-kube-prome-prometheus-rulefiles-0/monitoring-kube-prometheus-kube-prome-kubelet.rules-b558bbcb-8faa-4fdd-a05a-80418d0f5777.yaml group=kubelet.rules name=node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile index=2 msg="Evaluating rule failed" rule="record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile\nexpr: histogram_quantile(0.5, sum by (cluster, instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m]))\n * on (cluster, instance) group_left (node) kubelet_node_name{job=\"kubelet\",metrics_path=\"/metrics\"})\nlabels:\n quantile: \"0.5\"\n" err="found duplicate series for the match group {instance=\"10.15.26.10:10250\"} on the right hand-side of the operation: [{__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"10.15.26.10:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"nwa-prod-controlplane-0\", service=\"kubelet\"}, {__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"10.15.26.10:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"nwa-prod-controlplane-0\", service=\"kube-prometheus-kube-prome-kubelet\"}];many-to-many matching not allowed: matching labels must be unique on one side"
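The err= lines above point at the root cause: kubelet_node_name is exposed through two different Services (service="kubelet" and service="kube-prometheus-kube-prome-kubelet"), so the * on (cluster, instance) group_left (node) join sees two right-hand series per instance and fails with "many-to-many matching not allowed". A quick way to confirm the duplicate Services (a sketch; the monitoring namespace and Service names are taken from this thread and may differ in your cluster):

# There should normally be exactly one kubelet Service in kube-system
kubectl -n kube-system get svc | grep -i kubelet

# Optionally confirm the duplicate series straight from Prometheus
kubectl -n monitoring port-forward svc/prometheus-operated 9090 &
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=count by (instance, service) (kubelet_node_name{job="kubelet"})'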
What you expected to happen?
No response
How to reproduce it?
No response
Enter the changed values of values.yaml?
No response
Enter the command that you executed and that is failing/misfunctioning.
helm install kube-prometheus --namespace monitoring kube-prometheus-stack -f values.yaml
alertmanager:
  enabled: true
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: rook-ceph-block
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi

grafana:
  enabled: true
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: nginx
    hosts:
      - grafana.mycluster.mycompany.local
  persistence:
    type: pvc
    enabled: true
    storageClassName: rook-ceph-block
    accessModes:
      - ReadWriteOnce
    size: 10Gi

prometheus:
  enabled: true
  thanosService:
    enabled: true
  thanosServiceMonitor:
    enabled: false
  extraSecret:
    name: thanos-objstore-config
    data:
      thanos-storage-config.yaml: |-
        type: S3
        config:
          bucket: thanos-data
          endpoint: minio.minio.svc.cluster.local:9000
          access_key: access_key
          secret_key: access_pass
          insecure: true
  prometheusSpec:
    disableCompaction: true
    #retention: 2h
    retention: 20d
    replicas: 2
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: rook-ceph-block
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 500Gi
    thanos:
      objectStorageConfig:
        key: "thanos-storage-config.yaml"
        name: "thanos-objstore-config"
Anything else we need to know?
I checked this issue and tried to delete the svc in the kube-system namespace, but that did not work: after deleting the svc it was recreated. https://github.com/prometheus-community/helm-charts/issues/635#issuecomment-774771566
I also tried chart version 45.6.0 and got the same error.
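For what it's worth, the kubelet Service is reconciled by the prometheus-operator of the release that is currently running, so deleting the Service owned by the live release only gets it recreated; a Service left behind by an old or renamed release stays gone. A sketch for telling the two apart (namespace and names assumed from this thread):

# The operator is started with --kubelet-service=<namespace>/<name>; that is the Service it keeps recreating
kubectl -n monitoring get deploy -o yaml | grep -- '--kubelet-service'

# Any other kubelet Service in kube-system (e.g. one named after a previous release) is the stale one to delete
kubectl -n kube-system get svc --show-labels | grep -i kubelet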
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
I am seeing the same thing. Please do not close this automatically.
We had the same issue. In our case the cause was that the Helm chart was first deployed under another release name; that release was uninstalled but left some resources in the cluster, and when the chart was deployed again under the changed release name we saw this error. Removing the old resources (IIRC it was some Services) fixed the problem.
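A sketch for hunting down such leftovers (the old release name is a placeholder, and which label the objects carry can vary by chart version):

# Objects that still carry the old release name
kubectl get svc,servicemonitors -A 2>/dev/null | grep -i <old-release-name>

# If the chart labelled them, filtering on the release label works too
kubectl get svc -A -l release=<old-release-name>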
This also happens to me with chart version 58.5.3.
This solved my issue:
kubectl -n kube-system delete svc prometheus-kube-prometheus-kubelet
In our case the reason was that the Helm chart was first deployed with another release name, then the chart was uninstalled but it left some resources in the cluster
This was my case as well: some Services in kube-system were not cleaned up when uninstalling a release and were affecting a new release with a different name. Manually deleting the two Services carrying the old release name from kube-system solved it.
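For completeness, after deleting the stale Services it is worth checking that each kubelet instance is backed by a single Service again (a sketch, names as used earlier in this thread):

# Exactly one kubelet Service should remain for the current release
kubectl -n kube-system get svc | grep -c kubelet

# In the Prometheus UI this should now return a single series per instance:
#   count by (instance) (kubelet_node_name{job="kubelet",metrics_path="/metrics"})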
This issue is being automatically closed due to inactivity.