[kube-prometheus-stack] [48.2.3] many-to-many matching not allowed
Describe the bug
Hi, after the default install I see these messages in the Prometheus log:
ts=2023-08-04T08:39:21.061Z caller=manager.go:663 level=warn component="rule manager" file=/etc/prometheus/rules/prometheus-prometheus-prometheus-rulefiles-0/monitoring-prometheus-kubelet.rules-882ac82c-59d3-443c-8744-33bf4f2a9757.yaml group=kubelet.rules name=node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile index=2 msg="Evaluating rule failed" rule="record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile\nexpr: histogram_quantile(0.5, sum by (cluster, instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket{job="kubelet",metrics_path="/metrics"}[5m]))\n * on (cluster, instance) group_left (node) kubelet_node_name{job="kubelet",metrics_path="/metrics"})\nlabels:\n quantile: "0.5"\n" err="found duplicate series for the match group {instance="10.124.10.2:10250"} on the right hand-side of the operation: [{name="kubelet_node_name", instance="10.124.10.2:10250", job="kubelet", metrics_path="/metrics", namespace="kube-system", node="cp2.example.com", service="prometheus-kubelet"}, {name="kubelet_node_name", instance="10.124.10.2:10250", job="kubelet", metrics_path="/metrics", namespace="kube-system", node="cp1.example.com", service="bdoc-kubelet"}];many-to-many matching not allowed: matching labels must be unique on one side"
What's your helm version?
v3.12.0
What's your kubectl version?
1.26.4
Which chart?
kube-prometheus-stack
What's the chart version?
48.2.3
What happened?
Installed the chart with default values
What you expected to happen?
No errors in the Prometheus log
How to reproduce it?
Installing the chart with default values
Enter the changed values of values.yaml?
No response
Enter the command that you execute and failing/misfunctioning.
Installed with the Ansible Helm module
Anything else we need to know?
No response
This might help:
I was still experiencing issues with version: 48.3.1
When I run: kubectl get svc -n kube-system -l k8s-app=kubelet
It listed 3 services with the same label, which I think is the issue:
kube-prometheus-stack-kubelet prom-kube-prometheus-stack-kubelet prometheus-kube-prometheus-kubelet
With this, of course, I experienced those issues all the time.
But I deleted everything with helm uninstall kube-prometheus-stack -n monitoring
and then manually deleted the services in 'kube-system'.
Now I only have one service in kube-system, kube-prometheus-stack-kubelet, and I don't have log errors in the Prometheus pod.
I guess this was happening because previous installs left some resources behind that were not deleted correctly.
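For anyone cleaning this up, a minimal sketch of those steps (the Service names are the leftovers from this example; keep whichever one your current release actually created):

# List every kubelet Service created by current and previous releases
# (k8s-app=kubelet is the label used in the command above).
kubectl get svc -n kube-system -l k8s-app=kubelet

# Remove the release, then delete the Services left behind by older installs.
helm uninstall kube-prometheus-stack -n monitoring
kubectl delete svc -n kube-system prom-kube-prometheus-stack-kubelet prometheus-kube-prometheus-kubelet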
We managed to solve this by deleting one of the Service objects that was left behind after an uninstall/install of kube-prometheus-stack. As seen in the error message above, two series with different label sets are mentioned; the group_left/group_right modifiers in PromQL require the matching labels to be unique on the "one" side. The service label is not the same here, hence the error message!
{
  __name__="kubelet_node_name",
  instance="10.124.10.2:10250",
  job="kubelet",
  metrics_path="/metrics",
  namespace="kube-system",
  node="cp2.example.com",
  service="prometheus-kubelet"
},
{
  __name__="kubelet_node_name",
  instance="10.124.10.2:10250",
  job="kubelet",
  metrics_path="/metrics",
  namespace="kube-system",
  node="cp1.example.com",
  service="bdoc-kubelet"
}
Try to remove the one that isn't in use and see if the problem gets resolved!
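If you want to see which Service each duplicate series belongs to before deleting anything, here is a quick sketch; the monitoring namespace and the prometheus-operated Service name are assumptions based on the operator's defaults:

# Port-forward the Prometheus instance created by the chart.
kubectl -n monitoring port-forward svc/prometheus-operated 9090 &

# Any instance with more than one kubelet_node_name series is what breaks group_left;
# the service label tells you which Service object each series was scraped through.
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=count by (instance, service) (kubelet_node_name{job="kubelet"})'

# Then delete the stale Service, e.g. the old bdoc-kubelet one from the error above.
kubectl delete svc -n kube-system bdoc-kubelet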
I also faced this issue in 48.3.1. I solved it by configuring the Prometheus Operator not to create the kubelet service in the "kube-system" namespace. On my GKE cluster there is an existing "kubelet" service in "kube-system".
kube-prometheus-stack:
  enabled: true
  fullnameOverride: prometheus
  prometheusOperator:
    kubeletService:
      enabled: false
Below are the services in my "kube-system" namespace (on GKE cluster).
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
calico-typha ClusterIP 172.16.0.246 <none> 5473/TCP 412d
default-http-backend NodePort 172.16.0.86 <none> 80:30648/TCP 412d
kube-dns ClusterIP 172.16.0.10 <none> 53/UDP,53/TCP 412d
kubelet ClusterIP None <none> 10250/TCP,10255/TCP,4194/TCP 412d
metrics-server ClusterIP 172.16.0.121 <none> 443/TCP 412d
prometheus-coredns ClusterIP None <none> 9153/TCP 119m
prometheus-kube-controller-manager ClusterIP None <none> 10257/TCP 119m
prometheus-kube-etcd ClusterIP None <none> 2381/TCP 119m
prometheus-kube-proxy ClusterIP None <none> 10249/TCP 119m
prometheus-kube-scheduler ClusterIP None <none> 10259/TCP 119m
prometheus-kubelet ClusterIP None <none> 10250/TCP,10255/TCP,4194/TCP 6m51s
The "kubelet" service is, I believe, created by GKE, and the second one, "prometheus-kubelet", is created by the Prometheus Operator. Just turn the latter off with the YAML snippet above.
I hope this will help.
Once you have re-configured the Prometheus Operator you may need to delete the "prometheus-kubelet" service manually yourself; it won't be re-created.
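If you take this route, here is a minimal sketch of applying it; the release name, namespace and values file are assumptions, and the --set flag is the direct-install equivalent of the sub-chart snippet above:

# Re-apply the chart with kubelet Service creation turned off.
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  -n monitoring -f values.yaml \
  --set prometheusOperator.kubeletService.enabled=false

# The Service the operator created earlier is not cleaned up automatically, so remove it by hand.
kubectl delete svc -n kube-system prometheus-kubelet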
Thanks for the tip about the leftover kubelet services, it works for me.
I had installed three releases of the kube-prometheus-stack chart and those services stayed in my cluster.
By the way, is there any effect if the "kubelet" service in kube-system is deleted?
o.O (I have deleted it already.)
Our K8s cluster is installed with Talos Linux; here is some information about the cluster:
- Total nodes: 6
- Control plane: 3
- Workers: 3
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
example-core-controlplane-111 Ready control-plane 3d14h v1.30.1 10.10.30.111 <none> Talos (v1.7.4) 6.6.32-talos containerd://1.7.16
example-core-controlplane-112 Ready control-plane 3d14h v1.30.1 10.10.30.112 <none> Talos (v1.7.4) 6.6.32-talos containerd://1.7.16
example-core-controlplane-113 Ready control-plane 3d14h v1.30.1 10.10.30.113 <none> Talos (v1.7.4) 6.6.32-talos containerd://1.7.16
example-core-worker-121 Ready <none> 3d14h v1.30.1 10.10.30.121 <none> Talos (v1.7.4) 6.6.32-talos containerd://1.7.16
example-core-worker-122 Ready <none> 3d14h v1.30.1 10.10.30.122 <none> Talos (v1.7.4) 6.6.32-talos containerd://1.7.16
example-core-worker-123 Ready <none> 3d14h v1.30.1 10.10.30.123 <none> Talos (v1.7.4) 6.6.32-talos containerd://1.7.16
kubectl -n kube-system get pods
NAME READY STATUS RESTARTS AGE
cilium-49jps 1/1 Running 0 2d10h
cilium-4kgsn 1/1 Running 2 (2d10h ago) 2d12h
cilium-5z9kn 1/1 Running 2 (2d10h ago) 3d14h
cilium-9ptxb 1/1 Running 1 (2d10h ago) 2d11h
cilium-h6zp6 1/1 Running 0 2d10h
cilium-ksnhp 1/1 Running 1 (2d10h ago) 2d11h
cilium-operator-d759676d-5mlnv 1/1 Running 1 (129m ago) 2d10h
cilium-operator-d759676d-j9bws 1/1 Running 4 (130m ago) 2d12h
cilium-operator-d759676d-ks9qj 0/1 Completed 0 3d14h
coredns-64b67fc8fd-kwmv6 1/1 Running 0 2d10h
coredns-64b67fc8fd-x9f6t 1/1 Running 1 (2d10h ago) 2d11h
hubble-relay-7b8fb45847-qr5hj 1/1 Running 0 2d10h
hubble-ui-69f99566c5-cfrjh 2/2 Running 0 2d10h
kube-apiserver-example-core-controlplane-111 1/1 Running 0 128m
kube-apiserver-example-core-controlplane-112 1/1 Running 0 128m
kube-apiserver-example-core-controlplane-113 1/1 Running 0 128m
kube-controller-manager-example-core-controlplane-111 1/1 Running 0 9h
kube-controller-manager-example-core-controlplane-112 1/1 Running 3 (129m ago) 129m
kube-controller-manager-example-core-controlplane-113 1/1 Running 3 (129m ago) 129m
kube-scheduler-example-core-controlplane-111 1/1 Running 3 (129m ago) 129m
kube-scheduler-example-core-controlplane-112 1/1 Running 3 (129m ago) 129m
kube-scheduler-example-core-controlplane-113 1/1 Running 3 (129m ago) 129m
kubectl -n kube-system get svc -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
core-kube-prometheus-stack-coredns ClusterIP None <none> 9153/TCP 58m k8s-app=kube-dns
core-kube-prometheus-stack-kube-controller-manager ClusterIP None <none> 10257/TCP 58m k8s-app=kube-controller-manager
core-kube-prometheus-stack-kube-scheduler ClusterIP None <none> 10259/TCP 58m k8s-app=kube-scheduler
core-kube-prometheus-stack-kubelet ClusterIP None <none> 10250/TCP,10255/TCP,4194/TCP 58m <none>
hubble-peer ClusterIP 10.100.114.25 <none> 443/TCP 3d14h k8s-app=cilium
hubble-relay ClusterIP 10.98.187.101 <none> 80/TCP 3d14h k8s-app=hubble-relay
hubble-ui ClusterIP 10.110.242.16 <none> 80/TCP 3d14h k8s-app=hubble-ui
kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 3d14h k8s-app=kube-dns
kubectl -n kube-system get ep -o wide
NAME ENDPOINTS AGE
core-kube-prometheus-stack-coredns 10.244.2.145:9153,10.244.3.113:9153 59m
core-kube-prometheus-stack-kube-controller-manager 10.10.30.111:10257,10.10.30.112:10257,10.10.30.113:10257 59m
core-kube-prometheus-stack-kube-scheduler 10.10.30.111:10259,10.10.30.112:10259,10.10.30.113:10259 59m
core-kube-prometheus-stack-kubelet 10.10.30.111:10250,10.10.30.112:10250,10.10.30.113:10250 + 15 more... 59m
hubble-peer 10.10.30.111:4244,10.10.30.112:4244,10.10.30.113:4244 + 3 more... 3d14h
hubble-relay 10.244.3.119:4245 3d14h
hubble-ui 10.244.5.129:8081 3d14h
kube-dns 10.244.2.145:53,10.244.3.113:53,10.244.2.145:53 + 3 more... 3d14h
Preparing values.yaml for installing Prometheus:
kubeControllerManager:
  service:
    selector:
      k8s-app: kube-controller-manager
kubeScheduler:
  service:
    selector:
      k8s-app: kube-scheduler
kubeEtcd:
  # In a Talos Linux setup, etcd is typically managed as a static pod directly
  # on the control-plane nodes, not as a regular pod within the Kubernetes cluster.
  enabled: false
kubeProxy:
  # kube-proxy is disabled in our Talos Linux setup; we are using KubePrism
  # (KubePrism provides an in-cluster highly available control-plane endpoint on every node in the cluster).
  enabled: false
kube-prometheus-stack is installed as follows:
helm upgrade --install core prometheus-community/kube-prometheus-stack \
--namespace monitoring-core \
--version 61.1.0 \
--values values.yaml
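A small verification sketch one can run after the install (the monitoring-core namespace comes from the command above; the ServiceMonitor check assumes the chart's CRDs are installed):

# There should be exactly one kubelet Service/Endpoints pair in kube-system.
kubectl get svc,endpoints -n kube-system | grep -i kubelet

# And only one ServiceMonitor selecting it; more than one usually means an older release left one behind.
kubectl get servicemonitors -n monitoring-core | grep -i kubelet

# Each node should then appear only once under the kubelet job in the Prometheus UI (Status -> Targets).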
There are 6 kubelet instances (one per node).
In the Networking dashboards there are errors like: execution: multiple matches for labels: many-to-one matching must be explicit (group_left/group_right)
Is this the right way to configure it?
Our questions:
- Is our setup wrong?
- Is there a bug that should be fixed in the prometheus-community charts?
Thanks for any advice!