
kube_pod_annotations reports incorrect nodes (all metrics for the same node)

Open vyakovlev-hw opened this issue 2 years ago • 5 comments

What happened: I have 50+ K8s nodes running similar app pods with annotations. All kube_pod_annotations series report the same node; for example:

  • count(kube_pod_annotations{annotation_jenkins_template!="",node=~"k8s-node55.*"}) = 329 (i.e. 329 pods appear to be assigned to k8s-node55 and none to the other nodes)
  • count(kube_pod_annotations{annotation_jenkins_template!="",node=~"k8s-node55.*"} and on (pod) kube_pod_info{node=~"k8s-node55.*"}) = 4 (i.e. there are actually 4 running annotated pods on that node, which matches the kubectl get pods output exactly; a join-based workaround is sketched below)
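
As a possible workaround (a sketch only; it assumes a single kube-state-metrics instance, so there is exactly one kube_pod_info series per namespace/pod), the correct node can be taken from kube_pod_info with a group_left join instead of trusting the node label on kube_pod_annotations:

# Count annotated pods per node, pulling the node from kube_pod_info.
# The sum by (...) drops the bogus node label before the join.
count by (node) (
    sum by (namespace, pod) (kube_pod_annotations{annotation_jenkins_template!=""})
  * on (namespace, pod) group_left (node)
    kube_pod_info
)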

What you expected to happen:

I expect the kube_pod_annotations metrics to carry the correct node label.

How to reproduce it (as minimally and precisely as possible):

I have a typical configuration; no metrics are being changed or adjusted at that point. So: install KSM with Helm, add an annotation to any pod, and query the metrics.

{
  "__name__": "kube_pod_annotations",
  "annotation_build_url": "...",
  "annotation_jenkins_template": "...",
  "app_kubernetes_io_component": "metrics",
  "app_kubernetes_io_instance": "prometheus",
  "app_kubernetes_io_managed_by": "Helm",
  "app_kubernetes_io_name": "kube-state-metrics",
  "app_kubernetes_io_part_of": "kube-state-metrics",
  "app_kubernetes_io_version": "2.5.0",
  "environment": "...",
  "helm_sh_chart": "kube-state-metrics-4.13.0",
  "instance": "...:8080",
  "job": "kubernetes-service-endpoints",
  "namespace": "...",
  "node": "k8s-node55...",
  "pod": "pod-xxx",
  "service": "prometheus-kube-state-metrics",
  "uid": "c8ec8299-85c2-48d5-b531-6f50acde9071"
}

Anything else we need to know?:

Everything works fine with other pod metrics, e.g. kube_pod_info or kube_pod_labels.
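
Since kube_pod_info does carry the right node, the disagreement can also be surfaced directly. A diagnostic sketch (not part of the issue template, just an illustration):

# kube_pod_annotations series whose node label has no matching kube_pod_info
# series for the same pod, i.e. pods where the node label disagrees
kube_pod_annotations
  unless on (namespace, pod, node)
kube_pod_info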

Environment:

  • kube-state-metrics version: 2.5.0
  • Kubernetes version (use kubectl version): v1.22.5
  • Cloud provider or hardware configuration: bare metal servers
  • Other info:

vyakovlev-hw avatar Dec 08 '22 11:12 vyakovlev-hw

/triage accepted /assign @rexagod

dashpole avatar Dec 15 '22 17:12 dashpole

@vyakovlev-hw Maybe I'm missing something, but I'm not sure how you're able to see the node label in kube_pod_annotations, since we don't expose one as of v2.5.0, or even now.

rexagod avatar Dec 16 '22 10:12 rexagod

@rexagod That's weird then: we haven't changed anything, and this is what I get for kube_pod_annotations{job="kubernetes-service-endpoints"} (some labels removed):

[
  {
    "metric": {
      "__name__": "kube_pod_annotations",
      "annotation_run_url": "job/xxx/job/yyy/job/bla/51/",
      "environment": "k8s-xxx",
      "instance": "10.xxx.yyy.zz7:8080",
      "namespace": "xxx-xxx",
      "node": "k8s-nodeXX.something.com",
      "pod": "aa-bb-cc-dd",
      "uid": "3c7e7ec1-b825-4272-ace9-3d800134d446"
    },
    "value": [
      1671193424.902,
      "1"
    ],
    "group": 1
  },
  {
    "metric": {
      "__name__": "kube_pod_annotations",
      "annotation_run_url": "job/xxx/job/yyy/job/bla/51/",
      "environment": "k8s-xxx",
      "instance": "10.xxx.yyy.zz7:8080",
      "job": "kubernetes-service-endpoints",
      "namespace": "xxx-xxx",
      "node": "k8s-nodeXX.something.com",
      "pod": "aa-bb-cc-dd",
      "uid": "fd6a5712-c277-49c2-bc6a-a6dbe539e138"
    },
    "value": [
      1671193424.902,
      "1"
    ],
    "group": 1
  },
  {
    "metric": {
      "__name__": "kube_pod_annotations",
      "annotation_jenkins_template": "xxx-yyy",
      "annotation_run_url": "job/xxx/job/yyy/job/bla/7/",
      "environment": "k8s-xxx",
      "instance": "10.xxx.yyy.zz7:8080",
      "job": "kubernetes-service-endpoints",
      "namespace": "xxx-xxx",
      "node": "k8s-nodeXX.something.com",
      "pod": "aa-bb-cc-dd",
      "uid": "e7d6b497-e133-4214-9c6b-d0c09f425592"
    },
    "value": [
      1671193424.902,
      "1"
    ],
    "group": 1
  }
]

I have checked all VictoriaMetrics rules to make sure we aren't adding this label somewhere ourselves; we aren't.

vyakovlev-hw avatar Dec 16 '22 12:12 vyakovlev-hw

@rexagod Hello, this bug affects not only kube_pod_annotations but also kube_pod_labels.

Environment:

  • kube-state-metrics version: 2.9.2
  • kube-state-metrics Helm chart version: 5.11.0
  • Kubernetes version (use kubectl version): v1.24.17
  • Cloud provider or hardware configuration: bare metal

We have a test cluster with multiple nodes, each named clusternX.

Looking at our Prometheus config, I see the following section under relabel_configs for each of the jobs kubernetes-service-endpoints, kubernetes-service-endpoints-slow, kubernetes-pods, and kubernetes-pods-slow:

    - source_labels: [__meta_kubernetes_pod_node_name]
      separator: ;
      regex: (.*)
      target_label: node
      replacement: $1
      action: replace
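
If that rule is the source of the node label, it applies the node of the scrape target (i.e. the node running the kube-state-metrics pod) to every series scraped from that endpoint, and only metrics that expose a node label themselves (such as kube_pod_info) can end up with the real value. A sketch of a check along those lines, using the job label from the outputs below:

# If node is target-derived, all kube_pod_labels series collapse onto the single
# node that runs kube-state-metrics...
count by (node) (kube_pod_labels{job="kubernetes-service-endpoints"})

# ...whereas kube_pod_info, which exposes node itself, spreads across all nodes.
count by (node) (kube_pod_info{job="kubernetes-service-endpoints"})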

I have just created a plain alpine pod in the default namespace using this YAML file:

apiVersion: v1
kind: Pod
metadata:
  name: alpine
spec:
  containers:
  - image: alpine:latest
    command:
      - /bin/sh
      - "-c"
      - "sleep 60m"
    imagePullPolicy: IfNotPresent
    name: alpine

And I see that it is running on node clustern2:

$ kubectl create -f alpine.yaml
$ kubectl describe po alpine | grep Node
Node:             clustern2/192.168.70.102
Node-Selectors:              <none>

This is what I see when querying the different metrics:

kube_pod_info{pod="alpine"}

kube_pod_info{app_kubernetes_io_component="metrics", app_kubernetes_io_instance="prometheus", app_kubernetes_io_managed_by="Helm", app_kubernetes_io_name="kube-state-metrics", app_kubernetes_io_part_of="kube-state-metrics", app_kubernetes_io_version="2.9.2", helm_sh_chart="kube-state-metrics-5.11.0", host_ip="192.168.70.102", host_network="false", instance="10.0.1.80:8080", job="kubernetes-service-endpoints", namespace="default", node="clustern2", pod="alpine", pod_ip="10.0.12.60", service="prometheus-kube-state-metrics", uid="5f6720c6-933e-49de-80fd-e301e6aa1367"}

kube_pod_labels{pod="alpine"}

kube_pod_labels{app_kubernetes_io_component="metrics", app_kubernetes_io_instance="prometheus", app_kubernetes_io_managed_by="Helm", app_kubernetes_io_name="kube-state-metrics", app_kubernetes_io_part_of="kube-state-metrics", app_kubernetes_io_version="2.9.2", helm_sh_chart="kube-state-metrics-5.11.0", instance="10.0.1.80:8080", job="kubernetes-service-endpoints", namespace="default", node="clustern1", pod="alpine", service="prometheus-kube-state-metrics", uid="5f6720c6-933e-49de-80fd-e301e6aa1367"}

kube_pod_annotations{pod="alpine"}

kube_pod_annotations{app_kubernetes_io_component="metrics", app_kubernetes_io_instance="prometheus", app_kubernetes_io_managed_by="Helm", app_kubernetes_io_name="kube-state-metrics", app_kubernetes_io_part_of="kube-state-metrics", app_kubernetes_io_version="2.9.2", helm_sh_chart="kube-state-metrics-5.11.0", instance="10.0.1.80:8080", job="kubernetes-service-endpoints", namespace="default", node="clustern1", pod="alpine", service="prometheus-kube-state-metrics", uid="5f6720c6-933e-49de-80fd-e301e6aa1367"}

The alpine pod was scheduled on, and has only ever run on, node clustern2; it has never had anything to do with node clustern1. Our assumption is that we get the wrong node label because clustern1 is the node where the kube-state-metrics pod is running:

$ kubectl describe po prometheus-kube-state-metrics-76d96875dc-9qhl2 -n monitoring | grep Node
Node:             clustern1/192.168.72.101
Node-Selectors:              <none>
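
A way to test that assumption directly (a sketch; it relies on the uid label being present on both metrics, as in the outputs above) is to drop the scrape-time node label and re-attach the one reported by kube_pod_info. If the assumption holds, this returns node="clustern2" rather than "clustern1":

# Re-derive the node for the alpine pod from kube_pod_info instead of the
# target-derived label.
  sum by (namespace, pod, uid) (kube_pod_labels{pod="alpine"})
* on (uid) group_left (node)
  kube_pod_info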

tsipo avatar Nov 14 '23 15:11 tsipo

After looking into it further, it seems to me that the node label is replaced correctly only for kube_pod_info (via the relabel config mechanism mentioned above). It is also wrong for queries like kube_pod_container_info, and it even shows up on queries like kube_deployment_labels, where it is not relevant at all.
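
A query along these lines makes that last point concrete (a sketch; a deployment has no node of its own, so any node label here can only come from relabeling on the scrape target):

count by (node) (kube_deployment_labels)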

tsipo avatar Nov 17 '23 19:11 tsipo