
deleted pods still reporting metrics

Open jpdstan opened this issue 3 years ago • 9 comments

What happened:

It seems that sometimes metrics don't get deleted alongside the pod. It isn't until we churn all the kube-state-metrics pods that the stale series finally go away.

What's even stranger is that not all metrics for the deleted pod linger; for example, for one pod that was deleted, we noticed that it was still reporting kube_pod_container_status_waiting_reason, but not kube_pod_container_resource_requests.

What you expected to happen:

When a pod gets deleted, all metrics associated with that pod should also be deleted.

How to reproduce it (as minimally and precisely as possible):

It's unclear how this happens: whenever we try to reproduce it by manually deleting a pod and querying for all of its metrics ({pod="my_pod"}), everything works as expected, i.e. the metrics all disappear.
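For reference, a query along these lines (my_pod is just a placeholder name; grouping by __name__ should work on any reasonably recent Prometheus) lists which metric names still have at least one series for a given pod:

    # Count the remaining series per metric name for the pod in question
    count by (__name__) ({pod="my_pod"})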

Anything else we need to know?:

Environment:

  • kube-state-metrics version: 2.2.0 (though we were experiencing this on 1.5.0 as well)
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:18:22Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.12", GitCommit:"e2a822d9f3c2fdb5c9bfbe64313cf9f657f0a725", GitTreeState:"clean", BuildDate:"2020-05-06T05:09:48Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: self-hosted k8s on aws
  • Other info:
      /kube-state-metrics
      --port=9102
      --telemetry-port=8081
      --resources=configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,jobs,limitranges,namespaces,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets
      --use-apiserver-cache
      --metric-labels-allowlist=daemonsets=[*],deployments=[*],jobs=[*],nodes=[*],pods=[*],secrets=[*]
      --pod=$(POD_NAME)
      --pod-namespace=$(POD_NAMESPACE)

jpdstan avatar Sep 02 '21 01:09 jpdstan

This could be related to https://github.com/kubernetes/kube-state-metrics/issues/694

fpetkovski avatar Sep 02 '21 06:09 fpetkovski

Have you checked via kubectl that the pods in this state are actually deleted, and not in some non-running state such as Completed or Evicted?

fredr avatar Oct 11 '21 07:10 fredr

@fredr Yes, they are definitely deleted.

jpdstan avatar Oct 11 '21 19:10 jpdstan

Same thing happening to me on EKS

irl-segfault avatar Dec 22 '21 22:12 irl-segfault

Seeing another instance of this. These two series existed at the same time for the pod named taskmanager-0; the IP addresses differ because one belongs to an old kube-state-metrics instance and the other to the current one.

kube_pod_labels{
 host="1.1.147.202"
 instance="1.1.147.202:9102"
 job="kubernetes-pods-k8s-production"
 kubernetes_namespace="kube-system"
 kubernetes_pod_name="kube-state-metrics-4"
 pod="taskmanager-0"
 ...
}

kube_pod_labels{
 host="1.1.188.37"
 instance="1.1.188.37:9102"
 job="kubernetes-pods-k8s-production"
 kubernetes_namespace="kube-system"
 kubernetes_pod_name="kube-state-metrics-8"
 pod="taskmanager-0"
 ...
}
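A query along these lines (a rough sketch, assuming the default namespace/pod labels on kube_pod_labels) surfaces pods that are currently being exported by more than one kube-state-metrics instance:

    # Pods whose kube_pod_labels is exported by more than one scrape target
    count by (namespace, pod) (kube_pod_labels) > 1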

jpdstan avatar Feb 01 '22 01:02 jpdstan

Happens to me with kube_pod_container_resource_requests and "Terminated" pods (not yet removed by the terminated-pod garbage collector). KSM version: kube-state-metrics/kube-state-metrics:v2.4.1. I would expect kube_pod_container_resource_requests not to return terminated pods (or at least to label them correctly so I can filter them out).

boniek83 avatar Mar 31 '22 04:03 boniek83

This case is expected since KSM exposes everything from the apiserver. If you are not interested in terminated pods, you can drop the series using relabeling.
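Alternatively, you can filter them out at query time by joining against kube_pod_status_phase. A sketch, assuming the default namespace/pod label names:

    # Keep resource requests only for pods that are currently Pending or Running
    kube_pod_container_resource_requests
      and on (namespace, pod)
      (kube_pod_status_phase{phase=~"Pending|Running"} == 1)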

fpetkovski avatar Mar 31 '22 06:03 fpetkovski

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 29 '22 07:06 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jul 29 '22 07:07 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot avatar Aug 28 '22 08:08 k8s-triage-robot

@k8s-triage-robot: Closing this issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Aug 28 '22 08:08 k8s-ci-robot