
deleted pods still reporting metrics

Open jpdstan opened this issue 3 years ago • 9 comments

What happened:

It seems that sometimes metrics don't get deleted alongside the pod. It isn't until we churn all the kube-state-metrics pods that the stale series finally go away.

What's even stranger is that not all metrics for the deleted pod linger; for example, for one pod that was deleted, we noticed that it was still reporting kube_pod_container_status_waiting_reason, but not kube_pod_container_resource_requests.

What you expected to happen:

When a pod gets deleted, all metrics associated with that pod should also be deleted.

How to reproduce it (as minimally and precisely as possible):

It's unclear how this happens: whenever we try to reproduce it by manually deleting a pod and querying for all of its metrics ({pod="my_pod"}), everything works as expected, i.e. the metrics all disappear.
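For reference, a query along these lines (my_pod is just a placeholder name; grouping by __name__ should work on any reasonably recent Prometheus) lists which metric names still have at least one series for a given pod:

    # Count the remaining series per metric name for the pod in question
    count by (__name__) ({pod="my_pod"})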

Anything else we need to know?:

Environment:

  • kube-state-metrics version: 2.2.0 (though we were experiencing this on 1.5.0 as well)
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:18:22Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.12", GitCommit:"e2a822d9f3c2fdb5c9bfbe64313cf9f657f0a725", GitTreeState:"clean", BuildDate:"2020-05-06T05:09:48Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: self-hosted k8s on aws
  • Other info:
      /kube-state-metrics
      --port=9102
      --telemetry-port=8081
      --resources=configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,jobs,limitranges,namespaces,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets
      --use-apiserver-cache
      --metric-labels-allowlist=daemonsets=[*],deployments=[*],jobs=[*],nodes=[*],pods=[*],secrets=[*]
      --pod=$(POD_NAME)
      --pod-namespace=$(POD_NAMESPACE)

jpdstan avatar Sep 02 '21 01:09 jpdstan

This could be related to https://github.com/kubernetes/kube-state-metrics/issues/694

fpetkovski avatar Sep 02 '21 06:09 fpetkovski

Have you checked via kubectl that the pods in this state are actually deleted, and not in some non-running state such as Completed or Evicted?

fredr avatar Oct 11 '21 07:10 fredr

@fredr Yes, they are definitely deleted.

jpdstan avatar Oct 11 '21 19:10 jpdstan

Same thing happening to me on EKS

irl-segfault avatar Dec 22 '21 22:12 irl-segfault

Seeing another instance of this. These two series existed at the same time for the pod named taskmanager-0; the IP addresses differ because one belongs to an old kube-state-metrics instance and the other to the current one.

kube_pod_labels{
 host="1.1.147.202"
 instance="1.1.147.202:9102"
 job="kubernetes-pods-k8s-production"
 kubernetes_namespace="kube-system"
 kubernetes_pod_name="kube-state-metrics-4"
 pod="taskmanager-0"
 ...
}

kube_pod_labels{
 host="1.1.188.37"
 instance="1.1.188.37:9102"
 job="kubernetes-pods-k8s-production"
 kubernetes_namespace="kube-system"
 kubernetes_pod_name="kube-state-metrics-8"
 pod="taskmanager-0"
 ...
}
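A query along these lines (a rough sketch, assuming the default namespace/pod labels on kube_pod_labels) surfaces pods that are currently being exported by more than one kube-state-metrics instance:

    # Pods whose kube_pod_labels is exported by more than one scrape target
    count by (namespace, pod) (kube_pod_labels) > 1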

jpdstan avatar Feb 01 '22 01:02 jpdstan

Happens to me with kube_pod_container_resource_requests and "Terminated" pods (not yet removed by the terminated-pod garbage collector). KSM version: kube-state-metrics/kube-state-metrics:v2.4.1. I would expect kube_pod_container_resource_requests not to return terminated pods (or at least to label them correctly so I can filter them out).

boniek83 avatar Mar 31 '22 04:03 boniek83

This case is expected since KSM exposes everything from the apiserver. If you are not interested in terminated pods, you can drop the series using relabeling.
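Alternatively, you can filter them out at query time by joining against kube_pod_status_phase. A sketch, assuming the default namespace/pod label names:

    # Keep resource requests only for pods that are currently Pending or Running
    kube_pod_container_resource_requests
      and on (namespace, pod)
      (kube_pod_status_phase{phase=~"Pending|Running"} == 1)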

fpetkovski avatar Mar 31 '22 06:03 fpetkovski

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 29 '22 07:06 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jul 29 '22 07:07 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot avatar Aug 28 '22 08:08 k8s-triage-robot

@k8s-triage-robot: Closing this issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Aug 28 '22 08:08 k8s-ci-robot