
kube_pod_completion_time is not returned for some pods

Open jicki opened this issue 1 year ago • 7 comments

What happened:

kube_pod_completion_time is not returned for some pods

What you expected to happen:

Specifically, I want to get the kube_pod_completion_time metric.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

The kube_pod_start_time and kube_pod_created metrics do return data correctly.

The Prometheus scrape interval for kube-state-metrics is 5s.

Environment:

  • kube-state-metrics version: 2.6.0
  • Kubernetes version (use kubectl version): 1.22.2
  • Cloud provider or hardware configuration:
  • Other info:


jicki • Jul 17 '23

/assign @dgrisonnet
/triage accepted

dashpole • Jul 27 '23

Could it perhaps be because these pods are running and as such are not completed yet?

dgrisonnet • Jul 28 '23

Could it perhaps be because these pods are running and as such are not completed yet?

In my test I created a pod, then deleted it and waited for the deletion to complete before querying again.

jicki • Jul 31 '23

We hit the same issue. I think this is the reason: the pod completes too quickly for Prometheus to scrape the metric.

In our case, the Prometheus scrape interval is set to 30s. If the pod completes and is deleted within that interval, its metrics are never scraped.

Our workaround: delay deleting the metrics by 60 seconds.

// Delete deletes an existing entry in the MetricsStore, but only after a
// 60-second delay so that slower scrapers still see the final samples.
func (s *MetricsStore) Delete(obj interface{}) error {
	o, err := meta.Accessor(obj)
	if err != nil {
		return err
	}
	go func(uid types.UID) {
		time.Sleep(60 * time.Second)
		s.mutex.Lock()
		defer s.mutex.Unlock()

		delete(s.metrics, uid)
	}(o.GetUID())

	return nil
}

N.B. if many pods are deleted at once, this spawns one goroutine per pod.
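
If it helps, below is a minimal sketch of a variant that avoids spawning a goroutine per pod: a single background loop drains a queue of UIDs whose delay has expired. This is not kube-state-metrics code; the delayedDeleter type and the deleteFn callback (which would remove the UID from MetricsStore.metrics under its mutex) are illustrative names only.

package store

import (
	"sync"
	"time"

	"k8s.io/apimachinery/pkg/types"
)

// expiringUID records when a UID becomes eligible for deletion.
type expiringUID struct {
	uid      types.UID
	expireAt time.Time
}

// delayedDeleter batches delayed deletions behind one background goroutine.
type delayedDeleter struct {
	mu       sync.Mutex
	queue    []expiringUID
	delay    time.Duration
	deleteFn func(types.UID) // e.g. removes the UID from the metrics map
}

func newDelayedDeleter(delay time.Duration, deleteFn func(types.UID)) *delayedDeleter {
	d := &delayedDeleter{delay: delay, deleteFn: deleteFn}
	go d.loop()
	return d
}

// Enqueue schedules a UID for deletion after the configured delay.
func (d *delayedDeleter) Enqueue(uid types.UID) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.queue = append(d.queue, expiringUID{uid: uid, expireAt: time.Now().Add(d.delay)})
}

// loop wakes up once per second and deletes every entry whose delay has passed.
func (d *delayedDeleter) loop() {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for now := range ticker.C {
		d.mu.Lock()
		remaining := d.queue[:0]
		for _, e := range d.queue {
			if now.After(e.expireAt) {
				d.deleteFn(e.uid)
			} else {
				remaining = append(remaining, e)
			}
		}
		d.queue = remaining
		d.mu.Unlock()
	}
}

With something like this, MetricsStore.Delete would only call Enqueue(o.GetUID()) instead of starting its own goroutine.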

umialpha • Jan 19 '24

I have the same issue. In my case I would like to know how long each pod ran, so that I can then calculate the max, min and mean. I'm trying to use a query like kube_pod_completion_time - kube_pod_created, but it doesn't work because kube_pod_completion_time only has data points for a single pod that ended with the status Completed.

I suspect this metric only works for pods that end with the status Completed and not for the ones that end in Terminating, or perhaps those pods are not around long enough for kube-state-metrics to expose a data point (and for Prometheus to scrape it) while they are in the Terminating status.

My scrape interval is 15s.

Can you please give support on this?
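
As a side note on the duration calculation itself, here is a minimal Go sketch that runs the per-pod difference plus max/min/avg aggregations. It uses the Prometheus client_golang API client, and the Prometheus address is a placeholder, not anything from kube-state-metrics. It only returns data for pods that still expose kube_pod_completion_time at scrape time, so the limitation discussed in this issue still applies to short-lived pods.

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Placeholder address; adjust for your cluster.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.monitoring:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Per-pod run time and simple aggregations over it. Samples only exist
	// for pods whose completion time was actually scraped before deletion.
	queries := []string{
		"kube_pod_completion_time - kube_pod_created",
		"max(kube_pod_completion_time - kube_pod_created)",
		"min(kube_pod_completion_time - kube_pod_created)",
		"avg(kube_pod_completion_time - kube_pod_created)",
	}
	for _, q := range queries {
		result, warnings, err := promAPI.Query(ctx, q, time.Now())
		if err != nil {
			log.Fatal(err)
		}
		if len(warnings) > 0 {
			log.Println("warnings:", warnings)
		}
		fmt.Printf("%s => %v\n", q, result)
	}
}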

danielserrao • Jan 31 '24

The same issue happened to us: we can't get the kube_pod_completion_time or kube_pod_created metrics for very short-lived pods (jobs run through Argo Workflow).

fredsig • Feb 26 '24

We almost always get pod completion times for pods launched by Jobs. The Jobs have ttlSecondsAfterFinished set to roughly a day, and the pods linger in Completed status for at least several hours.

However, in other cases, such as with Deployments, the ReplicaSet controller immediately removes completed or terminated pods, so they are not present long enough for KSM to register the pod completion time and/or for Prometheus to record it (not sure which).

For a strictly correct solution, I have the feeling that Deployments/ReplicaSets and other controllers would need some configuration that lets finished pods linger in the Completed state for a certain period. I'm not sure whether finalizers or anything else could be used as a workaround.
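
For reference, the Job-side knob mentioned above looks roughly like the sketch below, written with the Kubernetes Go API types; the names and values are illustrative. A one-day TTL keeps the finished Job and its pod visible long enough for KSM to expose, and Prometheus to scrape, the completion time. Deployments/ReplicaSets have no equivalent setting, which is the gap described in the comment above.

package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Keep the finished Job and its pod around for ~a day so the Completed
	// pod is still present when kube-state-metrics is scraped.
	ttl := int32(86400)
	job := batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "example-job"},
		Spec: batchv1.JobSpec{
			TTLSecondsAfterFinished: &ttl,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{
						{Name: "main", Image: "busybox", Command: []string{"true"}},
					},
				},
			},
		},
	}
	fmt.Println("ttlSecondsAfterFinished:", *job.Spec.TTLSecondsAfterFinished)
}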

rptaylor • Apr 29 '24