kube-state-metrics

Support kube_pod_ready_time metric

Open sgrzemski opened this issue 3 years ago • 17 comments

What would you like to be added:

Hello kube-state-metrics team! I am a happy user of your metrics software. I would like kube-state-metrics to also report the time at which a pod became ready (i.e. started passing its readiness probe). According to the docs, there are already a couple of gauges in seconds: kube_pod_start_time, kube_pod_container_state_started, etc.

Why is this needed:

I would like to be able to measure the time needed for a container to become fully operational and healthy. I already have the created and start timestamps, so a simple delta query in Prometheus would do the trick if a metric reporting the ready time were implemented.

Describe the solution you'd like

Query the Kubernetes API to get the ready timestamp.

Additional context

sgrzemski avatar Apr 26 '21 06:04 sgrzemski

Hey 👋 Can you explain where this API is at? If k8s API reports this, it sounds good to me.

Note that ContainerState is the only thing that reports StartedAt. I haven't looked into whether StartTime can be used somehow.

lilic avatar Apr 29 '21 13:04 lilic

Pardon my delay, I was off for some time. I took a look at the code and it looks like kube-state-metrics uses Pod objects, Pod.Status.StartTime specifically, to create the kube_pod_start_time metric. However, according to Pod Lifecycle docs, PodStatus should have an array of PodConditions, containing the following information:

  • PodScheduled: the Pod has been scheduled to a node.
  • ContainersReady: all containers in the Pod are ready.
  • Initialized: all init containers have completed successfully.
  • Ready: the Pod is able to serve requests and should be added to the load balancing pools of all matching Services.

Those come with two useful properties called:

  • lastProbeTime: Timestamp of when the Pod condition was last probed.
  • lastTransitionTime: Timestamp for when the Pod last transitioned from one status to another.

This information should be enough to form a metric called kube_pod_ready_time and, with a simple PromQL query, get the time needed for the pod to become ready.
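As a rough illustration of the idea, the Python sketch below reads the Ready condition's lastTransitionTime from a Pod status and subtracts status.startTime. The pod dictionary is made-up sample data shaped like the JSON the Kubernetes API returns, and the helper names (parse_k8s_time, condition_transition_time, seconds_until_ready) are hypothetical. It is not how kube-state-metrics is implemented.

```python
from datetime import datetime, timezone


def parse_k8s_time(value):
    """Parse a timestamp such as '2021-05-14T09:00:00Z' into an aware datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def condition_transition_time(pod, condition_type):
    """Return lastTransitionTime of a condition whose status is "True", else None."""
    for cond in pod.get("status", {}).get("conditions", []):
        if cond.get("type") == condition_type and cond.get("status") == "True":
            return parse_k8s_time(cond["lastTransitionTime"])
    return None


def seconds_until_ready(pod):
    """Seconds between status.startTime and the last transition to Ready."""
    ready_at = condition_transition_time(pod, "Ready")
    start = pod.get("status", {}).get("startTime")
    if ready_at is None or start is None:
        return None
    return (ready_at - parse_k8s_time(start)).total_seconds()


# Made-up sample data, not taken from a real cluster.
sample_pod = {
    "status": {
        "startTime": "2021-05-14T09:00:00Z",
        "conditions": [
            {"type": "PodScheduled", "status": "True",
             "lastTransitionTime": "2021-05-14T09:00:00Z"},
            {"type": "Ready", "status": "True",
             "lastTransitionTime": "2021-05-14T09:00:42Z"},
        ],
    }
}

print(seconds_until_ready(sample_pod))  # 42.0
```

With a dedicated ready-time gauge, the same subtraction could be done in PromQL against kube_pod_start_time instead of in client code.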

sgrzemski avatar May 14 '21 09:05 sgrzemski

I've patched the v1.9.8 release with some additional code to report both the ContainersReady and Ready timestamps. Note that the transition between states can happen multiple times (e.g. when a pod stops passing readiness probes). Quoting the Pod Lifecycle docs:

Pod is evaluated to be ready only when both the following statements apply:

  • All containers in the Pod are ready.
  • All conditions specified in readinessGates are True.

I will change those metrics to report the latest timestamp and to match your current naming convention (kube_pod_status_ready_time and kube_pod_status_containers_ready_time), and then prepare a PR.
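To make the proposal concrete, here is a minimal Python sketch of what such gauges could look like in the Prometheus text exposition format. The metric names are the ones proposed above; the namespace and pod labels, the helper names, and the sample data are assumptions for illustration and do not reflect the actual patch. Because a condition only stores its last transition, the exported value is naturally the latest timestamp.

```python
from datetime import datetime, timezone

# Condition type -> proposed metric name (names taken from the comment above).
METRIC_BY_CONDITION = {
    "Ready": "kube_pod_status_ready_time",
    "ContainersReady": "kube_pod_status_containers_ready_time",
}


def to_unix_seconds(value):
    """Convert a timestamp such as '2021-05-17T07:05:00Z' to Unix seconds."""
    dt = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def render_ready_time_metrics(namespace, pod_name, conditions):
    """Render one gauge line per matching condition whose status is "True"."""
    lines = []
    for cond in conditions:
        metric = METRIC_BY_CONDITION.get(cond["type"])
        if metric is None or cond["status"] != "True":
            continue  # skip other conditions and pods that are not ready
        ts = to_unix_seconds(cond["lastTransitionTime"])
        lines.append(f'{metric}{{namespace="{namespace}",pod="{pod_name}"}} {ts}')
    return lines


# Made-up sample conditions for a hypothetical pod "web-0".
conditions = [
    {"type": "ContainersReady", "status": "True",
     "lastTransitionTime": "2021-05-17T07:05:00Z"},
    {"type": "Ready", "status": "True",
     "lastTransitionTime": "2021-05-17T07:05:03Z"},
]

for line in render_ready_time_metrics("default", "web-0", conditions):
    print(line)
```

Emitting nothing while the condition is not True is one possible design choice; it avoids reporting a "ready time" that is really the moment the pod became unready.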

sgrzemski avatar May 17 '21 07:05 sgrzemski

I'm pretty sure the kubelet exposes metrics about the readiness probes. I think it's the kubelet's responsibility to expose this.

brancz avatar Jun 07 '21 13:06 brancz

> I'm pretty sure the kubelet exposes metrics about the readiness probes. I think it's the kubelet's responsibility to expose this.

I am running 1.19+ and I am seeing kubelet_pod_start_duration_seconds_bucket, _sum and _count in Prometheus, but they are node level, not per specific pods.

szymon-grzemski avatar Jun 14 '21 12:06 szymon-grzemski

@lilic @brancz Any plans to merge this? Looking forward to using this metric.

slamdev avatar Aug 10 '21 06:08 slamdev

I was looking for something just like this! Was there something blocking this from getting merged into kube state metrics 2?

kevinwubert avatar Oct 15 '21 22:10 kevinwubert

It looks like the PR has gone stale. Would you be interested in wrapping up the work?

fpetkovski avatar Oct 18 '21 11:10 fpetkovski

Would love to! I will update in the next couple of days.

sgrzemski avatar Oct 18 '21 12:10 sgrzemski

👍 This is useful for calculating total pod start time accurately, instead of trying to infer it from the ready count or something similar. In my particular case, I'm trying to benchmark the effect of some Istio sidecar settings on startup time. Any updates, @sgrzemski?

SpectralHiss avatar Dec 15 '21 11:12 SpectralHiss

Looking forward to testing out this feature. @sgrzemski Are we stuck somewhere?

My team has a similar use case where we're trying to figure out the time it takes a pod to be scheduled to a particular node. We can get hold of the time when the pod transitioned to PodScheduled and report that as a metric.
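A minimal Python sketch of that calculation, assuming the scheduling latency is measured from metadata.creationTimestamp to the PodScheduled condition's lastTransitionTime. The helper name and the sample pod are hypothetical.

```python
from datetime import datetime, timezone


def parse_k8s_time(value):
    """Parse a timestamp such as '2022-02-16T10:00:00Z' into an aware datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def seconds_from_creation_to(pod, condition_type):
    """Seconds from pod creation until the given condition last became "True"."""
    created = parse_k8s_time(pod["metadata"]["creationTimestamp"])
    for cond in pod["status"].get("conditions", []):
        if cond["type"] == condition_type and cond["status"] == "True":
            return (parse_k8s_time(cond["lastTransitionTime"]) - created).total_seconds()
    return None  # condition missing or not yet True


# Made-up sample data for illustration.
pending_then_scheduled = {
    "metadata": {"creationTimestamp": "2022-02-16T10:00:00Z"},
    "status": {
        "conditions": [
            {"type": "PodScheduled", "status": "True",
             "lastTransitionTime": "2022-02-16T10:00:07Z"},
        ]
    },
}

print(seconds_from_creation_to(pending_then_scheduled, "PodScheduled"))  # 7.0
```

These timestamps are serialized with one-second resolution, so very short scheduling delays would show up as 0 or 1 seconds.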

PrayagS avatar Feb 16 '22 10:02 PrayagS

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 17 '22 10:05 k8s-triage-robot

/lifecycle rotten

k8s-triage-robot avatar Jun 16 '22 11:06 k8s-triage-robot

This would be really nice to have, @sgrzemski

stat-johan avatar Jul 01 '22 10:07 stat-johan

/remove-lifecycle rotten

fpetkovski avatar Jul 01 '22 17:07 fpetkovski

/remove-lifecycle stale

fpetkovski avatar Jul 01 '22 17:07 fpetkovski

Looking forward to this PR being merged, because I find that "kube_pod_status_ready" shows the ready status with a 2s delay compared with the Ready timestamp from the pod condition. I have reported an issue, but there has been no response yet: https://github.com/kubernetes/kube-state-metrics/issues/1830

So we can't rely on "kube_pod_status_ready" if we want to calculate pod startup time. Deriving a time from sampled metrics is always imprecise, so we need a metric that returns the ready timestamp taken directly from the pod conditions.

qingguee avatar Sep 16 '22 00:09 qingguee

This would be a really great metric to have. It would help us understand the time taken for services to come up in the cluster.

sumanthkumarc avatar Sep 27 '22 12:09 sumanthkumarc

The metric would be incredibly valuable! For example, to know:

  • Seconds until pod is scheduled
  • Seconds until pod is Ready

max-rocket-internet avatar Dec 14 '22 15:12 max-rocket-internet

I'm still pretty new to Prometheus, but I'm using this query to collect an almost equivalent metric. Please let me know if you find this useful or foresee any issues with it:

sort_desc(max(sum_over_time(kube_pod_status_phase{namespace=~"$namespace", phase="Pending"}[$__range])/4) by (pod))

This returns the approximate time (in minutes, with about 30-second accuracy) that each pod has spent in the Pending state.

coleary-hyperscience avatar Dec 28 '22 20:12 coleary-hyperscience

> I'm still pretty new to Prometheus, but I'm using this query to collect an almost equivalent metric. Please let me know if you find this useful or foresee any issues with it: sort_desc(max(sum_over_time(kube_pod_status_phase{namespace=~"$namespace", phase="Pending"}[$__range])/4) by (pod)) This returns the approximate (30 sec accuracy) time (in min) for each pod in the pending state.

Could I know why the /4 is required?

vijaynidhi85 avatar Jan 10 '23 10:01 vijaynidhi85

> Could I know why the /4 is required?

For sure. It looks to me like kube_pod_status_phase reports 4 samples every minute, so dividing by 4 converts the sum of those samples into minutes.

Not a great way to do it, but seems to be working for me, at least until this other PR gets merged.
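To spell out the arithmetic, here is a small Python sketch, assuming the gauge is 1 while the pod is Pending and 0 otherwise, and that it is scraped at a fixed interval. The function name and the sample data are made up. It shows that the /4 is only correct for a 15-second scrape interval.

```python
def pending_minutes(samples, scrape_interval_seconds):
    """Estimate minutes spent Pending from 0/1 gauge samples.

    sum(samples) is what sum_over_time would return: the number of scrapes
    that saw the pod as Pending. Each scrape stands for one scrape interval,
    so dividing by the number of samples per minute yields minutes.
    """
    samples_per_minute = 60 / scrape_interval_seconds
    return sum(samples) / samples_per_minute


# Made-up series: Pending for 10 consecutive scrapes, then no longer Pending.
samples = [1] * 10 + [0] * 6

print(pending_minutes(samples, 15))  # 2.5 (same as dividing by 4)
print(pending_minutes(samples, 30))  # 5.0 (here the divisor would be 2)
```

The estimate can be off by up to one scrape interval at each end, which is one reason a timestamp taken from the pod conditions would be more precise.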

coleary-hyperscience avatar Jan 11 '23 16:01 coleary-hyperscience