kube-state-metrics
Support kube_pod_ready_time metric
What would you like to be added:
Hello kube-state-metrics team!
I am a happy user of your metrics software. I would like kube-state-metrics to also report the time when a pod became ready (i.e. started passing its readiness probe). According to the docs, there are already a couple of gauges in seconds: kube_pod_start_time, kube_pod_container_state_started, etc.
Why is this needed:
I would like to be able to measure the time needed for a container to become fully operational and healthy. I already have the created and started timestamps, so a simple delta query in Prometheus would do the trick if a metric reporting the ready time were implemented.
Describe the solution you'd like
Query the Kubernetes API to get the ready timestamp.
Additional context
Hey 👋 Can you explain where this API is? If the k8s API reports this, it sounds good to me.
Note that ContainerState is the only thing that reports StartedAt. I haven't looked into whether StartTime can be used somehow.
Pardon my delay, I was off for some time.
I took a look at the code, and it looks like kube-state-metrics uses Pod objects, Pod.Status.StartTime specifically, to create the kube_pod_start_time metric. However, according to the Pod Lifecycle docs, PodStatus should have an array of PodConditions containing the following information:
- PodScheduled: the Pod has been scheduled to a node.
- ContainersReady: all containers in the Pod are ready.
- Initialized: all init containers have started successfully.
- Ready: the Pod is able to serve requests and should be added to the load balancing pools of all matching Services.
Those come with two useful properties called:
- lastProbeTime: Timestamp of when the Pod condition was last probed.
- lastTransitionTime: Timestamp for when the Pod last transitioned from one status to another.
This information should be enough to form a metric called kube_pod_ready_time, and with a simple PromQL query get the time needed for the pod to start.
I've patched the v1.9.8 release with some additional code to report both ContainersReady and Ready timestamps. Note that the transition between states can happen multiple times (e.g. a pod stops passing its readiness probes). Quoting the Pod Lifecycle docs:
Pod is evaluated to be ready only when both the following statements apply:
- All containers in the Pod are ready.
- All conditions specified in readinessGates are True.
I will change those metrics to report the latest timestamp, match your current naming convention (kube_pod_status_ready_time and kube_pod_status_containers_ready_time), and prepare a PR.
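With metrics named along those lines, the startup-time delta mentioned earlier is a plain subtraction of two gauge values. A tiny sketch with hypothetical sample values (the metric names follow the proposal above; the numbers are made up):

```python
# Hypothetical sample values (Unix seconds) for one pod, as the metrics
# would expose them:
kube_pod_start_time = 1_620_000_000          # existing metric
kube_pod_status_ready_time = 1_620_000_042   # proposed metric

# Equivalent of the PromQL expression
#   kube_pod_status_ready_time - kube_pod_start_time
startup_seconds = kube_pod_status_ready_time - kube_pod_start_time
print(startup_seconds)  # prints 42
```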
I'm pretty sure the kubelet exposes metrics about the readiness probes. I think it's the kubelet's responsibility to expose this.
I am running 1.19+ and I am seeing kubelet_pod_start_duration_seconds_bucket, _sum and _count in Prometheus, but they are node-level, not per pod.
@lilic @brancz Any plans to merge this? Looking forward to using this metric.
I was looking for something just like this! Was something blocking this from getting merged into kube-state-metrics 2?
It looks like the PR has gone stale. Would you be interested in wrapping up the work?
Would love to! I will update in the next couple of days.
:+1: This is useful for doing an accurate total pod start time calculation, instead of trying to infer it from ready counts or something. In my particular case, I'm trying to benchmark the effect of some Istio sidecar settings on startup time. Any updates @sgrzemski?
Looking forward to testing out this feature. @sgrzemski Are we stuck somewhere?
My team has a similar use case where we're trying to figure out the time it takes a pod to be scheduled to a particular node. We can get hold of the time when the pod transitioned to PodScheduled and report that as a metric.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
This would be really nice to have, @sgrzemski
/remove-lifecycle rotten
/remove-lifecycle stale
Looking forward to this PR being merged, because I find that kube_pod_status_ready has about a 2s delay in showing ready status compared with the Ready timestamp from the pod conditions. I have reported an issue, but no response yet: https://github.com/kubernetes/kube-state-metrics/issues/1830
So we can't rely on kube_pod_status_ready. Calculating pod startup time from polled metrics is always imprecise, so we need a metric that returns the ready timestamp taken directly from the pod conditions.
This would be a really great metric to have. It would help us understand the time taken for services to come up in the cluster.
The metric would be incredibly valuable! For example, to know:
- Seconds until the pod is scheduled
- Seconds until the pod is Ready
I'm still pretty new to Prometheus, but I'm using this query to collect an almost equivalent metric. Please let me know if you find it useful or foresee any issues with it:
sort_desc(max(sum_over_time(kube_pod_status_phase{namespace=~"$namespace", phase="Pending"}[$__range])/4) by (pod))
This returns the approximate (30-second accuracy) time, in minutes, each pod spent in the Pending state.
Could I know why the /4 is required?
Sure. It looks to me like kube_pod_status_phase is sampled 4 times every minute (i.e. a 15-second scrape interval), so the /4 converts the sum of those samples into a per-minute figure.
Not a great way to do it, but it seems to be working for me, at least until the other PR gets merged.
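To make the /4 concrete, here is a minimal sketch of the arithmetic, assuming a 15-second scrape interval (the sample values are hypothetical):

```python
SCRAPE_INTERVAL_S = 15
SAMPLES_PER_MINUTE = 60 // SCRAPE_INTERVAL_S  # 4, hence the /4 in the query

# kube_pod_status_phase{phase="Pending"} is 1 while the pod is Pending,
# so a pod that was Pending for 3 minutes contributes 12 samples of 1.
samples = [1] * 12

# Equivalent of sum_over_time(...) / 4: dividing the sample sum by the
# number of samples per minute yields minutes spent Pending.
pending_minutes = sum(samples) / SAMPLES_PER_MINUTE
print(pending_minutes)  # prints 3.0
```

Note this inherits the scrape interval as its resolution, which is why the result is only approximate.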