mybinder.org-deploy

Break down Pending vs Terminating status

Open rgaiacs opened this issue 1 year ago • 1 comments

When I look at Grafana, I see

Screenshot 2024-08-02 at 16-14-01 Pod Activity - Dashbo

I checked

sum(label_replace(kube_pod_status_phase{phase="Pending",pod=~"jupyter-.*"}, "repo", "$1", "pod", "jupyter-(.+)-[^-]+")) by (repo)

on Prometheus and I got

Screenshot 2024-08-02 at 16-16-40 Prometheus Time Serie
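
For readers unfamiliar with the query above, here is a minimal Python sketch of what the label_replace pattern extracts and how the per-repo sum groups pods. It assumes Prometheus' fully anchored regex matching (mirrored with re.fullmatch) and uses made-up pod names; repo_from_pod and pending_by_repo are hypothetical helpers, not part of any dashboard or exporter.

```python
import re
from collections import Counter

# Same pattern as in the label_replace call. Prometheus anchors regexes at
# both ends, which re.fullmatch mirrors here.
POD_PATTERN = re.compile(r"jupyter-(.+)-[^-]+")


def repo_from_pod(pod_name):
    """Return the captured repo part of a pod name.

    label_replace leaves a series unchanged when the pattern does not match,
    so such pods end up with an empty "repo" label; "" stands in for that.
    """
    match = POD_PATTERN.fullmatch(pod_name)
    return match.group(1) if match else ""


def pending_by_repo(pending_pod_names):
    """Count pending pods per repo, roughly like sum(...) by (repo)."""
    return Counter(repo_from_pod(name) for name in pending_pod_names)


if __name__ == "__main__":
    # Hypothetical pod names, for illustration only.
    pods = ["jupyter-a-b-x1", "jupyter-a-b-y2", "jupyter-c-z3"]
    for repo, count in pending_by_repo(pods).items():
        print(repo, count)
```

The greedy (.+) captures everything up to the last hyphen, so only the final suffix of the pod name is dropped.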

Based on what Prometheus reports, the number shown in Grafana is wrong: Grafana shows 25 pending pods, but Prometheus reports only 6.

I looked at Kubernetes, and the number of pending pods there is also only 6, matching Prometheus.

Screenshot 2024-08-02 161434

But we have 19 "Terminating" pods:

Screenshot 2024-08-02 161506

My understanding is that Grafana is merging "Pending" and "Terminating" (6 + 19 = 25). I looked at the expression used by Grafana

sum(kube_pod_status_phase{pod=~"^jupyter-.*", kubernetes_namespace!="jhub-ns"}) by (phase)

This expression looks good to me, which suggests that the problem is in how the metric is exported.

In https://github.com/kubernetes/kube-state-metrics/pull/1013, someone said

I have a pod in status Terminating but whith kube-state-metrics:v2.7.0 can not see kube_pod_status_phase{phase="Terminating"}

@sgibson91 and @manics can you help me to have the pod "Terminating" state exported? Thanks!

rgaiacs avatar Aug 02 '24 14:08 rgaiacs

https://github.com/kubernetes/kube-state-metrics/blob/b1c2e0c1cf897202fa10da7b622e883df8a7a66e/docs/metrics/workload/pod-metrics.md#useful-metrics-queries suggests

count(kube_pod_deletion_timestamp) by (namespace, pod) * count(kube_pod_status_reason{reason="NodeLost"} == 0) by (namespace, pod)

manics avatar Aug 05 '24 20:08 manics
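
As a rough illustration of the breakdown requested in this issue, the logic behind the suggested query can be sketched in Python: a pod counts as "Terminating" when it has a deletion timestamp and its reason is not "NodeLost", and is otherwise counted under its phase. The Pod class and status_breakdown function are hypothetical, the data is invented to mirror the 6 and 19 reported above, and whether this reproduces what the dashboard sees has not been checked here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pod:
    """Hypothetical, simplified view of a pod (not a Kubernetes client type)."""
    name: str
    phase: str                                  # e.g. "Pending", "Running"
    deletion_timestamp: Optional[float] = None  # set once deletion was requested
    reason: Optional[str] = None                # e.g. "NodeLost"


def status_breakdown(pods):
    """Count pods per status, reporting terminating pods separately.

    Mirrors the idea of the suggested query: a deletion timestamp is present
    and the pod's reason is not "NodeLost".
    """
    counts = Counter()
    for pod in pods:
        if pod.deletion_timestamp is not None and pod.reason != "NodeLost":
            counts["Terminating"] += 1
        else:
            counts[pod.phase] += 1
    return counts


if __name__ == "__main__":
    # Invented data: 6 pending pods plus 19 pods that are being deleted.
    pods = [Pod(f"jupyter-a-{i}", "Pending") for i in range(6)]
    pods += [
        Pod(f"jupyter-b-{i}", "Running", deletion_timestamp=1722600000.0)
        for i in range(19)
    ]
    print(dict(status_breakdown(pods)))
```

In this sketch, a pod with a deletion timestamp is counted only as "Terminating" and not under its phase, which is one possible way to keep the two numbers apart on a dashboard.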