awesome-prometheus-alerts
Alert KubernetesPodNotHealthy reporting incorrect alerts
As far as I understand, the following alert fires for any Pod that has been in a "Pending|Unknown|Failed" state for longer than the default resolution at some point in the last hour. At least, that is how the alert is firing for me. The alert description says something else: the Pod should be down for longer than an hour.
- alert: KubernetesPodNotHealthy
  expr: min_over_time(sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h:]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
    description: "Pod has been in a non-ready state for longer than an hour.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
I'm no expert on PromQL but maybe the range/resolution has to be changed like this: [1h:1h]?
Ok, this is weird.
I'll write a new query using [1h:1m].
Thanks for your feedback @mastaab!
I don't think this is firing right now. Basically, it now works such that it fires if the pod is down/pending/whatever for more than 1 minute. Should it be [15m:1m] and >= 15?
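One way to read the `[15m:1m]` and `>= 15` suggestion is to count how many of the one-minute subquery points in the last 15 minutes show the Pod in a bad phase. The rule file below is only a sketch under that assumption and has not been run against a real cluster. The group name `kubernetes-pods` and the `namespace`/`pod` labels in the annotations are choices made for this example, not part of the original rule, and the exact number of points in a 15-minute subquery window can vary with step alignment and Prometheus version.

```yaml
groups:
  - name: kubernetes-pods   # hypothetical group name, chosen for this sketch
    rules:
      - alert: KubernetesPodNotHealthy
        # The inner "> 0" filter drops every subquery point where the Pod is
        # not in one of the listed phases, so count_over_time only counts
        # the unhealthy points. A Pod with fewer than 15 such points in the
        # window (for example, one created a few minutes ago) does not fire.
        expr: |
          count_over_time(
            (sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0)[15m:1m]
          ) >= 15
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Kubernetes Pod not healthy ({{ $labels.namespace }}/{{ $labels.pod }})"
          description: "Pod has been in a non-ready state for at least 15 minutes.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
```

A file like this can be syntax-checked with `promtool check rules <file>` before loading it into Prometheus.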
Not a PromQL expert at all, but what about the following?
- alert: KubernetesPodNotHealthy
  expr: kube_pod_status_phase{phase=~"Pending|Unknown|Failed"} > 0
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
    description: "Pod has been in a non-ready state for longer than an hour.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
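A possible variant of the `for: 1h` idea keeps the `sum by (namespace, pod)` aggregation from the original rule. Without it, the alert series carries the `phase` label, so a Pod that moves from Pending to Failed becomes a new series and the `for` timer starts over. The following is a minimal sketch, not a tested rule. The group name is hypothetical, and the annotations use `namespace` and `pod` because the aggregation removes the `instance` label.

```yaml
groups:
  - name: kubernetes-pods   # hypothetical group name, chosen for this sketch
    rules:
      - alert: KubernetesPodNotHealthy
        # Aggregating away the phase label gives one series per Pod, so the
        # pending alert survives a switch between the listed phases.
        expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0
        # The condition must hold at every rule evaluation for one hour
        # before the alert fires.
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "Kubernetes Pod not healthy ({{ $labels.namespace }}/{{ $labels.pod }})"
          description: "Pod has been in a non-ready state for longer than an hour.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
```

With `for: 1h`, a Pod that is unhealthy for only a few seconds leaves the pending state at the next evaluation and never fires, which matches the alert description.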
Reminder: this has not yet been changed on the main website, and the query truly doesn't do what it is intended to do. A few seconds of unavailability suffice to fire that alert.
I have no Kube cluster running on my side. Can you write a PR @gna582 with a better query please?