awesome-prometheus-alerts

Alert KubernetesPodNotHealthy reporting incorrect alerts

Open mastaab opened this issue 4 years ago • 5 comments

As I understand it, the following alert fires for any Pod that is in a "Pending|Unknown|Failed" state for longer than the default resolution at any point in the last hour. At least that's how the alert is firing for me. The alert description says something else: the Pod should be down for longer than an hour.

  - alert: KubernetesPodNotHealthy
    expr: min_over_time(sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h:]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
      description: "Pod has been in a non-ready state for longer than an hour.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

I'm no expert on PromQL but maybe the range/resolution has to be changed like this: [1h:1h]?

mastaab • Apr 20 '21 08:04

Ok, this is weird.

I'll write a new query using [1h:1m].

Thanks for your feedback @mastaab!

samber • May 01 '21 18:05

I don't think this is firing correctly right now. As it works now, it fires if the pod is down/pending/whatever for more than 1 minute. Should it be [15m:1m] and >= 15?
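One way to read that suggestion, as a rough sketch only: `min_over_time` can never reach 15 here, so this assumes the intent was to add up the samples instead, using `sum_over_time` over 15 one-minute steps. It also assumes the inner `sum by (namespace, pod)` is either 0 or 1 (a single kube-state-metrics instance); the group name is made up for illustration.

```yaml
groups:
  - name: kubernetes-pods  # hypothetical group name
    rules:
      - alert: KubernetesPodNotHealthy
        # Assumption: sum_over_time replaces min_over_time from the original rule.
        # The subquery yields one sample per minute over the last 15 minutes, so the
        # sum only reaches 15 if the pod was unhealthy at every one of those steps.
        expr: sum_over_time(sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[15m:1m]) >= 15
        for: 0m
        labels:
          severity: critical
```

Unlike `min_over_time`, which only looks at the samples that exist, this should not fire for a pod that has been around for less than 15 minutes, because fewer than 15 samples cannot add up to 15.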

snowzach • Jun 04 '21 18:06

Not a PromQL expert at all, but what about the following?

  - alert: KubernetesPodNotHealthy
    expr: kube_pod_status_phase{phase=~"Pending|Unknown|Failed"} > 0
    for: 1h
    labels:
      severity: critical
    annotations:
      summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
      description: "Pod has been in a non-ready state for longer than an hour.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
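A possible variant of the same idea, untested and only a sketch: it keeps the `sum by (namespace, pod)` aggregation from the original alert, so a pod that moves from, say, Pending to Failed stays a single series and does not restart the `for:` timer. Since the aggregation drops the `instance` label, the summary uses `namespace` and `pod` instead. The group name is hypothetical.

```yaml
groups:
  - name: kubernetes-pods  # hypothetical group name
    rules:
      - alert: KubernetesPodNotHealthy
        # One series per pod; greater than 0 while the pod is in any of the three phases.
        expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0
        # The expression has to hold at every rule evaluation for a full hour before the alert fires.
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: Kubernetes Pod not healthy ({{ $labels.namespace }}/{{ $labels.pod }})
          description: "Pod has been in a non-ready state for longer than an hour.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
```

With `for: 1h`, a pod that is only briefly Pending during startup goes to the pending alert state and drops out again without ever firing.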

liorfranko • Jun 23 '21 19:06

Reminder: this has not yet been changed on the main website, and the query truly doesn't do what it is intended to do. A few seconds of unavailability suffice to fire that alert.

benedikt-haug • Oct 14 '21 08:10

I have no Kube cluster running on my side. Can you write a PR @gna582 with a better query please?

samber • Nov 01 '21 09:11