Alert for nodes in NotReady state

Open sj-williams opened this issue 4 months ago • 0 comments

Background

We observed a worker node in the manager cluster "stuck" in NotReady state for an extended period of time (over 30 minutes at the time of discovery).

Cordoning and draining the node didn't resolve it; instead we were stuck with all the pods in Terminating state. This was only resolved when we began force deleting the pods one by one, and upon kicking an ingress controller pod, the node state returned to normal again.

We might want to consider having an alert that raises this in Slack if a node is unexpectedly in this state for a certain length of time (the time threshold is there so the alert doesn't fire in normal situations where a node is only briefly NotReady).
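
As a rough illustration of the "NotReady for a certain length of time" check, here is a minimal Python sketch. It assumes node objects shaped like the `items` returned by `kubectl get nodes -o json`; the function names and the 15 minute default threshold are hypothetical and only for illustration, not an agreed design. Posting the result to Slack is left out.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(ts):
    # Kubernetes condition timestamps look like "2024-10-25T15:30:00Z" (UTC).
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def stuck_not_ready(nodes, now, threshold=timedelta(minutes=15)):
    """Return (node name, duration) for nodes whose Ready condition has not
    been "True" for at least `threshold`. The threshold value is an assumption."""
    stuck = []
    for node in nodes:
        for cond in node.get("status", {}).get("conditions", []):
            if cond.get("type") != "Ready":
                continue
            # Ready status is "True", "False" or "Unknown"; anything other
            # than "True" is reported as NotReady.
            if cond.get("status") != "True":
                duration = now - parse_k8s_time(cond["lastTransitionTime"])
                if duration >= threshold:
                    stuck.append((node["metadata"]["name"], duration))
            break  # only one Ready condition per node
    return stuck
```

A node that has only just gone NotReady (for example during a normal recycle) stays below the threshold and is not reported, which is the point of alerting on duration rather than on the state alone.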

sj-williams, Oct 25 '24 16:10