cloud-platform
Alert for nodes in NotReady state
Background
We observed a worker node in the manager cluster "stuck" in NotReady
state for an extended period (over 30 minutes at the time of discovery).
Cordoning and draining the node did not resolve it; instead, all of its pods were left in Terminating
state. The problem was only resolved when we began force-deleting the pods one by one: once an ingress controller pod had been force-deleted, the node returned to normal.
We might want to consider an alert that posts to Slack when a node has unexpectedly been in NotReady state for longer than a set length of time. The time threshold is there so that brief, expected NotReady periods do not trigger the alert.
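
As a rough illustration of the check such an alert would perform, here is a minimal Python sketch. It assumes the input is the parsed output of `kubectl get nodes -o json` and relies on the `Ready` condition and its `lastTransitionTime` in each node's status. The function names and the 30 minute default threshold are hypothetical, chosen only for illustration; posting to Slack is left out, and this is not necessarily how the alert would be implemented in practice.

```python
from datetime import datetime, timedelta, timezone


def not_ready_duration(node, now):
    """Return how long the node's Ready condition has been non-True.

    Returns None if the node is Ready or reports no Ready condition.
    """
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") != "Ready":
            continue
        if cond.get("status") == "True":
            return None
        # Status is "False" or "Unknown"; timestamps look like 2024-01-01T11:00:00Z.
        since = datetime.strptime(
            cond["lastTransitionTime"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        return now - since
    return None


def nodes_to_alert(node_list, now, threshold=timedelta(minutes=30)):
    """Names of nodes that have been NotReady for at least `threshold`.

    The 30 minute default is an assumption, not an agreed value.
    """
    names = []
    for node in node_list.get("items", []):
        duration = not_ready_duration(node, now)
        if duration is not None and duration >= threshold:
            names.append(node["metadata"]["name"])
    return names
```

A node that has only just gone NotReady (for example during a normal restart) falls under the threshold and is ignored, which is the point of requiring a minimum duration.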