armada Pods can become stuck forever on bad nodes

Open JamesMurkin opened this issue 3 years ago • 1 comments

Problem

Pods can go to Pending and then get stuck forever. This seems to happen only if the node is bad.

Cause

We typically detect stuck pods and retry them; see https://github.com/G-Research/armada/blob/master/config/executor/config.yaml#L43

However, if the pod:

Is assigned to a node
Has no events
Has no container statuses

then it will never meet any condition for retry.

As this only happens on bad nodes, I think the best course of action is to review the code, check what happens in the above scenario, and add a way to detect it so the pod can be handled (retried).

Jul 29 '22 16:07 JamesMurkin

Hey team! Please add your planning poker estimate with ZenHub @dejanzele @jayofdoom @kannon92 @richscott

Aug 09 '22 13:08 dave-gantenbein

@JamesMurkin How can you tell if a pod is assigned to a node?

And if the node is bad, why would we want to retry a pod?

Sep 01 '22 18:09 kannon92
