armada
Pods can become stuck forever on bad nodes
Problem
Pods can go to Pending and then get stuck forever. This seems to happen only when the node is bad.
Cause
We typically detect stuck pods and retry them, see https://github.com/G-Research/armada/blob/master/config/executor/config.yaml#L43
However, if the pod:
- Is assigned to a node
- Has no events
- Has no container statuses
Then it will never meet any condition for retry
As this only happens on bad nodes, I think the best course of action is to review the code, confirm what happens in the scenario above, and add a way to detect it so the pod can be handled (retried).
Hey team! Please add your planning poker estimate with ZenHub @dejanzele @jayofdoom @kannon92 @richscott
@JamesMurkin How can you tell if a pod is assigned to a node?
And if the node is bad, why would we want to retry a pod?