Pods can become stuck forever on bad nodes

Open JamesMurkin opened this issue 3 years ago • 1 comments

Problem

Pods can go to Pending and then get stuck forever. This seems to happen only if the node is bad.

Cause

We typically detect stuck pods and retry them (see https://github.com/G-Research/armada/blob/master/config/executor/config.yaml#L43).

However, if the pod:

  • Is assigned to a node
  • Has no events
  • Has no container statuses

then it will never meet any of the conditions for retry.

As this only happens on bad nodes, I think the best course of action is to review the code, check what happens in the above scenario, and add a way to detect it so that the pod can be handled (retried).
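To make the scenario concrete, here is a rough sketch of what such a check might look like. It is not Armada's executor code: `PodSnapshot`, `is_silently_stuck`, and the `deadline` parameter are hypothetical names made up for illustration, and the age-based deadline is an assumption about how "stuck" could be decided. The fields are meant to mirror the Kubernetes pod fields `spec.nodeName`, `status.phase`, and `status.containerStatuses`.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class PodSnapshot:
    """Hypothetical, simplified view of a pod (not a real Armada or Kubernetes type)."""
    phase: str                       # mirrors status.phase, e.g. "Pending" or "Running"
    node_name: Optional[str]         # mirrors spec.nodeName; set once the pod is scheduled
    created_at: datetime
    events: List[str] = field(default_factory=list)
    container_statuses: List[str] = field(default_factory=list)


def is_silently_stuck(pod: PodSnapshot, now: datetime, deadline: timedelta) -> bool:
    """Return True for the case described above: a Pending pod that is assigned
    to a node but has produced no events and no container statuses for longer
    than `deadline` (an assumed, configurable threshold)."""
    return (
        pod.phase == "Pending"
        and bool(pod.node_name)            # assigned to a node
        and not pod.events                 # no events recorded
        and not pod.container_statuses     # kubelet never reported container state
        and now - pod.created_at > deadline
    )
```

A pod matching this check could then be fed into the same retry path as the other stuck-pod conditions, though whether that is the right handling is exactly what needs reviewing.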

JamesMurkin avatar Jul 29 '22 16:07 JamesMurkin

Hey team! Please add your planning poker estimate with ZenHub @dejanzele @jayofdoom @kannon92 @richscott

dave-gantenbein avatar Aug 09 '22 13:08 dave-gantenbein

@JamesMurkin How can you tell if a pod is assigned to a node?

And if the node is bad, why would we want to retry a pod?

kannon92 avatar Sep 01 '22 18:09 kannon92