Handle pods in Unknown state
When nodes in a cluster misbehave, they sometimes leave pods in an Unknown state:
jupyter-user1 0/1 Unknown 0 3h 10.244.28.17 k8s-pool1-12345678-12
jupyter-user2 0/1 Unknown 1 5h 10.244.28.8 k8s-pool1-12345678-12
jupyter-user3 0/1 Unknown 1 4h 10.244.28.11 k8s-pool1-12345678-12
jupyter-user4 1/1 Unknown 0 6h <none> k8s-pool1-12345678-12
jupyter-user5 0/1 Unknown 0 3h 10.244.28.14 k8s-pool1-12345678-12
jupyter-user6 0/1 Unknown 1 3h 10.244.28.15 k8s-pool1-12345678-12
jupyter-user7 0/1 Unknown 1 4h 10.244.28.9 k8s-pool1-12345678-12
I can invoke delete on the pods but they don't actually go away. The pods are only cleared when the node has been rebooted -- stopping the node is insufficient.
Can KubeSpawner work around this? Or should it even try, given that kubernetes seems to be at fault here?
We had the same problem with the mybinder.org cluster. To remove these pods you need to delete them with --force --grace-period=0. We added a snippet to the mybinder SRE guide: http://mybinder-sre.readthedocs.io/en/latest/command_snippets.html#forcibly-delete-a-pod
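For reference, a minimal sketch of that forced delete via the official Kubernetes Python client (the pod name and namespace below are placeholders, not values from this cluster):

```python
# Sketch only: force-delete a single stuck pod, equivalent to
#   kubectl delete pod <name> -n <namespace> --force --grace-period=0
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

v1.delete_namespaced_pod(
    name="jupyter-user1",      # placeholder pod name
    namespace="jhub",          # placeholder namespace
    grace_period_seconds=0,
    body=client.V1DeleteOptions(grace_period_seconds=0),
)
```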
Not sure what kubespawner could do. I think Unknown and NodeLost happen when k8s isn't 100% sure what the state of the pod is, so having a computer/robot act on that automatically might lead to unpredictable behaviour or it doing the wrong thing?
Thanks @betatim! This is the third node we've had go into this state in the past couple of months and I think I tried that on the first occasion without success. But with any (bad) luck it will happen again soon and I'll verify whether it works or not.
Should we have an option to forcibly delete those unknown-state pods after some timeout?
I ran into this again today on PAWS and saw kubeflow/tf-operator#959. Perhaps we do want to remove these pods in an automated way when the node is lost?
It is a situation that will require admin attention (rebooting the lost node? fixing the error?) but minimizing user issues seems like a worthwhile goal.
Since user pods should not generally be automatically restarted, I think Unknown should probably be treated as a dead pod and cleaned up the same as if it had exited uncleanly.
But because this is likely an issue with the cluster, deleting a pod that is Unknown should probably come with some extra warning output about the state of the pod before deleting it, to help with diagnostics.
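To make that concrete, here is a rough sketch of what such a cleanup could look like. This is not kubespawner's actual code; it uses the Kubernetes Python client directly, and the namespace, label selector, and phase check are assumptions for illustration:

```python
# Illustrative sketch only -- not kubespawner's implementation.
# Assumes a hypothetical "jhub" namespace and the usual singleuser-server label.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

def reap_unknown_pods(namespace="jhub", label_selector="component=singleuser-server"):
    """Warn about, then force-delete, user pods stuck in the Unknown phase."""
    pods = v1.list_namespaced_pod(namespace, label_selector=label_selector)
    for pod in pods.items:
        # Note: kubectl's STATUS column is derived; checking status.phase here
        # is a simplification of how "Unknown"/"NodeLost" pods would be detected.
        if pod.status.phase != "Unknown":
            continue
        # Log the pod's last known state before deleting it, to help diagnostics.
        print(
            f"WARNING: pod {pod.metadata.name} on node {pod.spec.node_name} "
            f"is in Unknown state (reason={pod.status.reason}); force-deleting it"
        )
        v1.delete_namespaced_pod(
            name=pod.metadata.name,
            namespace=namespace,
            grace_period_seconds=0,
            body=client.V1DeleteOptions(grace_period_seconds=0),
        )
```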
Sounds good. I'll try to start working on this next week.
I just ran into this again. Anyone else interested in picking this up?