Handle pods in Unknown state
When nodes in a cluster misbehave, they sometimes leave pods in an Unknown state:
jupyter-user1 0/1 Unknown 0 3h 10.244.28.17 k8s-pool1-12345678-12
jupyter-user2 0/1 Unknown 1 5h 10.244.28.8 k8s-pool1-12345678-12
jupyter-user3 0/1 Unknown 1 4h 10.244.28.11 k8s-pool1-12345678-12
jupyter-user4 1/1 Unknown 0 6h <none> k8s-pool1-12345678-12
jupyter-user5 0/1 Unknown 0 3h 10.244.28.14 k8s-pool1-12345678-12
jupyter-user6 0/1 Unknown 1 3h 10.244.28.15 k8s-pool1-12345678-12
jupyter-user7 0/1 Unknown 1 4h 10.244.28.9 k8s-pool1-12345678-12
I can invoke delete on the pods but they don't actually go away. The pods are only cleared when the node has been rebooted -- stopping the node is insufficient.
Can KubeSpawner work around this? Or should it even try, given that kubernetes seems to be at fault here?
We had the same problem with the mybinder.org cluster. To remove these pods you need to delete them with --force --grace-period=0. We added a snippet to the mybinder SRE guide: http://mybinder-sre.readthedocs.io/en/latest/command_snippets.html#forcibly-delete-a-pod
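For reference, a minimal sketch of that forced delete via the official Kubernetes Python client (the pod name and namespace below are placeholders, not values from this cluster):

```python
# Sketch only: force-delete a single stuck pod, equivalent to
#   kubectl delete pod <name> -n <namespace> --force --grace-period=0
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

v1.delete_namespaced_pod(
    name="jupyter-user1",      # placeholder pod name
    namespace="jhub",          # placeholder namespace
    grace_period_seconds=0,
    body=client.V1DeleteOptions(grace_period_seconds=0),
)
```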
Not sure what kubespawner could do. I think Unknown and NodeLost happen when k8s isn't 100% sure what the state of the pod is, so having a computer/robot act on that automatically might lead to unpredictable behaviour or it doing the wrong thing?
Thanks @betatim! This is the third node we've had go into this state in the past couple of months and I think I tried that on the first occasion without success. But with any (bad) luck it will happen again soon and I'll verify whether it works or not.
Should we have an option to forcibly delete those unknown-state pods after some timeout?
I ran into this again today on PAWS and saw kubeflow/tf-operator#959. Perhaps we do want to remove these pods in an automated way when the node is lost?
It is a situation that will require admin attention (rebooting the lost node? fixing the error?) but minimizing user issues seems like a worthwhile goal.
Since user pods should not generally be automatically restarted, I think Unknown should probably be treated as a dead pod and cleaned up the same as if it had exited uncleanly.
But because this is likely an issue with the cluster, deleting a pod that is Unknown should probably come with some extra warning output about the state of the pod before deleting it, to help with diagnostics.
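To make that concrete, here is a rough sketch of what such a cleanup could look like. This is not kubespawner's actual code; it uses the Kubernetes Python client directly, and the namespace, label selector, and phase check are assumptions for illustration:

```python
# Illustrative sketch only -- not kubespawner's implementation.
# Assumes a hypothetical "jhub" namespace and the usual singleuser-server label.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

def reap_unknown_pods(namespace="jhub", label_selector="component=singleuser-server"):
    """Warn about, then force-delete, user pods stuck in the Unknown phase."""
    pods = v1.list_namespaced_pod(namespace, label_selector=label_selector)
    for pod in pods.items:
        # Note: kubectl's STATUS column is derived; checking status.phase here
        # is a simplification of how "Unknown"/"NodeLost" pods would be detected.
        if pod.status.phase != "Unknown":
            continue
        # Log the pod's last known state before deleting it, to help diagnostics.
        print(
            f"WARNING: pod {pod.metadata.name} on node {pod.spec.node_name} "
            f"is in Unknown state (reason={pod.status.reason}); force-deleting it"
        )
        v1.delete_namespaced_pod(
            name=pod.metadata.name,
            namespace=namespace,
            grace_period_seconds=0,
            body=client.V1DeleteOptions(grace_period_seconds=0),
        )
```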
Sounds good. I'll try to start working on this next week.
I just ran into this again. Anyone else interested in picking this up?