kubespawner
User's single-user notebook server Pod in k8s is not getting deleted if the pod fails to start
User's single-user notebook server Pod in k8s is not getting deleted if the pod fails to start due to any of the below reasons:
- The pod is in CrashLoopBackOff
- The pod got evicted
- The pod is stuck in Pending because of unavailable resources in the cluster

I found that even after calling the `DELETE /users/{name}/servers/{server_name}` API (a sketch of such a request is shown below), the pod doesn't get deleted, the entry in the servers table isn't removed either, and the API returns a 404 error. Can anyone please help me rectify this?
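For reference, here is a minimal sketch of that DELETE request against the JupyterHub REST API. This is hypothetical: the hub URL, token, username, and server name are placeholders, not values from this issue, and it assumes the token has permission to stop the server.

```python
# Hypothetical sketch of the DELETE call mentioned above; the hub URL,
# token, username, and server name are placeholders.
import urllib.request

hub_api = "http://localhost:8081/hub/api"   # JupyterHub REST API base URL
token = "REPLACE_WITH_ADMIN_API_TOKEN"
user = "someuser"
server = "myserver"                          # named server to delete

req = urllib.request.Request(
    f"{hub_api}/users/{user}/servers/{server}",
    method="DELETE",
    headers={"Authorization": f"token {token}"},
)
with urllib.request.urlopen(req) as resp:
    # 204 = server stopped, 202 = stop accepted but still in progress;
    # a 404 response raises urllib.error.HTTPError instead.
    print(resp.status)
```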
Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:
I hit this too. It seems like jupyterhub/user.py should handle calling self.stop(spawner.name) if the spawner fails to start the notebook, but I'm not seeing that happen.
Additionally, even though the logs indicate it was a timeout (in my case the pod failed due to OutOfcpu, so it never started and the start timeout was hit), the exception handler doesn't seem to properly detect it as a TimeoutError:
```
[E 2020-09-02 11:34:38.972 JupyterHub user:640] Unhandled error starting [email protected]'s server: pod/jupyter-test-2etest-40example-2ecom did not start in 300 seconds!
secret test-2etest-40example-2ecom-auth-state deleted.
[E 2020-09-02 11:34:39.008 JupyterHub gen:599] Exception in Future <Task finished coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.6/dist-packages/jupyterhub/handlers/base.py:845> exception=TimeoutError('pod/jupyter-test-2etest-40example-2ecom did not start in 300 seconds!',)> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 593, in error_callback
        future.result()
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/handlers/base.py", line 852, in finish_user_spawn
        await spawn_future
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 656, in spawn
        raise e
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 560, in spawn
        url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
      File "/usr/local/lib/python3.6/dist-packages/kubespawner/spawner.py", line 2005, in _start
        timeout=self.start_timeout,
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/utils.py", line 178, in exponential_backoff
        raise TimeoutError(fail_message)
    TimeoutError: pod/jupyter-test-2etest-40example-2ecom did not start in 300 seconds!

[I 2020-09-02 11:34:39.010 JupyterHub log:174] 200 GET /hub/api/users/[email protected]/server/progress ([email protected]@172.16.16.11) 272912.36ms
```
More investigation shows it is calling self.stop() in jupyterhub/user.py, because it's calling my Spawner.post_stop_hook, but I'm still unclear why it didn't delete the pod.
Ah here's the cause I think: https://github.com/jupyterhub/jupyterhub/blob/master/jupyterhub/user.py#L795-L797
It checks whether spawner.poll() returns None; if it does, it calls spawner.stop(). So my case at least (and probably yours) is not returning None from spawner.poll().
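For context, here is a simplified paraphrase of the linked jupyterhub/user.py check. It is not a verbatim copy; the exact code varies between JupyterHub versions and the surrounding error handling is omitted:

```python
# Simplified paraphrase of the linked jupyterhub/user.py logic, not verbatim.
# For a Spawner, poll() returning None means "the server is still running".
status = await spawner.poll()
if status is None:
    # Spawner says the server is still alive, so stop it; this is the path
    # that would actually delete the pod.
    await spawner.stop()
else:
    # Any non-None value is treated as "already exited", so stop() is never
    # called and a failed pod is left behind.
    spawner.log.debug("Server already stopped, exit status: %s", status)
```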
The issue is that poll() in kubespawner doesn't handle all the cases of a pod not starting correctly. In particular, it doesn't handle the case where a pod's status.containerStatuses is empty and its status.phase isn't Running: https://github.com/jupyterhub/kubespawner/blob/master/kubespawner/spawner.py#L1568-L1574
In my case data.status.phase is Failed, so the data.status.phase == 'Pending' branch doesn't get hit, and instead it returns 1 here:
```python
ctr_stat = data.status.container_statuses
if ctr_stat is None:  # No status, no container (we hope)
    # This seems to happen when a pod is idle-culled.
    return 1
```
Because spawner.poll() returns 1 instead of None, JupyterHub never calls spawner.stop(), so the pod is left behind (as mentioned above).
The logic probably needs to check status.phase more carefully: instead of looking only for Pending, it should probably check whether data.status.phase != "Running", or similar (see the sketch below).
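To make that concrete, a change along those lines might look roughly like the sketch below. This is only an illustration of the idea, not the actual patch from the branch linked next:

```python
# Sketch only; variable names follow the kubespawner excerpt quoted above.
if data.status.phase != 'Running':
    # Pending, Failed, Unknown, ...: report the pod as "still alive" (None)
    # so that JupyterHub calls spawner.stop() and the pod gets deleted.
    return None

ctr_stat = data.status.container_statuses
if ctr_stat is None:  # Running, but no container statuses reported yet
    return 1
```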
Here's my current WIP https://github.com/chancez/kubespawner/commit/e7db1dd70e6cbb64d1cbe7c7918662c8ce23add0 for this (in the fix_spawner_poll branch, which includes other changes we require for internal SSL).
@chancez I have a query: what are all the statuses that status.phase can return? What is the status when a pod goes into crashloop or gets evicted? Also, if a pod goes into a pending state due to lack of memory, what would the best action be? Should we wait for some time to see if the pod gets the requested resources before deleting it?
Possible phases are here: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase
Generally, once a pod has been scheduled, if the node doesn't have enough memory when the containers are started, I don't think the pod will ever get started.
If it gets scheduled, started, and then killed for exceeding its memory requests/limits (OOMed), the container will get Terminated (reflected in the container status), and the pod's status.phase should be Running. In this case the pod's status.conditions list will have a type: Ready condition, which will be False until the container is restarted successfully. I believe this is true for crashloops too.
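If it helps to see those fields on a live pod, they can be inspected with the official kubernetes Python client. This is purely illustrative; the pod name and namespace below are placeholders:

```python
# Illustrative only: pod name and namespace below are placeholders.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(name="jupyter-someuser", namespace="jhub")

print("phase:", pod.status.phase)  # often stays "Running" after an OOM kill
for cond in pod.status.conditions or []:
    if cond.type == "Ready":
        print("Ready condition:", cond.status)  # "False" until a successful restart
for cs in pod.status.container_statuses or []:
    if cs.last_state and cs.last_state.terminated:
        # e.g. "OOMKilled" for a container killed for exceeding its memory limit
        print("last termination reason:", cs.last_state.terminated.reason)
```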
I'm not 100% sure on what you would see for eviction, but likely something similar to the above, and eventually the pod will be deleted.