kubespawner
Resilience needed against k8s master outages
Bug description
During a kubernetes master outage, a few failed API calls leave JupyterHub in an unrecoverable state. This causes user servers to fail to spin up.
Expected behaviour
JupyterHub can correctly recover itself after failed api calls to the k8s master.
Actual behaviour
JupyterHub is left in an unrecoverable state and user servers cannot start.
Some logs are available in the following issues:
- https://github.com/2i2c-org/infrastructure/issues/1248#issuecomment-1110326838
- https://github.com/2i2c-org/infrastructure/issues/1102#issuecomment-1200486502
How to reproduce
This is an intermittent issue we are seeing on cloud-based deployments.
Your personal set up
We are running z2jh on GCP clusters. We see this issue most frequently on non-regional clusters. Our clusters that are regional do not experience this as often, if at all, because the k8s master has high availability for these clusters.
You can find more information about how we deploy our hubs here: https://github.com/2i2c-org/infrastructure
Related issues
- https://github.com/2i2c-org/infrastructure/issues/1102
- https://github.com/2i2c-org/infrastructure/issues/1103
- https://github.com/2i2c-org/infrastructure/issues/1135
- https://github.com/2i2c-org/infrastructure/issues/1248
- https://github.com/2i2c-org/infrastructure/issues/1616
Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:
@sgibson91 can you track down what JupyterHub / KubeSpawner version was used when this was observed?
Sorry, I could have done that myself. From this and the associated z2jh hub image, with requirements.txt frozen here, I conclude that jupyterhub is at version 2.3.1 and kubespawner at version 4.1.0, the latest stable versions of both, at least assuming that the docker image used by 2i2c was rebuilt when the helm chart version was bumped.
- The most recent tag was pushed ~2 weeks ago: https://quay.io/repository/2i2c/pilot-hub?tab=tags
- The most recent commit on basehub's Chart.yaml file (i.e. where we define which version of the z2jh helm chart we depend on) was Jul 6, 2022: https://github.com/2i2c-org/infrastructure/commits/master/helm-charts/basehub/Chart.yaml
Hence I believe it is safe to assume that the docker image was rebuilt when, and after, bumping the helm chart version.
I'm not sure if it's relevant, but the tag of the k8s-hub base image we define in our Dockerfile is behind that defined in the helm chart (I'm guessing these are the same thing based on comments?):
- https://github.com/2i2c-org/infrastructure/blob/dd74fe1bcf2d7aeab814116a32f21bf0b2f48a7a/helm-charts/images/hub/Dockerfile#L14
So maybe our assumptions are wrong because we have a version mismatch?
Are you still seeing the 'halting hub...' message, but the hub does not halt? That's how kubespawner is meant to recover from things like this (when in doubt: restart). So if we're back in failing-to-stop territory (we may have reintroduced a form of #525 after the switch to kubernetes-asyncio), this may be a KubeSpawner or JupyterHub bug.
Are you still seeing the 'halting hub...' message, but the hub does not halt?
Yes, we have to manually restart the hub.
EDIT: Actually the last time we saw this, we saw the "stopping your server" message but the server never stopped. So it might not quite be the same thing.
OK, if you're seeing the 'halting hub...' message, that means kubespawner has recognized something's wrong, decided to do the right thing that will fix it (confirmed by the fact that a manual shutdown of the hub does fix it), and then it somehow didn't quite happen.
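For context, the recovery pattern being described boils down to a failure callback on the watch loop that logs a critical message and exits the process, relying on Kubernetes to restart the hub pod with fresh state. A minimal, self-contained sketch of that pattern (illustrative names, not kubespawner's actual implementation):

```python
# Illustrative sketch of the fail-hard-and-restart pattern (assumed names,
# not kubespawner's actual code): a watch loop that, after repeated API
# failures, logs a critical message and exits so the container's restart
# policy can bring the hub back with fresh state.
import asyncio
import logging
import sys

log = logging.getLogger("reflector-sketch")
logging.basicConfig(level=logging.INFO)


async def fetch_pod_events():
    """Placeholder for the real Kubernetes watch call."""
    raise ConnectionError("simulated k8s API server outage")


def on_watch_failure():
    """Give up and halt the whole process; the pod restart policy recovers us."""
    log.critical("Pods watch failed, halting Hub.")
    sys.exit(1)


async def watch_pods(max_retries=3):
    failures = 0
    while True:
        try:
            await fetch_pod_events()
            failures = 0
        except Exception:
            failures += 1
            log.exception("Error watching pods (attempt %d/%d)", failures, max_retries)
            if failures >= max_retries:
                on_watch_failure()
            await asyncio.sleep(1)


if __name__ == "__main__":
    asyncio.run(watch_pods())
```

The step under suspicion in this issue is the last one: the hub logging 'halting Hub' but the process then not actually exiting.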
Unfortunately, when I try to provoke this manually by inserting explicit failures, I get the exact messages @yuvipanda mentions here:
[C 2022-08-17 15:12:43.350 JupyterHub spawner:2381] Pods reflector failed, halting Hub.
ERROR:asyncio:Task was destroyed but it is pending!
task: <Task pending name='Task-15' coro=<shared_client.<locals>.close_client_task() running at /Users/minrk/dev/jpy/hub/kubespawner/kubespawner/clients.py:58> wait_for=<Future pending cb=[Task.task_wakeup()]>>
Exception ignored in: <coroutine object shared_client.<locals>.close_client_task at 0x111cb6490>
RuntimeError: coroutine ignored GeneratorExit
but then the Hub exits. This is with jupyterhub 2.3.1, kubespawner f3d44e6e51d2ca16e787963562e5568f2a5494e0 .
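As an aside, the `Task was destroyed but it is pending!` warning above is asyncio's generic complaint when a still-pending task is garbage-collected after its event loop has gone away. A standalone reproduction of just that mechanism (nothing to do with kubespawner's actual code):

```python
# Generic reproduction of asyncio's "Task was destroyed but it is pending!"
# warning: a long-lived cleanup task never finishes before the loop is
# closed, so asyncio complains when the still-pending task is collected.
import asyncio


async def close_client_task():
    # Stand-in for a cleanup coroutine waiting for a shutdown signal that
    # never arrives.
    await asyncio.sleep(3600)


loop = asyncio.new_event_loop()
task = loop.create_task(close_client_task())
loop.run_until_complete(asyncio.sleep(0.1))  # let the task start, then stop
loop.close()
# When `task` is garbage-collected (at the latest, at interpreter exit),
# asyncio logs: ERROR:asyncio:Task was destroyed but it is pending!
```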
I'm not sure if it's relevant, but the tag of the k8s-hub base image we define in our Dockerfile is behind that defined in the helm chart (I'm guessing these are the same thing based on comments?): So maybe our assumptions are wrong because we have a version mismatch?
As I mentioned here, I'm not sure we can be confident about what versions we are running, since the base image we are building our hub image from has a different tag to the z2jh chart in our helm charts. I'm not sure how the two are linked, but there is a comment in our Chart.yaml that the base image tag needs bumping when the helm chart version is bumped.
One way to find out:
kubectl exec -it hub-... -- pip list
This is taken from the staging hub running on our main 2i2c cluster (so I'm confident it isn't using its own, custom hub image). It seems we are a little further ahead in the jupyterhub version than we first surmised, but the kubespawner version is what we guessed.
$ k -n staging exec -it hub-7d6c86fcff-xhwh2 -c hub -- pip list
Package Version
-------------------------------- -----------
[CUT]
jupyterhub 2.3.2.dev0
[CUT]
jupyterhub-kubespawner 4.1.0
[CUT]
And this is taken from the hub that last saw this issue (it's running on the same cluster, different namespace). It confirms the same versions.
$ k -n ohw exec -it hub-7cbc97f77f-br9jf -c hub -- pip list
Package Version
-------------------------------- -----------
[CUT]
jupyterhub 2.3.2.dev0
[CUT]
jupyterhub-kubespawner 4.1.0
[CUT]
OK, then it's got every fix I'm aware of. Is it possible to capture more detailed logs from the hub? Maybe even with debug-logging on? The trick is likely that there could be a long time between when the problem occurs and when it's observed, so a lot of logs might be needed to track it down.
My hunch is that the switch to kubernetes-asyncio away from threads for the reflectors has changed something in how these failures are observed.
But I'm pretty stumped about how the halting Hub event is firing and the container is not actually exiting, even if it's showing all the statements that lead up to a successful exit that I see.
Is it possible to capture more detailed logs from the hub? Maybe even with debug-logging on?
I will confer with Yuvi as to whether this is possible for us
This happened to the staging hub as well right? I figure a reasonable choice is to enable debugging logs for staging hubs specifically.
You can do it via z2jh config, I think: debug=true.
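For reference, that flag is assumed to boil down to ordinary JupyterHub traitlets settings on the hub side. A hedged sketch of the equivalent jupyterhub_config.py (in z2jh itself you would set this through helm values rather than editing config by hand):

```python
# jupyterhub_config.py -- hedged sketch of the debug-logging settings the
# z2jh debug flag is assumed to toggle; these are standard JupyterHub
# traitlets rather than anything z2jh-specific.
c = get_config()  # noqa: F821 -- injected by JupyterHub when loading config

# Verbose hub-side logging, including spawner activity.
c.JupyterHub.log_level = "DEBUG"

# Optionally run single-user servers with debug logging as well.
c.Spawner.debug = True
```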
This happened to the staging hub as well right?
Yeah, when it happens usually all hubs on the cluster go down because it's caused by the k8s master outage. So I figure enable it on staging and maybe one or two others as a fallback?
@sgibson91 Not being confident about whether and for how long 2i2c retains logs, or the amount of data that would accumulate, I'd go :+1: only on the staging hubs for now. But if you feel at ease with that aspect, then :+1: also for a production hub if you think it can be useful generally.
I discovered we have a logs backup in GCP! So I think retention and space is good, but I will double check.
@sgibson91 nice!
I've been part of incidents where logs arrived in such large amounts that ingesting and storing them incurred very significant cost. Sometimes more logging is worse: it can incur costs like that, and logs with too many lines of added information can be harder to read through.
It's entirely possible regular logs over an extended period are sufficient, if you still have them surrounding any recent incidents. The main thing is capturing quite a bit of time to sift through.
We just had another incident happen about an hour ago so I'm looking through the logs now
So I can see a user server pending stop for a long time, a server delete request that triggers a message about the delete already being active, and then a hub shutdown with complaints about leaving single-user servers running and not being able to clean up the proxy. This is when we start to see the asyncio/tornado errors that were reported in other issues.
gmanuch is the user that reported the issue. There should be a request from me to try to start a server, to test whether this was the same as last time, before I triggered the hub shutdown to try to clear it.
I've scrolled through logs over the time period from 11:30 UTC up to 17:00 UTC. Everything looks normal until this 303 GET to gmanuch's lab endpoint comes in. Then we start seeing the message that that server is pending stop. As far as I can tell, it's the only call to /lab in that time frame. @abkfenris do you know if you have any JupyterLab modifications in your image?
I see you have default URL set as lab for the Python image
https://github.com/2i2c-org/infrastructure/blob/8c8e37b012a4c42f916df767b47db1b98d91190c/config/clusters/2i2c/ohw.values.yaml#L50
For our Python environment (lockfile), we should be getting JupyterLab via pangeo-notebook=2022.07.12, and the only packages that I can think of that would directly change things would be jupyter_contrib_nbextensions and jupyterlab-git. We shouldn't be doing any other modifications during our build.
Small note that I've also spotted a W 2022-08-25 16:36:58.665 JupyterHub base:393] Invalid or expired cookie token a couple of minutes after the first "is pending stop" message (which is coming from here).
Maybe the issue is an inconsistent state because of the following:
- because the cookie got invalidated, the user object cannot be retrieved by user_for_cookie, and so neither can its server, which may never have gotten the stop request
- but the SpawnPendingHandler registered the stop request for that particular user and knows it has to wait for it to finish, which will never happen, since the user's access got invalidated and the hub doesn't know how to trace the request back to the correct user and server
complaints about leaving single-user servers running
Sorry about this, it's not complaining - this is normal behavior, chosen by z2jh configuration (as opposed to shutting down user servers when the hub shuts down).
I don't have access to the logs in context, but there should at least be an event about triggering the initial stop prior to 16:34:46. Are there any logs about the username leading up to this?
I think it's very unlikely that credentials are the issue, because the SpawnPendingHandler won't do anything if the credentials aren't up to date. Pending state is stored on the Spawner object itself, not related to authentication of any given request, and doesn't get registered until the stop has actually begun here.
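To make that distinction concrete, here is a simplified, illustrative sketch (not JupyterHub's actual implementation) of pending-stop bookkeeping living on the spawner object itself. If the underlying stop() call never returns, the pending flag never clears, which matches the repeating "is pending stop" log lines:

```python
# Simplified sketch (assumed names, not JupyterHub's real classes) of the
# point above: "pending stop" is state on the spawner object, set only once
# the stop has actually begun, and independent of any request's cookies.
import asyncio


class SpawnerSketch:
    def __init__(self):
        self._stop_pending = False

    @property
    def pending(self):
        return "stop" if self._stop_pending else None

    async def stop(self):
        """Placeholder for the real (potentially slow) KubeSpawner.stop."""
        await asyncio.sleep(1)


class UserSketch:
    def __init__(self):
        self.spawner = SpawnerSketch()

    async def stop(self):
        spawner = self.spawner
        spawner._stop_pending = True  # registered only once the stop begins
        try:
            await spawner.stop()      # if this never returns, pending never clears
        finally:
            spawner._stop_pending = False


async def main():
    user = UserSketch()
    stop_task = asyncio.create_task(user.stop())
    await asyncio.sleep(0)                    # let the stop begin
    print("pending:", user.spawner.pending)   # -> "stop" while stop() runs
    await stop_task
    print("pending:", user.spawner.pending)   # -> None once stop completes


if __name__ == "__main__":
    asyncio.run(main())
```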
But is this the same as the original issue? One user being stuck definitely could be caused by the same issue as the original (reflector problems, general cluster communication problems), but it could also be a different problem related to one call to KubeSpawner.stop never returning, or taking a long time. The timeout for this method is long (at least 5 minutes, I think, depending on configuration), so a minute or two may not mean anything is wrong, just slow.
I don't have access to the logs in context, but there should at least be an event about triggering the initial stop prior to 16:34:46. Are there any logs about the username leading up to this?
Searching for the username (starting at 14:53, the earliest I checked), there are only logs like the one below every 5 minutes:
[I 2022-08-25 14:53:33.732 JupyterHub log:189] 200 POST /hub/api/users/gmanuch/activity ([email protected]) 13.27ms
And then the 303 that's in the logs shared by @sgibson91:
[I 2022-08-25 16:34:46.423 JupyterHub log:189] 303 GET /hub/user/gmanuch/lab ([email protected]) 12.86ms
2022-08-25 19:34:46.487 EEST
[I 2022-08-25 16:34:46.486 JupyterHub pages:401] gmanuch is pending stop
2022-08-25 19:34:46.489 EEST
[I 2022-08-25 16:34:46.489 JupyterHub log:189] 200 GET /hub/spawn-pending/gmanuch?next=%2Fhub%2Fuser%2Fgmanuch%2Flab ([email protected]) 5.66ms
2022-08-25 19:34:52.043 EEST
[I 2022-08-25 16:34:52.043 JupyterHub pages:401] gmanuch is pending stop
2022-08-25 19:34:57.207 EEST
[I 2022-08-25 16:34:57.207 JupyterHub pages:401] gmanuch is pending stop
2022-08-25 19:35:02.358 EEST
[I 2022-08-25 16:35:02.358 JupyterHub pages:401] gmanuch is pending stop
2022-08-25 19:35:07.524 EEST
[I 2022-08-25 16:35:07.524 JupyterHub pages:401] gmanuch is pending stop
2022-08-25 19:35:12.681 EEST
[I 2022-08-25 16:35:12.680 JupyterHub pages:401] gmanuch is pending stop
@minrk I invited you to the project with Logs Viewer role (I'm hoping that's enough for you to do what you need to, but let me know if it's not), so this link should work for you now.
@GeorgianaElena is right though, I checked from 11am UTC and only found calls to the activity endpoint up until that 303 to /lab
Why this stuff is hard: the relevant error occurred at 2022-08-24 20:00 UTC, while the issue was observed at 2022-08-25 16:35 UTC, fully 20 hours later.
Thank you @minrk 🙏
Also relevant is a longstanding issue in JupyterHub: https://github.com/jupyterhub/jupyterhub/issues/1677. We don't have timeouts on Spawner.stop because it's unclear what the state should be when stop times out or fails. The fact that these user.stop() calls are taking literally forever produces symptoms that are very hard to identify; it could have been clearer if there were timeouts being logged. Plus, that would at least mean you can't be stuck in 'pending' forever (though what state you should be in is unclear).
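For illustration only (not a proposed JupyterHub change, since the question of what state you end up in after a timeout is exactly what jupyterhub#1677 leaves open), a sketch of the kind of guard being described: wrapping a stop call with a timeout so a hang at least shows up in the logs.

```python
# Sketch of wrapping a stop() coroutine with a timeout so that a hang is at
# least logged, rather than leaving the user stuck in 'pending' silently.
# Names are illustrative; this is not JupyterHub's actual code path.
import asyncio
import logging

log = logging.getLogger("stop-timeout-sketch")
logging.basicConfig(level=logging.INFO)


async def stop_with_timeout(stop_coro_factory, timeout=300):
    """Run a spawner-style stop() coroutine, logging if it exceeds `timeout` seconds."""
    try:
        await asyncio.wait_for(stop_coro_factory(), timeout=timeout)
    except asyncio.TimeoutError:
        # The spawner is now in an ill-defined state; at least make it visible.
        log.error("stop() did not finish within %ss", timeout)
        raise


async def hung_stop():
    """Simulate a stop call that never returns."""
    await asyncio.sleep(10**6)


if __name__ == "__main__":
    try:
        asyncio.run(stop_with_timeout(hung_stop, timeout=2))
    except asyncio.TimeoutError:
        pass  # already logged above
```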