
Consider defaulting k8s_api_threadpool_workers to c.JupyterHub.concurrent_spawn_limit

Open mriedem opened this issue 5 years ago • 9 comments

c.KubeSpawner.k8s_api_threadpool_workers defaults to 5*ncpu [1], which is also what a ThreadPoolExecutor in Python defaults to [2]. The description of that option says:

Increase this if you are dealing with a very large number of users.

In our setup the core node where the hub pod runs is a 4-CPU node, because the hub doesn't go beyond 1 CPU. This means that by default k8s_api_threadpool_workers only gets 20 workers.
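For reference, the default from [1] is a traitlets Integer computed from the CPU count, roughly like this (a paraphrased sketch with an illustrative class name, not the actual KubeSpawner source):

import multiprocessing

from traitlets import Integer
from traitlets.config import Configurable

class SpawnerDefaults(Configurable):
    # 5 * ncpu, so 20 workers on the 4-CPU core node described above
    k8s_api_threadpool_workers = Integer(
        5 * multiprocessing.cpu_count(),
        config=True,
        help="Number of threads in which to run blocking Kubernetes API calls.",
    )

print(SpawnerDefaults().k8s_api_threadpool_workers)  # prints 20 on a 4-CPU node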

The c.JupyterHub.concurrent_spawn_limit option defaults to 100 [3], but in zero-to-jupyterhub-k8s it is set to 64 [4].

It seems that if you have a lot of users logging in and spawning notebook pods at the same time, like at the beginning of a large user event, you would want k8s_api_threadpool_workers aligned with concurrent_spawn_limit; otherwise those spawn requests could be left waiting on the thread pool.

We could default k8s_api_threadpool_workers to concurrent_spawn_limit, or at least mention the relationship between the two options in the config option help docs.

[1] https://github.com/jupyterhub/kubespawner/blob/5521d573c272/kubespawner/spawner.py#L199
[2] https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor
[3] https://jupyterhub.readthedocs.io/en/stable/api/app.html#jupyterhub.app.JupyterHub.concurrent_spawn_limit
[4] https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/e4b9ce7eab5c17325e93975de1d6b4a200d47cd8/jupyterhub/values.yaml#L16

mriedem avatar Jul 21 '20 13:07 mriedem

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively. You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

welcome[bot] avatar Jul 21 '20 13:07 welcome[bot]

@minrk FYI

mriedem avatar Jul 21 '20 13:07 mriedem

I'm still investigating how JupyterHub uses the spawner, because it looks like the KubeSpawner instance is not a singleton but is created per user [1][2]. And if you're not using named servers (which is the default), each User is restricted to a single Spawner for the default ("") server. However, even though we have 1:1 user:spawner instances, the ThreadPoolExecutor within KubeSpawner is a singleton shared across all instances [3]. That's why making the thread pool match the concurrent spawn limit could help with throughput when a lot of users are logging in and spawning servers at once.

[1] https://github.com/jupyterhub/jupyterhub/blob/1.1.0/jupyterhub/user.py#L166
[2] https://github.com/jupyterhub/jupyterhub/blob/1.1.0/jupyterhub/user.py#L286
[3] https://github.com/jupyterhub/kubespawner/blob/v0.8.1/kubespawner/spawner.py#L56
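To illustrate what "singleton across all instances" means for the executor, here is a minimal sketch (illustrative class only, not kubespawner source):

from concurrent.futures import ThreadPoolExecutor

class SpawnerSketch:
    # class-level attribute: one pool shared by every per-user spawner instance
    executor = None

    def __init__(self, workers=20):
        cls = type(self)
        if cls.executor is None:
            cls.executor = ThreadPoolExecutor(max_workers=workers)

a, b = SpawnerSketch(), SpawnerSketch()
assert a.executor is b.executor  # every user's spawner shares the same pool of workers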

mriedem avatar Jul 21 '20 14:07 mriedem

@mriedem I love this writeup! Thank you!

I'm not sure whether c.KubeSpawner.k8s_api_threadpool_workers should or needs to match c.JupyterHub.concurrent_spawn_limit, but maybe?

As I recall from learning about KubeSpawner, every spawned pod gets a unique Python instance of the configured JupyterHub spawner. But since KubeSpawner needs to be aware of what goes on in the k8s namespace, it doesn't do that monitoring at the Python instance level; instead it uses the singleton pattern to create only one namespaced resource reflector that keeps track of, for example, all pods.

So, if we have 100 spawners starting user pods at the same time, that will certainly trigger 100 requests to the k8s API to create pods, but each of those is a single HTTP POST request I think. So I don't think the individual spawners, even during a startup procedure, will require a thread pool worker. Or do they? I guess the crux is to better understand the use of these workers.

Usage of the thread pool workers explored

Imports https://github.com/jupyterhub/kubespawner/blob/f50b4b7493a4d409e39116179bfe3bfbd849604b/kubespawner/spawner.py#L15-L20

Class definitions https://github.com/jupyterhub/kubespawner/blob/f50b4b7493a4d409e39116179bfe3bfbd849604b/kubespawner/spawner.py#L122-L131

Singleton pattern - initialization https://github.com/jupyterhub/kubespawner/blob/f50b4b7493a4d409e39116179bfe3bfbd849604b/kubespawner/spawner.py#L174-L177

@run_on_executor The executor field of the class is not explicitly referenced in the KubeSpawner source code, but the @run_on_executor decorator makes use of it, as it checks for that field name by default. This is documented here.

The only use of the decorator is here. https://github.com/jupyterhub/kubespawner/blob/f50b4b7493a4d409e39116179bfe3bfbd849604b/kubespawner/spawner.py#L1580-L1582
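A toy example of how the decorator behaves (the class and method here are made up for illustration, not kubespawner code):

import time

from concurrent.futures import ThreadPoolExecutor
from tornado.concurrent import run_on_executor
from tornado.ioloop import IOLoop

class ApiClient:
    # run_on_executor looks up an attribute named "executor" by default
    executor = ThreadPoolExecutor(max_workers=4)

    @run_on_executor
    def blocking_call(self, seconds):
        time.sleep(seconds)  # stand-in for a blocking k8s API request
        return "done"

async def main():
    client = ApiClient()
    # calling the decorated method submits it to the pool and returns a Future we can await
    print(await client.blocking_call(0.1))

IOLoop.current().run_sync(main)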

asynchronize()

The spawner makes use of this function when starting and stopping. Every time the start and stop functions call asynchronize they await it with yield (the pre-async/await coroutine style).

asynchronize() relating to PVC creation (start) https://github.com/jupyterhub/kubespawner/blob/5521d573c272e583651a9d16193adb0bdb877df9/kubespawner/spawner.py#L332-L341

https://github.com/jupyterhub/kubespawner/blob/5521d573c272e583651a9d16193adb0bdb877df9/kubespawner/spawner.py#L1772-L1808

asynchronize() relating to Pod creation (start) https://github.com/jupyterhub/kubespawner/blob/f50b4b7493a4d409e39116179bfe3bfbd849604b/kubespawner/spawner.py#L1810-L1838

asynchronize() relating to Pod deletion (stop) https://github.com/jupyterhub/kubespawner/blob/f50b4b7493a4d409e39116179bfe3bfbd849604b/kubespawner/spawner.py#L1888-L1914

Exploration conclusion

Each spawner does indeed occupy one worker during startup, but only while waiting on the following calls (the first two are not even run by default). So the only actual issue would be if we ended up waiting a long time on kubectl create pod, but that is a very quick HTTP POST request that pretty much just returns whether the pod specification was valid and accepted or not; see the rough estimate after the list below.

  • kubectl create pvc ... - create_namespaced_persistent_volume_claim
  • kubectl get pvc ... - read_namespaced_persistent_volume_claim
  • kubectl create pod ... - create_namespaced_pod
  • kubectl delete pod ... - delete_namespaced_pod
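As a rough back-of-envelope (the per-request latency here is an assumption for illustration, not something measured in this thread): if each create_namespaced_pod POST takes on the order of 100 ms, then even the default 20 workers clear 64 queued spawns in roughly 64 / 20 * 0.1 s ≈ 0.3 s, which is negligible next to the time the pod itself takes to start.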

Current thinking

I lean towards thinking that the two configuration options c.JupyterHub.concurrent_spawn_limit and c.KubeSpawner.k8s_api_threadpool_workers don't need to be aligned.

consideRatio avatar Jul 22 '20 06:07 consideRatio

To decide if the thread pool size needs to be bigger or not, I think we need a measurement of how many requests to the thread pool end up having to wait. My intuition is similar to Erik's in that I think each spawn should only use a slot in the thread pool for a second or so while it is sending the POST request. Maybe that isn't true though, which is where some measurements of how long these requests take, and how often a request ends up getting queued before being executed, would help.

betatim avatar Jul 22 '20 07:07 betatim

@yuvipanda do you think c.JupyterHub.concurrent_spawn_limit should be adjusted up or down, or is it about right for z2jh? You set its default value to 64 three years ago in z2jh.

consideRatio avatar Jul 22 '20 07:07 consideRatio

@consideRatio thanks for digging into this in detail.

I did change c.KubeSpawner.k8s_api_threadpool_workers to match c.JupyterHub.concurrent_spawn_limit in our z2jh extraConfig value like this:

c.KubeSpawner.k8s_api_threadpool_workers = c.JupyterHub.concurrent_spawn_limit

I'm assuming that worked since the hub started up fine, but I'm not sure if the value was actually assigned correctly since I don't know how to dump the hub's settings at runtime [1].
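One thing worth double-checking (my assumption about traitlets behavior, not something verified in this thread): if c.JupyterHub.concurrent_spawn_limit isn't set explicitly earlier in the same config file, the right-hand side of that assignment is a traitlets LazyConfigValue rather than the integer 64. A sketch that sets and echoes both values in jupyterhub_config.py:

# sketch for jupyterhub_config.py
c.JupyterHub.concurrent_spawn_limit = 64
c.KubeSpawner.k8s_api_threadpool_workers = c.JupyterHub.concurrent_spawn_limit
print(
    "concurrent_spawn_limit=%s k8s_api_threadpool_workers=%s"
    % (c.JupyterHub.concurrent_spawn_limit, c.KubeSpawner.k8s_api_threadpool_workers)
)  # shows up in the hub pod's startup logs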

Assuming it was correctly configured, I ran a load testing script to create 400 users (POST /users), start the user notebook servers (pods) in batches of 10 (using a ThreadPoolExecutor, since the POST /users/{name}/server API can take a bit, about 7-10 seconds in our environment), and then wait for them to report ready: true.
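For reference, the load test amounted to something like the following (a simplified sketch; the hub URL and token are placeholders, and the readiness polling is elided):

import concurrent.futures

import requests

HUB_API = "http://127.0.0.1:8081/hub/api"               # placeholder hub API URL
HEADERS = {"Authorization": "token <admin-api-token>"}  # placeholder admin token

def create_and_spawn(username):
    # create the user, then start their default notebook server (pod)
    requests.post(f"{HUB_API}/users/{username}", headers=HEADERS)
    requests.post(f"{HUB_API}/users/{username}/server", headers=HEADERS)
    return username

usernames = [f"loadtest-{i}" for i in range(400)]
# batches of 10 concurrent spawn requests, since POST /users/{name}/server can take 7-10s
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    list(pool.map(create_and_spawn, usernames))
# ...then poll GET /users/{name} until each server reports ready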

Comparing times between having c.KubeSpawner.k8s_api_threadpool_workers at the default (20 for us on a 4-CPU core node) and then set to c.JupyterHub.concurrent_spawn_limit (64 per z2jh), it was slightly faster, but only by about 3%, which is probably within the margin of error; I'm guessing if I ran both scenarios more times and averaged them out the gain wouldn't be very noticeable. This likely reinforces the idea that the thread pool size is not an issue.

As for how this could be measured, I'm not really sure how to measure the time a Future spends waiting in the pool before it is executed. It might be possible to track the overall time spent on a Future by using add_done_callback and passing in a partial function that captures a start time and computes the elapsed time when the callback fires. That wouldn't really tell us how long the Future sat in the pool, but it could be a reasonable warning flag if you set a threshold and log a warning when a request takes more than x seconds to complete. I don't see an easy way to track wait time in the thread pool with the standard library, and sub-classing ThreadPoolExecutor to time things doesn't seem like much fun either (I guess it depends on your idea of fun). Other ideas?
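A sketch of that add_done_callback idea (a hypothetical helper, not kubespawner code); it measures queue wait plus execution time together, which is exactly the limitation described above, but it does give the threshold-based warning:

import functools
import logging
import time

logger = logging.getLogger(__name__)

def submit_timed(executor, fn, *args, warn_after=5.0, **kwargs):
    # Submit fn to the executor and warn if the Future takes too long end to end.
    submitted = time.monotonic()
    future = executor.submit(fn, *args, **kwargs)

    def _log_elapsed(start, fut):
        elapsed = time.monotonic() - start
        if elapsed > warn_after:
            # elapsed covers time queued in the pool plus execution time
            logger.warning("%s took %.2fs to complete", fn.__name__, elapsed)

    future.add_done_callback(functools.partial(_log_elapsed, submitted))
    return future

# usage (with any concurrent.futures executor, e.g. the spawner's shared pool):
#   submit_timed(executor, some_blocking_k8s_call, warn_after=2.0)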

[1] https://discourse.jupyter.org/t/is-there-a-way-to-dump-hub-app-settings-config/5305

mriedem avatar Jul 22 '20 18:07 mriedem

This issue has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/scheduler-insufficient-memory-waiting-errors-any-suggestions/5314/4

meeseeksmachine avatar Jul 23 '20 23:07 meeseeksmachine

I opened an issue about my preferred solution to this here: https://github.com/jupyterhub/kubespawner/issues/421.

yuvipanda avatar Jul 27 '20 07:07 yuvipanda

This was fixed by https://github.com/jupyterhub/kubespawner/issues/421 and we don't have to deal with threadpools anymore!

yuvipanda avatar Sep 29 '23 20:09 yuvipanda