Z2JH 4.0.0 fails to detect old named server PVCs due to an overridden template in 3.3.8
Bug description
Reported in https://discourse.jupyter.org/t/jupyterhub-helm-chart-4-0-0/29887/2
How to reproduce
- Deploy Z2JH 3.3.8 with persistent user storage
- Start and stop a named server
- Upgrade Z2JH to 4.0.0
- Start the named server, a new PVC is created
Expected behaviour
Z2JH should use the existing named PVC
Actual behaviour
3.3.8 set pvcNameTemplate: claim-{username}{servername}, overriding the KubeSpawner 6.2.0 default of claim-{username}--{servername}
https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/3.3.8/jupyterhub/values.yaml#L395
https://github.com/jupyterhub/kubespawner/blob/6.2.0/kubespawner/spawner.py#L569
4.0.0 uses the default from KubeSpawner https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/4.0.0/jupyterhub/values.yaml#L416
As a result the automatic legacy PVC detection in KubeSpawner fails, since it assumes the 6.2.0 template is used https://github.com/jupyterhub/kubespawner/blob/a8f28439078e42e8012de8f141b51bd6fa96d9c7/kubespawner/spawner.py#L3118
[D 2024-11-15 14:05:10.037 JupyterHub spawner:3117] Checking for legacy-named pvc claim-test-40example-2eorg--named for [email protected]
Instead of the existing claim-test-40example-2eorgnamed.
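Concretely, the chart-level difference looks roughly like this (a paraphrase of the two values.yaml files linked above; the key path assumes the chart's standard singleuser.storage.dynamic block):

```yaml
# z2jh 3.3.8 shipped this override of KubeSpawner's default:
singleuser:
  storage:
    dynamic:
      pvcNameTemplate: claim-{username}{servername}

# z2jh 4.0.0 no longer sets pvcNameTemplate, so KubeSpawner falls back to its own
# default, and the legacy-PVC check looks for names shaped like
# claim-{username}--{servername}, which such a deployment never created.
```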
Short-term, a deployment can work around this by
- explicitly setting the old template in config
- starting and stopping all servers, which persists the pvc name in spawner state
- removing the old template config; since PVC names are persisted now, new users will use the new template while existing users will keep their PVCs
alternatively, we could write a bulk-rename script to rename PVCs.
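A minimal sketch of the first step (the template value mirrors what 3.3.8 shipped, linked above; the key path assumes the chart's standard singleuser.storage.dynamic block):

```yaml
singleuser:
  storage:
    dynamic:
      # re-apply the template z2jh 3.3.8 shipped so existing PVC names match again;
      # remove this override once every affected server has started and stopped,
      # which persists pvc_name in the spawner state
      pvcNameTemplate: claim-{username}{servername}
```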
@minrk A bulk-rename script would work for our case. It could also support downgrading, if that's within scope. It may have to deal with servers that now have two PVCs associated with them.
yes, servers that have upgraded and been associated with new pvcs are going to be the hardest, because we can't know if there's any data in the new pvcs that needs to be preserved, so the only safe option is to merge their contents, which we can't do automatically. But we could at least:
- safely rename every old pvc, and
- if any pvc created under the new name would result in a collision, rename the new one identifiably so that they can be manually inspected for merge
or perhaps the simplest is a script that iterates over existing PVCs and servers and pushes the pvc name into Spawner.state (i.e. effectively the same thing as preserving the old template, starting/stopping servers, then updating the template).
Here is a script that does the last suggestion: https://gist.github.com/minrk/7ad21b18f7a5b3908d74f108dc564cd5
it collects all pvcs and spawners, then associates spawners with their pvcs.
- if there's only one match, persist the link in spawner.state (avoids the above bug, as if past versions of kubespawner had implemented pvc_name persistence)
- if there's more than one, it picks the oldest match (this is what will happen for deployments that have already been upgraded, where the template mismatch caused the creation of a new, empty pvc)
it doesn't touch the pvcs themselves, just the association to the servers. Any dealing with removing new pvcs / merging content is something that has to be taken on a case-by-case basis.
Do you think we should put your script into an optional initContainer in this chart? Perhaps with a flag to control how to resolve multiple matching PVCs (oldest, newest, or return an error to force the admin to sort things out before starting JupyterHub).
Yes, possibly. I'm also investigating whether this belongs in the legacy pvc check in kubespawner, because I think we can change that to consider labels and annotations as well, so it's not sensitive to template changes across the upgrade.
Either way, if we run this automatically for users, we will need to be careful to consider valid use cases of deliberate sharing of PVCs across servers, e.g.
- one pvc per user for all servers: pvc_name_template = '{username}'
- shared pvc across users for a given named server: pvc_name_template = '{servername}'
- one pvc for everyone (?!): pvc_name_template = 'shared'
- manually created PVCs out-of-band, may be missing labels and/or annotations which were not checked previously
whereas it is appropriate for the script above to only handle the default template case, since it is only run by hand, so users have a chance to make a decision whether it is appropriate before it runs.
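For illustration, the first of those sharing setups would look roughly like this in chart config (a sketch; the template value is taken from the list above, and the key path assumes the standard singleuser.storage.dynamic block):

```yaml
singleuser:
  storage:
    dynamic:
      # one PVC per user, shared by all of that user's servers
      pvcNameTemplate: "{username}"
```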
Nice script!
Tried running cat collect_pvcs.py | kubectl exec -i $(kubectl get pod -l component=hub -o name) -- python3 - --apply, and got "Made # changes", but it doesn't seem to have actually applied the changes. If I run the command again, it still says "Made # changes". The servers continue to attach to the newest PVC, and a new PVC is created even if the old one exists.
Can you share more of the output of your script run (feel free to obfuscate usernames)? Does the output change at all from one run to the next?
Output doesn't seem to change at all for either exec.
Running cat collect_pvcs.py | kubectl exec -i $(kubectl get pod -l component=hub -o name) -- python3 | grep "dry run"
Output:
!!!!! linking server [email protected]/vector-embeddings to pvc claim-andrew-40example-2ecomvector-2dembeddings (dry run) !!!!
!!!!! linking server [email protected]/datasets-exploratory to pvc claim-andrew-40example-2ecomdatasets-2dexploratory (dry run) !!!!
!!!!! linking server [email protected]/modeling to pvc claim-andrew-40example-2ecommodeling (dry run) !!!!
This was a dry run, no changes were made.
Then we run cat collect_pvcs.py | kubectl exec -i $(kubectl get pod -l component=hub -o name) -- python3 - --apply
Output:
...
Made 3 changes
However, it seems that nothing actually changes, since we get the same outputs if we run the same two commands again: the same dry runs appear without --apply, and applying always outputs the same "Made # changes".
Sorry it doesn't seem to be working for you. Do you have JupyterHub running with active users, and is it running z2jh 4 or 3? Specifically, have any of the affected servers started or stopped?
I updated the script to log a bit more about what's going on. It shouldn't do anything different, but it might tell us why.
When I run the script for the first time with:
cat collect_pvcs.py | kubectl exec -i $(kubectl get pod -l component=hub -o name) -- python3 - --apply
I get several 'linking server...' messages, but if I run the same command a second time, I get:
username/ has pvc claim-username
username/ is linked to only matching pvc claim-username (good)
...
Nothing to do!
Happy new year!
Hmm, this script seems to have worked now! Not sure if there was a specific state change that triggered it, since it seems only logging was added, but now legacy-named claims are linked to their respective servers on --apply.
Thank you!
I'm still confused how to upgrade. Are there any clear instructions posted?
I am now experiencing this issue, even after upgrading to version 4.1.0.
I seem to have found the proper documentation to fix this.
Dropping this in case an upgrader never caught this: https://z2jh.jupyter.org/en/stable/administrator/upgrading/upgrade-3-to-4.html#kubespawner-7
```yaml
static:
  pvcName: "jupyterhub-user-home-pvc"
  subPath: "{escaped_username}"
```
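For context, that snippet sits under singleuser.storage in the chart values with the storage type set to static, roughly like this (a sketch based on the upgrade docs linked above and the configs quoted later in this thread; pvcName must reference an existing PVC that holds the users' home directories):

```yaml
singleuser:
  storage:
    type: static
    static:
      pvcName: "jupyterhub-user-home-pvc"
      subPath: "{escaped_username}"
```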
I can confirm this is an issue for some users, not all. We also upgraded from 3.3.8 to 4.0.0, and for one of the users the old PVC was not recognized (not sure why) and a new one kept being created. I tried a bunch of configs but could not resolve it at all. We had to revert JupyterHub to 3.3.8 by uninstalling and reinstalling, since the migrations are not backward compatible :(
These are the changes I have tried:
1. As per their doc (https://z2jh.jupyter.org/en/stable/administrator/upgrading/upgrade-3-to-4.html#kubespawner-7) I used the following config; it was not able to start the pod, complaining about the PVC not being found, and slug_scheme was not making any difference at all.
```yaml
hub:
  config:
    KubeSpawner:
      slug_scheme: escape
singleuser:
  storage:
    type: static
    static:
      subPath: "{escaped_username}"
```
Error we encountered:
```
"status":"Failure","message":"Pod \"jupyter-username-40domain-2ecom\" is invalid: [spec.volumes[0].persistentVolumeClaim.claimName: Required value, spec.containers[0].volumeMounts[0].name: Not found: \"home\"]","reason":"Invalid"
```
2. Then I tried this config. Now at least we were able to recognize the old PVC using pvcName: claim-{username}, but then there was nothing in the home folder after login, which the documentation says the subPath config should fix; unfortunately it did not.
```yaml
hub:
  config:
    KubeSpawner:
      slug_scheme: escape
singleuser:
  storage:
    type: static
    static:
      pvcName: claim-{username}
      subPath: "{escaped_username}"
```
3. Third thing I tried: since the PVC type is supposed to be dynamic (the default in the previous version, I guess?), because each user has their own PVC, I updated to the config below. This also did not work.
```yaml
hub:
  config:
    KubeSpawner:
      slug_scheme: escape
singleuser:
  storage:
    type: dynamic
    dynamic:
      subPath: "{escaped_username}"
      pvcNameTemplate: claim-{username}
```
Not sure what else to try, so we ended up reverting to the previous version by uninstalling and reinstalling JupyterHub using helm.
I am using jupyterhub helm chart 4.1.0.
When I delete and redeploy the helm chart, the same PV is not getting attached to the PVC.
When the helm chart is deleted, the PVC named claim-{username} also gets deleted, so the dynamically created PV enters the Released state. When redeployed, a new PVC is created, which does not get attached to the same volume since its state has changed from Bound to Released.
We need a configuration to keep the PVC when the helm chart is deleted, so the PV doesn't enter the Released state.
I tried setting helm.sh/resource-policy: keep in extra_storage_annotations, but it is not respected.
Any other ways to prevent it from being deleted??
Dumping a recollection on things.
I recall that there was automation in KubeSpawner to detect old PVC names, but that it relied on reading the configured PVC name template. We got some issues related to this when releasing z2jh 4.0.0 because, as part of z2jh 4.0.0, we also changed that PVC name template, so KubeSpawner's automation couldn't predict the old names: the PVC name template had already been changed by the time KubeSpawner 7 launched for the first time.