zero-to-jupyterhub-k8s

Z2JH 4.0.0 fails to detect old named server PVCs due to an overridden template in 3.3.8

Open · manics opened this issue 1 year ago · 17 comments

Bug description

Reported in https://discourse.jupyter.org/t/jupyterhub-helm-chart-4-0-0/29887/2

How to reproduce

  1. Deploy Z2JH 3.3.8 with persistent user storage
  2. Start and stop a named server
  3. Upgrade Z2JH to 4.0.0
  4. Start the named server; a new PVC is created instead of the old one being reused (see the check below)
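
One quick way to confirm the duplication is to list the user-storage PVCs; as far as I know KubeSpawner labels them with component: singleuser-storage (if not, a plain kubectl get pvc works too). Both names should show up:

    kubectl get pvc -l component=singleuser-storage
    # expect both claim-<user><server> (old, 3.3.8 template) and claim-<user>--<server> (new)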

Expected behaviour

Z2JH should use the existing named PVC

Actual behaviour

3.3.8 set pvcNameTemplate: claim-{username}{servername}, overriding the KubeSpawner 6.2.0 default of claim-{username}--{servername}:

https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/3.3.8/jupyterhub/values.yaml#L395
https://github.com/jupyterhub/kubespawner/blob/6.2.0/kubespawner/spawner.py#L569

4.0.0 uses the default from KubeSpawner:

https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/4.0.0/jupyterhub/values.yaml#L416

As a result the automatic legacy PVC detection in KubeSpawner fails, since it assumes the 6.2.0 default template was in use:

https://github.com/jupyterhub/kubespawner/blob/a8f28439078e42e8012de8f141b51bd6fa96d9c7/kubespawner/spawner.py#L3118

    [D 2024-11-15 14:05:10.037 JupyterHub spawner:3117] Checking for legacy-named pvc claim-test-40example-2eorg--named for test@example.org

instead of claim-test-40example-2eorgnamed.
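
To make the mismatch concrete, here is a small sketch of how the two templates render for user test@example.org with a named server "named", using the escapism escaping that KubeSpawner 6.x applied to template fields (illustrative only; the real expansion happens inside KubeSpawner):

    import string

    import escapism  # the escaping library KubeSpawner 6.x used

    # KubeSpawner 6.x escaped names down to lowercase alphanumerics, with '-' as
    # the escape character: '@' -> '-40', '.' -> '-2e', and so on.
    safe_chars = set(string.ascii_lowercase + string.digits)

    def escape(name):
        return escapism.escape(name, safe=safe_chars, escape_char="-").lower()

    username, servername = "test@example.org", "named"

    # Template shipped by Z2JH 3.3.8 (the override):
    print("claim-{}{}".format(escape(username), escape(servername)))
    # -> claim-test-40example-2eorgnamed (the PVC that actually exists)

    # KubeSpawner 6.2.0 default, which the legacy-PVC check in KubeSpawner 7 assumes:
    print("claim-{}--{}".format(escape(username), escape(servername)))
    # -> claim-test-40example-2eorg--named (the name the check looks for)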

manics avatar Nov 15 '24 14:11 manics

Short-term, a deployment can work around this by

  1. explicitly setting the old template in config
  2. starting and stopping all servers, which persists the pvc name in spawner state
  3. removing the old template config; since PVC names are persisted now, new users will use the new template while existing users will keep their PVCs

Alternatively, we could write a bulk-rename script to rename PVCs.
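
For step 1 of the workaround above, the override would look roughly like this in the chart config (a sketch, assuming dynamic per-user storage; slug_scheme: escape, documented in the 3-to-4 upgrade guide, keeps the pre-4.0.0 name escaping so the template renders the same names as before):

    hub:
      config:
        KubeSpawner:
          slug_scheme: escape
    singleuser:
      storage:
        dynamic:
          pvcNameTemplate: "claim-{username}{servername}"  # the template Z2JH 3.3.8 shipped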

minrk avatar Nov 19 '24 08:11 minrk

@minrk A bulk-rename script would work for our case. It could also support downgrading, if that's within scope. We may have to deal with servers that now have two PVCs associated with them.

aychang95 avatar Nov 20 '24 18:11 aychang95

Yes, servers that have already been started after the upgrade and are now associated with new pvcs are going to be the hardest, because we can't know if there's any data in the new pvcs that needs to be preserved, so the only safe option is to merge their contents, which we can't do automatically. But we could at least:

  1. safely rename every old pvc, and
  2. if any pvc created under the new name would result in a collision, rename the new one identifiably so that it can be manually inspected for merging

Or perhaps the simplest option is a script that iterates over existing PVCs and servers and pushes the PVC name into Spawner.state (i.e. effectively the same thing as preserving the old template, starting/stopping all servers, then updating the template).

minrk avatar Nov 21 '24 08:11 minrk

Here is a script that does the last suggestion: https://gist.github.com/minrk/7ad21b18f7a5b3908d74f108dc564cd5

It collects all pvcs and spawners, then associates spawners with their pvcs:

  1. if there's only one match, it persists the link in spawner.state (avoiding the above bug, as if past versions of KubeSpawner had implemented pvc_name persistence)
  2. if there's more than one, it picks the oldest match (this is what will happen for deployments that have already been upgraded, where the template mismatch caused the creation of a new, empty pvc)

It doesn't touch the pvcs themselves, just their association with the servers. Removing new pvcs or merging their contents is something that has to be handled on a case-by-case basis.

minrk avatar Nov 21 '24 08:11 minrk

Do you think we should put your script into an optional initContainer in this chart? Perhaps with a flag to control how to resolve multiple matching PVCs (oldest, newest, or return an error to force the admin to sort things out before starting JupyterHub).

manics avatar Nov 21 '24 10:11 manics

Yes, possibly. I'm also investigating whether this belongs in the legacy pvc check in kubespawner, because I think we can change that to consider labels and annotations as well, so it's not sensitive to template changes across the upgrade.

Either way, if we run this automatically for users, we will need to be careful to consider valid use cases of deliberate sharing of PVCs across servers, e.g.

  1. one pvc per user for all servers: pvc_name_template = '{username}'
  2. shared pvc across users for a given named server: pvc_name_template = '{servername}'
  3. one pvc for everyone (?!): pvc_name_template = 'shared'
  4. manually created PVCs out-of-band, may be missing labels and/or annotations which were not checked previously

whereas it is appropriate for the script above to only handle the default template case, since it is only run by hand, so admins have a chance to decide whether it is appropriate before it runs.
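
For reference, in this chart those templates are configured under singleuser.storage.dynamic; e.g. case 1 (one PVC per user, shared by all of that user's servers) would look something like:

    singleuser:
      storage:
        dynamic:
          pvcNameTemplate: "{username}"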

minrk avatar Nov 21 '24 10:11 minrk

Nice script!

Tried running cat collect_pvcs.py | kubectl exec -i $(kubectl get pod -l component=hub -o name) -- python3 - --apply, and got "Made # changes", but it doesn't seem to have actually applied the changes. If I run the command again, it continues to say "Made # changes". The servers continue to attach to the newest PVC, and a new PVC is created even if the old one exists.

aychang95 avatar Nov 22 '24 20:11 aychang95

Can you share more of the output of your script run (feel free to obfuscate usernames)? Does the output change at all from one run to the next?

minrk avatar Nov 24 '24 13:11 minrk

The output doesn't seem to change at all for either exec.

Running:

    cat collect_pvcs.py | kubectl exec -i $(kubectl get pod -l component=hub -o name) -- python3 | grep "dry run"

Output:

    !!!!! linking server andrew@example.com/vector-embeddings to pvc claim-andrew-40example-2ecomvector-2dembeddings (dry run) !!!!
    !!!!! linking server andrew@example.com/datasets-exploratory to pvc claim-andrew-40example-2ecomdatasets-2dexploratory (dry run) !!!!
    !!!!! linking server andrew@example.com/modeling to pvc claim-andrew-40example-2ecommodeling (dry run) !!!!
    This was a dry run, no changes were made.

Then we run:

    cat collect_pvcs.py | kubectl exec -i $(kubectl get pod -l component=hub -o name) -- python3 - --apply

Output:

    ...
    Made 3 changes

However, nothing actually seems to change: we get the same output if we run the same two commands again. The same dry-run lines appear without --apply, and applying always prints the same "Made # changes".

aychang95 avatar Nov 25 '24 18:11 aychang95

Sorry it doesn't seem to be working for you. Do you have JupyterHub running with active users, and is it running z2jh 4 or 3? Specifically, have any of the affected servers started or stopped?

I updated the script to log a bit more about what's going on. It shouldn't do anything different, but it might tell us why.

When I run the script for the first time with:

    cat collect_pvcs.py | kubectl exec -i $(kubectl get pod -l component=hub -o name) -- python3 - --apply

I get several 'linking server...' messages, but if I run the same command a second time, I get:

    username/ has pvc claim-username
      username/ is linked to only matching pvc claim-username (good)
    ...
    Nothing to do!
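
If it helps with debugging, one way to check whether anything was actually written is to look at the spawner state in the hub database directly (a sketch, assuming the default sqlite database at /srv/jupyterhub/jupyterhub.sqlite; once a link has been persisted, pvc_name should show up in the state):

    kubectl exec -i $(kubectl get pod -l component=hub -o name) -- python3 -c \
      'import json, sqlite3; db = sqlite3.connect("/srv/jupyterhub/jupyterhub.sqlite"); [print(n, json.loads(s or "{}").get("pvc_name")) for n, s in db.execute("SELECT name, state FROM spawners")]'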

minrk avatar Nov 26 '24 07:11 minrk

Happy new year!

Hmm, this script seems to have worked now! Not sure if there was a specific state change that triggered it, since it seems only logging was added, but legacy-named claims are now linked to their respective servers on --apply.

Thank you!

aychang95 avatar Jan 02 '25 13:01 aychang95

I'm still confused about how to upgrade. Are there any clear instructions posted?

dimm0 avatar Mar 04 '25 23:03 dimm0

I am now experiencing this issue, even after upgrading to version 4.1.0.

scoronado-usn avatar Mar 25 '25 15:03 scoronado-usn

I seem to have found the proper documentation to fix this.

Dropping this here in case another upgrader missed it: https://z2jh.jupyter.org/en/stable/administrator/upgrading/upgrade-3-to-4.html#kubespawner-7

    static:
      pvcName: "jupyterhub-user-home-pvc"
      subPath: "{escaped_username}"

scoronado-usn avatar Mar 25 '25 16:03 scoronado-usn

I can confirm this is an issue for some users, though not all. We also upgraded from 3.3.8 to 4.0.0, and for one of the users the old PVC was not recognized (not sure why) and a new one kept being created. I tried a bunch of configs that could not resolve it at all. We had to revert JupyterHub to 3.3.8 by uninstalling and reinstalling, since the migrations are not backwards compatible :(

These are the changes I have tried:

  1. As per the docs (https://z2jh.jupyter.org/en/stable/administrator/upgrading/upgrade-3-to-4.html#kubespawner-7) I used the following config. It was not able to start the pod, complaining about the PVC not being found; slug_scheme made no difference at all.

    hub:
      config:
        KubeSpawner:
          slug_scheme: escape
    singleuser:
      storage:
        type: static
        static:
          subPath: "{escaped_username}"
    

    The error we encountered:

    "status":"Failure","message":"Pod \"jupyter-username-40domain-2ecom\" is invalid: 
    [spec.volumes[0].persistentVolumeClaim.claimName: Required value, spec.containers[0].volumeMounts[0].name: Not found:
     \"home\"]","reason":"Invalid"
    
  2. Then I tried this config. Now at least the old PVC was recognized using pvcName: claim-{username}, but there was nothing in the home folder after login. The documentation mentions that the subPath config should fix this, but unfortunately it did not.

    hub:
      config:
        KubeSpawner:
          slug_scheme: escape
    singleuser:
      storage:
        type: static
        static:
          pvcName: claim-{username}
          subPath: "{escaped_username}"
    
  3. The third thing I tried: since the PVC type is supposed to be dynamic (the default in the previous version, I guess?), because each user has their own PVC, I updated to the config below. This also did not work.

    hub:
      config:
        KubeSpawner:
          slug_scheme: escape
    singleuser:
      storage:
        type: dynamic
        dynamic:
          subPath: "{escaped_username}"
          pvcNameTemplate: claim-{username}
    

Not sure what else to try, so we ended up reverting to the previous version by uninstalling and reinstalling JupyterHub using Helm.

mananpreetsingh avatar Jul 02 '25 14:07 mananpreetsingh

I am using the jupyterhub helm chart 4.1.0.

When I delete and redeploy the helm chart, the same PV does not get attached to the PVC.

When the helm chart is deleted, the PVC named claim-{username} also gets deleted, so the dynamically created PV enters the Released state. When redeploying, a new PVC is created which does not get attached to the same volume, since its state has changed from Bound to Released.

We need a configuration that keeps the PVC when the helm chart is deleted, so the PV doesn't enter the Released state.

I tried setting helm.sh/resource-policy: keep in extra_storage_annotations, but it is not respected.

Are there any other ways to prevent it from being deleted?
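
Not z2jh-specific, but the usual Kubernetes-level approach is to make the PV survive claim deletion and then let the recreated claim bind to it again; a sketch (PV names are placeholders, and note the per-user PVCs are created by KubeSpawner rather than Helm, which is probably why the helm.sh/resource-policy annotation doesn't help):

    # keep the volume when its claim is deleted (can also be set via the StorageClass reclaimPolicy)
    kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

    # after redeploying, drop the stale claimRef uid so the Released PV can be bound
    # again by the recreated claim-<username> PVC with the same name/namespace
    kubectl patch pv <pv-name> --type json -p '[{"op":"remove","path":"/spec/claimRef/uid"}]'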

Dinesh-4320 avatar Jul 16 '25 12:07 Dinesh-4320

Dumping a recollection on things.

I recall that there was automation in KubeSpawner to detect old PVC names, but it relied on reading the configured PVC name template. We got some issues related to this when releasing z2jh 4.0.0, because as part of z2jh 4.0.0 we also changed that PVC name template. As a result, KubeSpawner's automation couldn't predict the old names, since the template had just been changed when KubeSpawner 7 launched for the first time.

consideRatio avatar Oct 07 '25 18:10 consideRatio