
Use Rook Ceph for Jupyterhub and Conda Store drives


Reference Issues or PRs

Related to https://github.com/nebari-dev/nebari/issues/2534. This PR deploys the Rook operator and rook-cluster Helm charts, sets up StorageClasses backed by a Rook Ceph cluster, and uses those storage classes for what were previously the JupyterHub and conda-store NFS drives.

Developed on Azure at the moment.

What does this implement/fix?

Put an x in the boxes that apply

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds a feature)
  • [ ] Breaking change (fix or feature that would cause existing features not to work as expected)
  • [ ] Documentation Update
  • [ ] Code style update (formatting, renaming)
  • [ ] Refactoring (no functional changes, no API changes)
  • [ ] Build related changes
  • [ ] Other (please describe):

Testing

  • [ ] Did you test the pull request locally?
  • [ ] Did you add new tests?

Any other comments?

Adam-D-Lewis avatar Jun 25 '24 01:06 Adam-D-Lewis

  • See TODOs in code for what is still left.
  • Also, try to reduce storage node size.
  • Make using NFS or Ceph configurable in nebari-config.yaml
  • Test migration

Adam-D-Lewis avatar Jun 25 '24 21:06 Adam-D-Lewis

One issue I'm hitting is that the Ceph cluster deployment takes a long time, but Helm thinks it's done because it only creates a CephCluster object (which is quick); the Ceph operator then handles the longer task of creating a bunch of Rook resources in successive steps.

What this means in practice right now is that the deploy step times out, and Ceph still isn't ready until 5-10 minutes after the timeout.

I'm not sure of the best way to address this. Maybe some sort of readiness check (in Kubernetes or in Python) can be added, or we could increase the timeout, or deploy Ceph in an earlier Nebari stage to give the Ceph cluster extra time to get ready.
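
For illustration, a readiness check along these lines could poll the CephCluster resource until the operator reports it ready. This is only a sketch using the Kubernetes Python client; the resource name, namespace, and timeout are assumptions, not necessarily what the Helm charts produce.

import time

from kubernetes import client, config


def wait_for_ceph_cluster(name="rook-ceph", namespace="rook-ceph", timeout=1200):
    """Block until the CephCluster reports phase Ready or the timeout expires."""
    config.load_kube_config()
    api = client.CustomObjectsApi()
    deadline = time.time() + timeout
    while time.time() < deadline:
        cluster = api.get_namespaced_custom_object(
            group="ceph.rook.io",
            version="v1",
            namespace=namespace,
            plural="cephclusters",
            name=name,
        )
        phase = cluster.get("status", {}).get("phase")
        if phase == "Ready":
            return
        time.sleep(15)
    raise TimeoutError(f"CephCluster {namespace}/{name} was not Ready after {timeout}s")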

Adam-D-Lewis avatar Jun 28 '24 16:06 Adam-D-Lewis

Currently, there are 3 storage nodes, each requiring about 1 CPU / 2-6 GB of RAM. On Azure, I added three 2 CPU / 8 GB RAM nodes for storage, adding $210/month to the operating cost of Nebari (on top of the pre-existing $280/month). We may be able to use a single node by using cluster-test.yaml from here, but we likely won't get the same performance speedup and would lose the redundancy benefit that a more typical multi-node Ceph deployment provides.

Adam-D-Lewis avatar Jun 28 '24 17:06 Adam-D-Lewis

Because there isn't a pressing need for this, and the benefits of using Ceph seem to come only with significantly increased cost, I'm planning to put this on the back burner for the moment.

Questions we need to resolve:

  • Can you run Ceph without high availability and disaster recovery on a single server, so the cost is not increased beyond current costs? Is there a benefit in doing so? Speeds may no longer be faster with a single node. You can consolidate block, object, and file system storage under Ceph, so maybe that's helpful.
  • Do we want to make Ceph configurable (support NFS and Ceph) or only allow Ceph from now on?

Adam-D-Lewis avatar Jun 28 '24 21:06 Adam-D-Lewis

I think we can make Ceph the first stage after the k8s cluster (so stage 3?) and we can see if that is enough.

As for costs, we could make single-node Ceph the default and add an HA flag in the config to move to a multi-node, HA setup. That way we still get the abstracted storage interface, but not the increased cost.

@Adam-D-Lewis can you test with a single node ceph and see how it runs?

dcmcand avatar Jul 02 '24 21:07 dcmcand

@dcmcand Yeah, I'll test with a single node

Adam-D-Lewis avatar Jul 02 '24 21:07 Adam-D-Lewis

While single-node Ceph seems to be working in my limited use, I'm not sure I'd be comfortable making it the default initially, since a single-node Ceph deployment is not recommended for production use in the Rook docs. One thing I want to check is that we understand what happens if the general node is restarted. I also think we should use it in a test deployment for a while to ensure it's working as expected.

Currently, we use EFS on AWS deployments and NFS on all other deployments. I suggest we add a config option to specify whether we want to use EFS, NFS, or Ceph for the JupyterLab user and shared directories and for the conda-store shared directories. Perhaps we add an option under the current storage section, as shown below.

Currently the storage section looks like the following by default.

storage:
  conda_store: 200Gi
  shared_filesystem: 200Gi

and I think we should change it to the following.

storage:
  type: "NFS"  # EFS, NFS, or Ceph
  conda_store: 200Gi
  shared_filesystem: 200Gi

or possibly the more flexible (though not preferred in my opinion)

storage:
  conda_store:
    type: "NFS"  # EFS, NFS, or Ceph
    size: 200Gi
  shared_filesystem:
    type: "NFS"  # EFS, NFS, or Ceph
    size: 200Gi

Ensuring that users can switch between NFS and Ceph will likely be more work, which is a pain. An alternative would be for me to make a branch of Nebari in which Ceph replaces NFS; we would test that branch for some time and then make the switch, with upgrade instructions, once we feel comfortable with it.
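
For reference, here is a rough sketch of how the preferred shape could be modeled in a pydantic schema like the ones Nebari uses for its config; the class and field names are illustrative, not the actual schema.

from enum import Enum

from pydantic import BaseModel


class StorageType(str, Enum):
    nfs = "NFS"
    efs = "EFS"
    ceph = "Ceph"


class Storage(BaseModel):
    # NFS stays the default so existing deployments are unaffected
    type: StorageType = StorageType.nfs
    conda_store: str = "200Gi"
    shared_filesystem: str = "200Gi"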

Adam-D-Lewis avatar Jul 15 '24 15:07 Adam-D-Lewis

Okay, so my plan for this PR is to modify the nebari-config.yaml as shown here

storage:
  type: "Ceph"  # NFS (default) or Ceph
    storage_class_name: "default based on provider used"
  conda_store: 200Gi
  shared_filesystem: 200Gi

Then you can choose Ceph or NFS on the initial deployment. I'll try to find a way to throw a warning if people try to change the value afterwards, or I'll disallow it and say you have to do a new deployment to change that value. Ceph probably won't be available on AWS at first, since we use EFS on AWS rather than our own NFS server. I should be able to make Ceph work on GCP, Azure, and existing deployments.
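
One possible shape for that warning, purely as a sketch: it assumes the previously deployed storage type could be read back from a ConfigMap written at deploy time, which is not something Nebari does today.

import warnings

from kubernetes import client, config


def warn_if_storage_type_changed(requested: str, namespace: str) -> None:
    """Warn when storage.type differs from the value recorded at the last deploy."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    try:
        # Hypothetical ConfigMap holding the previously deployed storage type
        cm = v1.read_namespaced_config_map("nebari-storage-state", namespace)
    except client.exceptions.ApiException as exc:
        if exc.status == 404:
            return  # first deployment, nothing to compare against
        raise
    previous = (cm.data or {}).get("storage_type")
    if previous and previous != requested:
        warnings.warn(
            f"storage.type changed from {previous!r} to {requested!r}; "
            "switching storage backends on an existing deployment will lose data."
        )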

Note: Rook should not be run on the local provider, as explained in the docs.

We could likely make Rook run locally (e.g. for use in a VM), but per the Rook docs we'd also need one of the following:

  • Raw devices (no partitions or formatted filesystem)
  • Raw partitions (no formatted filesystem)
  • LVM Logical Volumes (no formatted filesystem)
  • Encrypted devices (no formatted filesystem)
  • Multipath devices (no formatted filesystem)
  • Persistent Volumes available from a storage class in block mode

I don't think we can count on most Nebari developers having a raw device or partition available for use. We could create a loopback device as explained here and use it, but it seems acceptable not to support Ceph in the local cluster for now.

Adam-D-Lewis avatar Jul 17 '24 15:07 Adam-D-Lewis

I'm working on getting this deployed on GKE. I had to add

apiVersion: v1
kind: ResourceQuota
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
  name: rook-critical-pods
  namespace: rook-ceph
spec:
  hard:
    pods: 1G
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - system-node-critical
      - system-cluster-critical

That helped get some pods deployed, but now I'm facing a new issue: the Ceph PVCs won't mount on the JupyterHub and Dask Gateway pods. Trying to resolve. Update: I believe https://rook.io/docs/rook/latest-release/Getting-Started/Prerequisites/prerequisites/#rbd is the solution.

Adam-D-Lewis avatar Jul 18 '24 18:07 Adam-D-Lewis

Okay, so everything deploys, but the conda-store workers aren't working correctly yet, so I'll tackle that next.

Adam-D-Lewis avatar Jul 18 '24 22:07 Adam-D-Lewis

Conda envs are kind of working now. I noticed spinning up a new JupyterLab server took quite a bit longer, but it seemed to be an issue only right after the cluster was first created.

Also, the Ceph cluster failed to deploy correctly the first time, I think due to a race condition in the order in which resources were deployed. I may need to spend some time ensuring Terraform resources are created in the correct order.

Adam-D-Lewis avatar Jul 19 '24 05:07 Adam-D-Lewis

Things are looking good on the deployment.

  • [x] I should test deploying NFS and make sure that still works.
  • [ ] I should test deploying on Azure and make sure it still works.

nebari destroy needs a bit more work, however. The problem is that the Ceph operator pod gets deleted before it has finished deleting the resources shown below, which caused deletion of the rook-ceph namespace to lag a bit. After a few minutes the namespace and those resources were deleted anyway, so nebari destroy was still successful.

NAME                                DATA   AGE
configmap/rook-ceph-mon-endpoints   5      16m
NAME                   TYPE                 DATA   AGE
secret/rook-ceph-mon   kubernetes.io/rook   4      16m
NAME                                 DATADIRHOSTPATH   MONCOUNT   AGE   PHASE      MESSAGE                    HEALTH      EXTERNAL   FSID
cephcluster.ceph.rook.io/rook-ceph   /var/lib/rook     1          17m   Deleting   Deleting the CephCluster   HEALTH_OK              c53dbc2f-c8c2-42e0-85aa-221ca7133b57
NAME                                          ACTIVEMDS   AGE   PHASE
cephfilesystem.ceph.rook.io/ceph-filesystem   1           17m   Ready
NAME                                                            PHASE   FILESYSTEM        QUOTA   AGE
cephfilesystemsubvolumegroup.ceph.rook.io/ceph-filesystem-csi   Ready   ceph-filesystem           17m

Update: To solve this, I've decided to leave the operator running whether you use NFS or Ceph. It is a single container and shouldn't require much overhead when not using Ceph.

Adam-D-Lewis avatar Jul 22 '24 22:07 Adam-D-Lewis

I'm currently hitting an issue.

(dependencies on the left)
module.jupyterhub-nfs-mount -> module.jupyterhub
module.jupyterhub-cephfs-mount -> module.jupyterhub

When I switch from NFS to CephFS (or vice versa), the PVC created in the nfs-mount and cephfs-mount modules can't be destroyed because it is still used by the hub pod from the jupyterhub module.

The solution is to trigger some of the jupyterhub module resources to be recreated when the nfs-mount and cephfs-mount modules change, perhaps by using replace_triggered_by.

Adam-D-Lewis avatar Jul 24 '24 21:07 Adam-D-Lewis

Going from NFS to Ceph works. When going from Ceph to NFS, however, the nebari-conda-store-storage PVC is created but won't bind until a consumer of the PVC (conda-store-worker) exists, and conda-store-worker is waiting for the PVC to finish being created before it will recreate itself with the different settings. If we trigger based on conda-store-fs instead, I think it'll work, though it's not quite as robust.

Update: I think the latest commit fixes this. I'm going to focus on making Ceph run on AWS now.

Adam-D-Lewis avatar Jul 26 '24 16:07 Adam-D-Lewis

I think I've made all the changes necessary for ceph to run on AWS, but I haven't tested it.

Adam-D-Lewis avatar Jul 26 '24 22:07 Adam-D-Lewis

Remaining issues:

  • [x] Restart the general node and see what happens
    • Update: the files and conda envs persisted even when the node went down, and things started back up normally once it came back up.
  • [ ] nebari destroy fails because the operator is killed before it has finished cleaning up the resources
  • [x] Clean up the code a bit (remove comments, etc.)

Adam-D-Lewis avatar Jul 30 '24 17:07 Adam-D-Lewis

This PR is ready for review. You can do a deployment by adding storage.type = cephfs to the nebari config file (as shown below). You shouldn't switch an existing Nebari deployment from nfs (the default) to cephfs, because you'll lose all your files and conda envs. I'll write up a docs PR describing this as experimental. I think we should start using this on Quansight's deployment (after we do a backup) to get a feel for it, and eventually make it the default if all goes well.

storage:
  type: cephfs

Adam-D-Lewis avatar Jul 31 '24 23:07 Adam-D-Lewis

@Adam-D-Lewis did you workout the destroy bits?

viniciusdc avatar Aug 01 '24 14:08 viniciusdc

@Adam-D-Lewis did you workout the destroy bits?

No, not yet. The destroy still completes, but it takes longer at the moment and prints a message saying stage 07 had problems destroying; only stages 01 and 02 are usually necessary for a successful destroy, though.

Adam-D-Lewis avatar Aug 01 '24 14:08 Adam-D-Lewis

Update: Fixed. ~~I deployed on Azure today and it timed out. Redeploying resulted in a successful deployment.~~ I'll work on fixing that, but I'd appreciate an initial review even before then.

Adam-D-Lewis avatar Aug 01 '24 23:08 Adam-D-Lewis

Docs PR - https://github.com/nebari-dev/nebari-docs/pull/493 deploy-preview-493--nebari-docs.netlify.app

Adam-D-Lewis avatar Aug 02 '24 21:08 Adam-D-Lewis

I attempted to just have the operator wait before it was destroyed with a local-exec provisioner, but that didn't work. There are all kinds of rook-ceph protections stopping the filesystem from being deleted. Attempting to delete the CephCluster results in a message along the lines of "failed to reconcile CephCluster "prod/prod". failed to clean up CephCluster "prod/prod": failed until dependent objects deleted CephFileSystemSubVolumeGroup and CephFileSystem."

The only way I could delete the CephFileSystemSubVolumeGroup and CephFileSystem was through manually deleting the finalizers on them. I tried adding the rook.io/force-deletion="true" annotation, but that didn't work. I also tried setting the cephClusterSpec.cephConfig.cleanupPolicy.confirmation = "yes-really-destroy-data" in the cephcluster helm chart values, but that didn't seem to have an effect either.

Even after deleting the CephFileSystemSubVolumeGroup and CephFileSystem via the finalizers, the Cephcluster still won't delete. The following logs are in the operator. "failed to reconcile CephCluster "prod/prod". failed to clean up CephCluster "prod/prod": failed to check if volumes exist for CephCluster in namespace "prod": waiting for csi volume attachments in cluster "prod" to be cleaned up".

Overriding the CephCluster finalizers also deleted it. So, worst case, we could override the finalizers on those 3 objects in the destroy step of Nebari stage 07, and then nebari destroy would succeed. It's a bit hacky, but it's unlikely to cause a problem since we would only do so when we are deleting the cluster anyway.
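
For reference, that worst-case workaround could look roughly like the following in a stage 07 destroy hook. This is a sketch with the Kubernetes Python client; the object names come from the listing above and the namespace is illustrative.

from kubernetes import client, config

# Objects that were stuck during destroy, per the listing above (names illustrative)
STUCK_OBJECTS = [
    ("cephfilesystemsubvolumegroups", "ceph-filesystem-csi"),
    ("cephfilesystems", "ceph-filesystem"),
    ("cephclusters", "rook-ceph"),
]


def strip_rook_finalizers(namespace: str = "prod") -> None:
    """Clear metadata.finalizers so the stuck Rook objects can be deleted."""
    config.load_kube_config()
    api = client.CustomObjectsApi()
    for plural, name in STUCK_OBJECTS:
        # JSON merge patch that removes the finalizers list entirely
        api.patch_namespaced_custom_object(
            group="ceph.rook.io",
            version="v1",
            namespace=namespace,
            plural=plural,
            name=name,
            body={"metadata": {"finalizers": None}},
        )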

My view is still that, while this is not ideal, we should merge it in, start testing Rook Ceph, and fix this issue before moving Ceph from experimental to general availability.

Adam-D-Lewis avatar Aug 05 '24 22:08 Adam-D-Lewis

@Adam-D-Lewis can you merge the latest develop in?

dcmcand avatar Aug 19 '24 11:08 dcmcand

Thanks for the review @dcmcand. I responded to your questions below.

@Adam-D-Lewis I think this is great work.

Some questions though:

  1. Are you planning on a HA multiple ceph node option for this in the future? Is there a follow on ticket for that?

This PR provides feature parity with the existing NFS storage state in Nebari. I'm happy to work on allowing the use of additional nodes to Ceph assuming that work remains in scope for JATIC. I have created a follow on ticket to make the number of RookCeph backing nodes variable - https://github.com/nebari-dev/nebari/issues/2573. I assumed we would validate that Ceph storage works for a single node and then make it the default for future deployments, and only after that point would we work on increasing the number of rook/ceph nodes.

  2. Is there a ticket for moving node storage to ceph?

I don't have a ticket for this. I'm not sure how we would do this, and I can't think of any strong benefits. Can you explain your thoughts further and/or what you're looking for here?

  3. It appears this is using dynamic pvc provisioning for Ceph which would allow expanding conda-store volumes in the future, correct?

This does use dynamic provisioning of PVs since PVCs are provisioned from storage classes. I believe the PVC should be able to increase in size without data loss, but I'll test and get back to you. I'm not sure how increasing the size of the disks that back the RookCeph storage would work. I'll look into that a bit more and get back to you as well.

UPDATE: I tested increasing the shared directory and the conda storage sizes on a GCP deployment. The PVC for the disk that backs the RookCeph storage was increased without a problem (verified in GCP Console as well) and the PVCs for the shared directory and conda env storage also increased without a problem.
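
For anyone reproducing that test, the resize itself is just a patch to the PVC's storage request and only works when the backing StorageClass allows volume expansion. A minimal sketch with the Kubernetes Python client (claim name, namespace, and size are illustrative):

from kubernetes import client, config


def expand_pvc(name: str, namespace: str, new_size: str) -> None:
    """Grow a PVC in place; works when the backing StorageClass allows expansion."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    patch = {"spec": {"resources": {"requests": {"storage": new_size}}}}
    v1.patch_namespaced_persistent_volume_claim(name, namespace, patch)


# e.g. expand_pvc("nebari-conda-store-storage", "dev", "300Gi")  # names are illustrative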

  4. I didn't see any validation or data loss warning for changing an existing nfs deployment to ceph or vice versa. I think we will definitely need that, is there a follow on ticket for that?

Currently, the only warning is in the documentation PR. We don't have a way to recognize changes to the nebari config file during a redeploy which makes this difficult. I did open this ticket which explains the issue further and proposes a way to add this ability. I wasn't originally planning on working on that before this PR is merged, but I'd be happy to if you think it's needed. As a workaround we could require user confirmation on any deploy which has cephfs set as the storage, but I think it's fair to just put the warning in the docs as well as listing it in the docs as an alpha feature.

  5. Let's spend some time thinking about how to migrate from nfs to ceph and back

I think it's fair to not support that transition, at least initially while we prove that RookCeph is going to work well. So in that case, users would only be expected to make the decision on the initial deployment and not change it afterwards. I do think we should run this on the Quansight deployment to help test it though. In a case like that, I would tell an admin to copy all the files in the NFS drives, tar them and copy to cloud storage. Then copy them to the new disk and untar them after we redeploy similar to what we do in the manual backup docs. We should test this before doing it (particularly for the conda-store NFS drive), but I think it should work fine.

  6. Let's spend some time thinking about how to use this same design pattern for nfs with dynamic provisioning, etc

Can you elaborate here? What else are you looking for?

Adam-D-Lewis avatar Aug 19 '24 16:08 Adam-D-Lewis

  1. Are you planning on a HA multiple ceph node option for this in the future? Is there a follow on ticket for that?

This PR provides feature parity with the existing NFS storage state in Nebari. I'm happy to work on allowing the use of additional nodes to Ceph assuming that work remains in scope for JATIC. I have created a follow on ticket to make the number of RookCeph backing nodes variable - #2573. I assumed we would validate that Ceph storage works for a single node and then make it the default for future deployments, and only after that point would we work on increasing the number of rook/ceph nodes.

:+1:

  2. Is there a ticket for moving node storage to ceph?

I don't have a ticket for this. I'm not sure how we would do this, and I can't think of any strong benefits. Can you explain your thoughts further and/or what you're looking for here?

~~The benefit would be allowing the general node to spin up in a different AZ to allow multi-AZ deployments. That has caused a number of folks errors when they upgraded.~~ I misspoke when I said node storage, because I was remembering the issue incorrectly. Basically any PVCs made by pods in the cluster would need to move to Ceph. Right now we have at least one PVC against EBS, which can't cross AZs.

  3. It appears this is using dynamic pvc provisioning for Ceph which would allow expanding conda-store volumes in the future, correct?

This does use dynamic provisioning of PVs since PVCs are provisioned from storage classes. I believe the PVC should be able to increase in size without data loss, but I'll test and get back to you. I'm not sure how increasing the size of the disks that back the RookCeph storage would work. I'll look into that a bit more and get back to you as well.

:+1:

UPDATE: I tested increasing the shared directory and the conda storage sizes on a GCP deployment. The PVC for the disk that backs the RookCeph storage was increased without a problem (verified in GCP Console as well) and the PVCs for the shared directory and conda env storage also increased without a problem.

  4. I didn't see any validation or data loss warning for changing an existing nfs deployment to ceph or vice versa. I think we will definitely need that, is there a follow on ticket for that?

Currently, the only warning is in the documentation PR. We don't have a way to recognize changes to the nebari config file during a redeploy which makes this difficult. I did open this ticket which explains the issue further and proposes a way to add this ability. I wasn't originally planning on working on that before this PR is merged, but I'd be happy to if you think it's needed. As a workaround we could require user confirmation on any deploy which has cephfs set as the storage, but I think it's fair to just put the warning in the docs as well as listing it in the docs as an alpha feature.

I think that is fine for an alpha feature, but given the risk of user data loss, I think this is a problem we need to solve before we go mainstream with Ceph, either by figuring out how to migrate or by blocking switching.

  5. Let's spend some time thinking about how to migrate from nfs to ceph and back

I think it's fair to not support that transition, at least initially while we prove that RookCeph is going to work well. So in that case, users would only be expected to make the decision on the initial deployment and not change it afterwards. I do think we should run this on the Quansight deployment to help test it though. In a case like that, I would tell an admin to copy all the files in the NFS drives, tar them and copy to cloud storage. Then copy them to the new disk and untar them after we redeploy similar to what we do in the manual backup docs. We should test this before doing it (particularly for the conda-store NFS drive), but I think it should work fine.

I think that is fair now, but I would prefer to have a switching path in the future, even if it is to use backup and restore.

  6. Let's spend some time thinking about how to use this same design pattern for nfs with dynamic provisioning, etc

Can you elaborate here? What else are you looking for?

Basically what I talked about in the meeting: removing dedicated PVs and instead creating an NFS storage class that dynamically provisions for each cloud provider. That way NFS would be expandable for all providers in the future.

dcmcand avatar Aug 20 '24 11:08 dcmcand

  2. Is there a ticket for moving node storage to ceph?

I don't have a ticket for this. I'm not sure how we would do this, and I can't think of any strong benefits. Can you explain your thoughts further and/or what you're looking for here?

~~The benefit would be allowing the general node to spin up in a different AZ to allow multi-AZ deployments. That has caused a number of folks errors when they upgraded.~~ I misspoke when I said node storage, because I was remembering the issue incorrectly. Basically any PVCs made by pods in the cluster would need to move to Ceph. Right now we have at least one PVC against EBS, which can't cross AZs.

Here is the issue about moving other PVCs to use ceph for storage.

  5. Let's spend some time thinking about how to migrate from nfs to ceph and back

I think it's fair to not support that transition, at least initially while we prove that RookCeph is going to work well. So in that case, users would only be expected to make the decision on the initial deployment and not change it afterwards. I do think we should run this on the Quansight deployment to help test it though. In a case like that, I would tell an admin to copy all the files in the NFS drives, tar them and copy to cloud storage. Then copy them to the new disk and untar them after we redeploy similar to what we do in the manual backup docs. We should test this before doing it (particularly for the conda-store NFS drive), but I think it should work fine.

I think that is fair now, but I would prefer to have a switching path in the future, even if it is to use backup and restore.

Yeah, I'm happy for backup and restore to be the switching plan. I'll test this out after this is merged since I'll want us to test using Rook Ceph on the Quansight Nebari deployment after this PR is merged.

  6. Let's spend some time thinking about how to use this same design pattern for nfs with dynamic provisioning, etc

Can you elaborate here? What else are you looking for?

Basically what I talked about in the meeting: removing dedicated PVs and instead creating an NFS storage class that dynamically provisions for each cloud provider. That way NFS would be expandable for all providers in the future.

I doubt it would be very hard to make that change, and I'd be happy to comment on an issue with more thoughts.

Adam-D-Lewis avatar Aug 20 '24 16:08 Adam-D-Lewis

Thanks again for the review, @dcmcand. Is there anything else you want to see before this is merged?

Adam-D-Lewis avatar Aug 20 '24 16:08 Adam-D-Lewis

The failing test in the Local Integration Tests has been seen in other PRs. @viniciusdc is working on a fix, and it is unrelated to this PR, so I will merge this PR.

Adam-D-Lewis avatar Aug 21 '24 16:08 Adam-D-Lewis