
Conda store worker scaling issues

Open Adam-D-Lewis opened this issue 1 year ago • 5 comments

Context

We'd like to scale conda store workers up to allow many solves at the same time. See here for more info. However, the conda store worker pod also acts as an NFS server. I believe this is because writing the environment files after the solve would be slow over the network (to an NFS drive), so we co-locate the NFS server with the conda-store worker and files aren't sent over the network when saving. This, however, limits conda store worker scalability beyond a single node (the node the NFS server is on).

Options:

  1. We could separate the conda store workers from the NFS server, but the conda store workers then become NFS clients and have to write many small files (~40k to ~150k files per env) over the network. This is slow and likely adds on the order of 10 minutes to the conda store env creation process (I want to benchmark this, but haven't yet; a rough benchmark sketch is included after this list), and it likely won't scale with many concurrent conda solves anyway. With this option, we'd also want to try to improve NFS performance. This guide or this guide may have some useful tips.

  2. We could try a distributed file system like CephFS, but it would come with a learning curve and additional operational complexity to deal with.

  3. We could tar the environments, write them to object storage, and have some logic to pull them from object storage when a user boots up a pod. However, this would require significant changes to our current process, so it would be a lot of work, and it may have additional issues we haven't considered yet (making sure users get new envs without restarting their pod, etc.).

  4. We could try EFS, Google Filestore, and Azure Files (not sure if DO has an NFS equivalent), but that requires more maintenance since it's a separate solution for each cloud provider. Those are also NFS or SMB drives anyway, so unless the cloud providers have optimized them somehow (very possible), they may not be any better than what we currently have.

  5. I'm really not sure if this is a good idea, but I'll throw it out there anyway. We could mount object storage as a filesystem: store the conda envs in object storage (not zipped or tarred) and use it like local file storage. Again, I'm not sure what drawbacks this might have. Latency is high on object storage, but maybe that's okay if it's just loading the Python binary and installed libraries from there. Not sure. Because latency is high, we may have the same problems as with the NFS drive though (slow writing of environment files after the solve). FYI, I see that AWS S3 latency is ~100-200 milliseconds if we want to compare with our NFS performance later.
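
Regarding the benchmark mentioned in option 1, here's a minimal sketch of the sort of small-file write micro-benchmark I have in mind, to be run once against local disk and once against the NFS mount. The paths, file count, and file size are placeholders.

```python
# Hypothetical micro-benchmark: time writing many small files to a target
# directory (run once against local disk, once against the NFS mount).
# Paths, file count, and file size are made up for illustration.
import os
import sys
import time

def write_small_files(target_dir: str, n_files: int = 40_000, size: int = 1024) -> float:
    """Write n_files files of `size` bytes each and return elapsed seconds."""
    os.makedirs(target_dir, exist_ok=True)
    payload = os.urandom(size)
    start = time.perf_counter()
    for i in range(n_files):
        with open(os.path.join(target_dir, f"f{i:06d}"), "wb") as fh:
            fh.write(payload)
    return time.perf_counter() - start

if __name__ == "__main__":
    target = sys.argv[1]  # e.g. /home/shared/bench (NFS) or /tmp/bench (local disk)
    elapsed = write_small_files(target)
    print(f"{target}: {elapsed:.1f}s for 40k x 1KiB files")
```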

Anything else?

I looked at how large envs were and how many files they consisted of on a Nebari deployment, and found the following. [image: histogram of file count per env] You can see that many envs have ~10k files, but it's not uncommon to have ~100k and some even have ~140k files.


[image: histogram of total env size] Many envs are <1 GiB in total size, most envs are <10 GiB, but some are >20 GiB in total size.


[image: distribution of individual file sizes within envs] The x axis is log2 of file size in bytes, so 10, 20, and 30 correspond to 1 KiB, 1 MiB, and 1 GiB. This graph shows the number of individual files vs. file size. The distribution is roughly lognormal with the peak at 1 KiB and few files >1 MiB in size.
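
For reference, stats like these can be gathered with a small script along the following lines; the env root path here is an assumption and will differ per deployment.

```python
# Sketch: per-env file counts and total sizes.
import os
from pathlib import Path

ENV_ROOT = Path("/opt/conda/envs")  # assumed location of the built envs

for env_dir in sorted(p for p in ENV_ROOT.iterdir() if p.is_dir()):
    n_files = 0
    total_bytes = 0
    for root, _dirs, files in os.walk(env_dir):
        for name in files:
            fp = os.path.join(root, name)
            try:
                total_bytes += os.path.getsize(fp)
            except OSError:
                continue  # skip e.g. dangling symlinks
            n_files += 1
    print(f"{env_dir.name}: {n_files} files, {total_bytes / 2**30:.1f} GiB")
```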

Adam-D-Lewis avatar Jun 11 '24 15:06 Adam-D-Lewis

We used to use EFS including the high performance EFS. This still had the same r/w slowdown issues.

CephFS might be worth looking into. We have had clients that have used it before and it is a mature solution.

dharhas avatar Jun 11 '24 15:06 dharhas

Older discussion on the topic: https://github.com/orgs/nebari-dev/discussions/1150 Chris recommends using docker images or tarballs of the environments.

Docker Images

I'm not sure how docker images would work. Every conda store env would need to include jupyterlab and any desired jupyterlab extensions, I guess? It'd also need to be compatible with whatever version of jupyterhub we have running. Then you'd only have 1 environment available to you in your user pod? You'd have to select the conda env you want to boot up along with the instance size?

Tarballs

I think tarballs would be similar to option 3 in my original description. I'll throw out some possible implementation details here. You run the conda store worker as a sidecar container to the jupyterlab user server, and it is notified anytime an environment pertaining to that user (or the user's groups) is created or modified, or when the active env changes, as already happens with conda-store-workers. The sidecar container would tar and upload the saved environment to conda store and update the symlinks, similar to what the conda-store worker already does. I think the only other change would be that on startup, the conda-store worker would need to download all the tarballs from conda store and unpack them in the appropriate locations. There is the potential for temporary weird file system bugs (which would be fixed when you boot up a new jupyterlab user pod) since there isn't a shared file system anymore. I wonder if we could display kernels and only download and unpack them when the user goes to use them (e.g. opens a notebook, runs conda activate, etc.).
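
To make the sidecar's fetch-and-activate step a bit more concrete, here's a rough sketch assuming an S3-compatible object store (e.g. the MinIO we already deploy); the bucket/key names and on-disk layout are made up for illustration and not the real conda-store layout.

```python
# Rough sketch of the sidecar's "fetch and activate" step.
import os
import tarfile
import tempfile

import boto3

def fetch_and_link(bucket: str, key: str, envs_dir: str, env_name: str) -> None:
    s3 = boto3.client("s3")
    with tempfile.NamedTemporaryFile(suffix=".tar.gz") as tmp:
        s3.download_file(bucket, key, tmp.name)          # pull the env tarball
        build_dir = os.path.join(envs_dir, ".builds", key.replace("/", "_"))
        os.makedirs(build_dir, exist_ok=True)
        with tarfile.open(tmp.name) as tar:
            tar.extractall(build_dir)                    # unpack into a per-build dir
    link = os.path.join(envs_dir, env_name)
    tmp_link = link + ".tmp"
    os.symlink(build_dir, tmp_link)
    os.replace(tmp_link, link)                           # atomically point the env name at the new build
```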

This conda-store worker sidecar introduces other issues. The user pod would need to be big enough to build that particular conda env, and it would be a hassle to spin up a larger server just to do a conda build and then spin up your smaller server again. Maybe this memory issue is solved if every user pod has access to some high-speed disk that can be used as swap, as suggested here, but we'd need to test how much RAM conda solves use and how long they take with and without swap to get a better idea of how big an issue this is and whether swap solves it satisfactorily.

It might also introduce some security concerns since the conda store worker pod could modify the tarball prior to upload and affect other users (in the case of a group environment), but I'm not sure this is any worse than what we currently have. Users still won't have write access to that directory, so they can't modify it prior to upload, and we already require trust in the admin who chooses which docker image to use.

Adam-D-Lewis avatar Jun 11 '24 17:06 Adam-D-Lewis

I have run a ceph cluster in the past, and it is not trivial. However, as @dharhas says, it is a very mature solution. It looks like https://rook.io/ provides an easy onramp for ceph on k8s. Looking at https://www.digitalocean.com/community/tutorials/how-to-set-up-a-ceph-cluster-within-kubernetes-using-rook#step-3-adding-block-storage it seems like that could be a good way to go to improve our storage within nebari.
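
For illustration, once Rook has provisioned CephFS, pods consume the shared storage through a StorageClass. Here's a minimal sketch using the kubernetes Python client; the StorageClass name (rook-cephfs), namespace, and size are assumptions taken from the Rook examples, not from anything in Nebari today.

```python
# Sketch: request a shared ReadWriteMany volume backed by a Rook-provisioned CephFS StorageClass.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="conda-store-shared"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],          # many workers/user pods mount it at once
        storage_class_name="rook-cephfs",        # assumed name from the Rook examples
        resources=client.V1ResourceRequirements(requests={"storage": "500Gi"}),
    ),
)
client.CoreV1Api().create_namespaced_persistent_volume_claim(namespace="dev", body=pvc)
```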

dcmcand avatar Jun 12 '24 15:06 dcmcand

It also allows object store mounting with an S3-compatible API, so we could replace MinIO here.

dcmcand avatar Jun 12 '24 16:06 dcmcand

I'm looking into Ceph a bit in this issue

Adam-D-Lewis avatar Jun 21 '24 20:06 Adam-D-Lewis

PR adding Ceph storage as an optional replacement for NFS: https://github.com/nebari-dev/nebari/pull/2541

Adam-D-Lewis avatar Aug 21 '24 14:08 Adam-D-Lewis

Some comments.

  1. QHub (i.e. before Nebari) used EFS on AWS, and we tried both the default and high-performance (extra IOPS) versions of EFS. It had performance issues similar to our current NFS system.
  2. Mounting object storage as a filesystem is likely not to work. S3 is not a POSIX-compliant filesystem and certain file actions just don't work, e.g. sqlite databases cannot be read directly from S3, etc.
  3. Docker / extracting from a tarball on the fly isn't a great option. It exchanges a painful build time for a painful startup time.

Going into 3 in more detail: JupyterHub was originally based on pulling docker images. This works fine, but at the cost of longer startup and environment-switching times. Many complex ML environments are 10-15 GB, and pulling an image that large takes a very long time. Additionally, adding new environments requires creating new containers and paying the cost of shutting down and starting the new containers. Adding the environments to a shared folder that is then mounted into the pod reduces pod startup time and also makes switching environments painless and instant. The tarball idea seems like it would take significant time to generate envs on startup - the conda-store worker would need to download all the tarballs from conda store and unpack them in the appropriate locations.

dharhas avatar Oct 31 '24 19:10 dharhas

  2. Mounting object storage as a filesystem is likely not to work. S3 is not a POSIX-compliant filesystem and certain file actions just don't work, e.g. sqlite databases cannot be read directly from S3, etc.

I agree with the above.

  3. Docker / extracting from a tarball on the fly isn't a great option. It exchanges a painful build time for a painful startup time.

Going into 3 in more detail: JupyterHub was originally based on pulling docker images. This works fine, but at the cost of longer startup and environment-switching times. Many complex ML environments are 10-15 GB, and pulling an image that large takes a very long time. Additionally, adding new environments requires creating new containers and paying the cost of shutting down and starting the new containers. Adding the environments to a shared folder that is then mounted into the pod reduces pod startup time and also makes switching environments painless and instant.

I want to clarify a few points about the proposed tarball implementation. Think of the idea as moving the conda store worker pod to a container that runs alongside each user's jupyterlab container. There is only a single conda store worker per user for all the envs, so we don't need to start up and shut down containers as new envs are created. The conda store worker would be responsible for running conda solves, and for downloading env tarballs from object storage as needed.

As I benchmarked above, some of our current envs are up to 20 GB in size so I agree we need to consider large envs in any solution. I'm not sure how large the envs are compressed though. I ran a test on an 8.6G env locally and it was 3.2GB compressed as a tar.zst file.
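
In case anyone wants to run a similar compression test, here's a sketch of producing a .tar.zst of an env with the zstandard Python package (env path, archive name, and compression level are placeholders, not necessarily what I used):

```python
# Sketch: tar an env directory and compress it with zstd in one streaming pass.
import tarfile

import zstandard as zstd

with open("analysis.tar.zst", "wb") as out:
    cctx = zstd.ZstdCompressor(level=10, threads=-1)     # threads=-1: use all cores
    with cctx.stream_writer(out) as writer:
        with tarfile.open(fileobj=writer, mode="w|") as tar:
            tar.add("/opt/conda/envs/analysis", arcname="analysis")
```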

The tarball idea seems like it would take significant time to generate envs on startup - the conda-store worker would need to download all the tarballs from conda store and unpack them in the appropriate locations.

I think a strategy where you only download the few most recently used environments at startup, rather than all of them, would likely be good enough. We would also download all the jupyter kernel spec files so that jupyterlab can identify which envs are available. We could then download and untar an environment on demand when the user tries to use it in a notebook or activate it in the terminal. This would cause a delay while the tarball is downloaded and uncompressed.

Assume a 20 GB env becomes a 10 GB tarball; at a network speed of 10 Gbps (maybe too optimistic?) it would take 8 seconds to download, and decompressing the tarball would take some time as well (~7 seconds in my test of a 3.2 GB .tar.zst file; also, zstd supports streaming decompression, so decompression could happen in parallel with the download). Not ideal, but that doesn't seem like a deal breaker necessarily. I think this is more scalable than a shared filesystem in the case of high usage, but startup time for an env you don't already have would be a potential drawback.
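
A sketch of what that streaming download-plus-decompress could look like with the zstandard package; the URL and destination path are placeholders:

```python
# Sketch: overlap download and decompression of a .tar.zst env tarball.
import tarfile

import requests
import zstandard as zstd

def stream_unpack(url: str, dest: str) -> None:
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        dctx = zstd.ZstdDecompressor()
        with dctx.stream_reader(resp.raw) as reader:        # decompress as bytes arrive
            with tarfile.open(fileobj=reader, mode="r|") as tar:
                tar.extractall(dest)                        # unpack while still downloading

stream_unpack("https://conda-store.example.com/envs/analysis.tar.zst", "/opt/conda/envs/analysis")
```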

I'm not sure it's the best approach, and maybe Ceph is good enough; it would also be quite a bit of work to make all these changes, but I think it could show promise.

Adam-D-Lewis avatar Oct 31 '24 20:10 Adam-D-Lewis

The issue is that it isn't just a case of unzipping a folder. conda-pack does a bunch of link rewriting inside binaries to make everything work. It might be worth experimenting with, especially since installing with something like pixi might be really fast on storage local to the pod.
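
For context, the conda-pack round trip is roughly: pack an existing env into an archive, then after extracting it on the target machine run the bundled conda-unpack script to do the prefix/link fix-up mentioned above. A sketch, with env names and paths as placeholders:

```python
# Sketch of the conda-pack pack/unpack cycle.
import subprocess
import tarfile

import conda_pack

# On the build side: produce a relocatable archive of an existing env.
conda_pack.pack(name="analysis", output="analysis.tar.gz")

# On the target side: extract, then fix up prefixes/links before first use.
with tarfile.open("analysis.tar.gz") as tar:
    tar.extractall("/opt/conda/envs/analysis")
subprocess.run(["/opt/conda/envs/analysis/bin/conda-unpack"], check=True)
```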

dharhas avatar Oct 31 '24 20:10 dharhas

I'll close this issue since we expect Ceph to address it for now.

Adam-D-Lewis avatar Nov 05 '24 16:11 Adam-D-Lewis