
#1557 Add cuda tag prefix for pytorch-notebook

Open johanna-reiml-hpi opened this issue 1 year ago • 14 comments

Describe your changes

This reworks the build pipeline to allow building CUDA variants of the pytorch-notebook image. Example images: https://quay.io/repository/ai-hpi/pytorch-notebook

  • Commit "feat: build cuda variants of pytorch" (43f2d320): a minimal example of how to add CUDA variants of the pytorch-notebook image.
  • Commit "feat: build with variant tag" (6198575d): reworks the pipeline to build these images under the pytorch-notebook tag instead (adding a "cuda-" and "cuda11-" prefix). This logic is implemented by adding a new "variant" argument for the GitHub Actions workflows and the tagging scripts. I also refactored docker.yaml to remove the redundancy between the aarch64 and x86_64 builds.
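
As a rough sketch of the idea (the input names and descriptions here are illustrative, not necessarily the exact ones from the diff), a reusable workflow could expose the variant like this:

```yaml
# Hypothetical sketch of a "variant" input on a reusable workflow;
# names are illustrative, see the PR diff for the real definitions.
on:
  workflow_call:
    inputs:
      image:
        description: Image name, e.g. pytorch-notebook
        required: true
        type: string
      variant:
        description: Build variant ("default", "cuda", "cuda11"); non-default values become a tag prefix
        required: true
        type: string
      platform:
        description: Target platform, x86_64 or aarch64
        required: true
        type: string
```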

In the issue there were discussions about using the official CUDA images instead of ubuntu:22.04 as the root container. In my opinion this is not desirable, as those images are not updated frequently and are based on older ubuntu:22.04 releases. Instead, a simpler approach works: packages like PyTorch and TensorFlow bundle their own CUDA binaries via pip these days.
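
As a minimal sketch of that approach (the parent image below is an example, not necessarily the one used in this PR; the cu121 index URL is the one PyTorch's install guide documents for CUDA 12.1 wheels):

```dockerfile
# Sketch: CUDA via pip-bundled wheels on a plain base image, no nvidia/cuda parent.
FROM quay.io/jupyter/scipy-notebook:latest

# Wheels from the cu121 index bundle the CUDA 12.1 runtime libraries,
# so the host only needs a recent enough NVIDIA driver.
RUN pip install --no-cache-dir \
    --index-url 'https://download.pytorch.org/whl/cu121' \
    torch torchvision torchaudio
```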

I looked into CUDA builds for TensorFlow and PyTorch (I am not sure if CUDA support is really a priority for other notebooks). In order to support NVIDIA driver versions older than 525, builds for CUDA 11.8 should also be included.

  • For PyTorch (https://pytorch.org/get-started/locally/) everything works as expected. I also added NVIDIA's pip index for their most recent packages as an install argument --extra-index-url=https://pypi.nvidia.com, but it's likely not needed.
  • For TensorFlow (https://www.tensorflow.org/install/source?hl=en#gpu) there were multiple issues:
    • CUDA 11: The most recent version (2.15.0) updated the CUDA dependencies from 11.8 to 12.2, so the last version that supports CUDA 11.8 is 2.14.0. However, that version is not functional under Python 3.11, because a dependency (tensorrt==8.5.3.1) is not available for Python 3.11: https://github.com/tensorflow/tensorflow/issues/61986
    • CUDA 12 (latest): Using pip install --no-cache-dir --extra-index-url=https://pypi.nvidia.com tensorrt tensorflow[and-cuda] didn't work for me: TensorFlow was unable to find TensorRT. I assume the workaround in the linked issue would work, but it seems better to either build the package manually, try nightly builds, or wait for 2.16.0.

Issue ticket if applicable

Fix (for pytorch): https://github.com/jupyter/docker-stacks/issues/1557

Checklist (especially for first-time contributors)

  • [x] I have performed a self-review of my code
  • [x] If it is a core feature, I have added thorough tests
    • [x] Example build: https://github.com/johanna-reiml-hpi/docker-stacks/actions/runs/7788922093
    • [x] Example images: https://quay.io/repository/ai-hpi/pytorch-notebook?tab=tags
    • [x] No aarch64 tests yet (I assume you want to use your self-hosted runners)
  • [x] I will try not to use force-push to make the review process easier for reviewers
  • [x] I have updated the documentation for significant changes
    • [x] Not sure what parts need to be updated. I mentioned the existence of optional CUDA tags under the pytorch-notebook section.

johanna-reiml-hpi avatar Feb 06 '24 11:02 johanna-reiml-hpi

From what I can see, the build for aarch64 fails due to running out of disk space (it errors when archiving the image; I saw similar out-of-disk errors for CUDA builds). Is there an issue with also enabling this step for the self-hosted aarch64 runners? Right now it only runs for x86_64.

      # Image with CUDA needs extra disk space
      - name: Free disk space 🧹
        if: contains(inputs.variant, 'cuda') && inputs.platform == 'x86_64'
        uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be
        with:
          tool-cache: false
          android: true
          dotnet: true
          haskell: true
          large-packages: false
          docker-images: false
          swap-storage: false

johanna-reiml-hpi avatar Feb 06 '24 23:02 johanna-reiml-hpi

Is there an issue with also enabling this part for the self-hosted aarch64 runners?

Self-hosted runners are not usually bloated with lots of software. I created our aarch64 runners using Ubuntu provided by Google Cloud and ran this on top: https://github.com/jupyter/docker-stacks/blob/main/aarch64-runner/setup.sh

Could you please tell me how much free disk space is needed to build one GPU image? I am not sure people need aarch64 GPU images, so let's not build them in this PR, though. Also, this will increase the build time (we have a limited number of self-hosted runners).

mathbunnyru avatar Feb 08 '24 09:02 mathbunnyru

@johanna-reiml-hpi I think the quality of this PR is awesome 👍 Great job. I left a few comments on things that may be improved; please take a look.

We now have a policy for merging new images and adding new packages 🎉

@mathbunnyru, @consideRatio, @yuvipanda, and @manics, please vote 👍 to accept this change and 👎 not to accept it (use a reaction to this message). The voting deadline is the 8th of March (a month after I posted this message). The change is accepted if there are at least 2 positive votes.

We can have a discussion until the deadline, so please express your opinions.

mathbunnyru avatar Feb 08 '24 17:02 mathbunnyru

Is there an issue with also enabling this part for the self-hosted aarch64 runners?

Self-hosted runners are not usually bloated with lots of software. I created our aarch64 runners using ubuntu provided by Google Cloud and running this on top: https://github.com/jupyter/docker-stacks/blob/main/aarch64-runner/setup.sh

Could you please tell me how much free disk is needed to build one GPU image? I am not sure people need aarch64 GPU images, so let's not build them in this PR, though. Also, this will increase the build time (we have limited amount of self-hosted runners)

I am not sure exactly how much space is needed, but the 14 GB provided for ubuntu-latest wasn't enough and resulted in failed builds. I assume around 20 GB in total would be sufficient.

johanna-reiml-hpi avatar Feb 11 '24 21:02 johanna-reiml-hpi

Is there an issue with also enabling this part for the self-hosted aarch64 runners?

Self-hosted runners are not usually bloated with lots of software. I created our aarch64 runners using ubuntu provided by Google Cloud and running this on top: https://github.com/jupyter/docker-stacks/blob/main/aarch64-runner/setup.sh Could you please tell me how much free disk is needed to build one GPU image? I am not sure people need aarch64 GPU images, so let's not build them in this PR, though. Also, this will increase the build time (we have limited amount of self-hosted runners)

I am not sure how much space is exactly needed, but the 14 GB provided for ubuntu-latest weren't enough and resulted in failed builds. I assume probably like 20 GB in total would be sufficient.

I cleaned up the space on our aarch64 runners. Unfortunately, Docker has an issue where cleaning up doesn't remove everything it should, so I had to clean up manually.

mathbunnyru avatar Feb 12 '24 09:02 mathbunnyru

It would be nice to have a GPU version of tensorflow as well. I saw the comment in the description, so thank you for trying.

I would suggest trying to install the conda-forge version of the tensorflow package as well once we merge this PR. If it works, create another PR after this one is merged (it should be small and easy); if not, there's nothing to do. In either case, please write in this PR whether it worked.

mathbunnyru avatar Feb 12 '24 09:02 mathbunnyru

Is there an issue with also enabling this part for the self-hosted aarch64 runners?

Self-hosted runners are not usually bloated with lots of software. I created our aarch64 runners using ubuntu provided by Google Cloud and running this on top: https://github.com/jupyter/docker-stacks/blob/main/aarch64-runner/setup.sh Could you please tell me how much free disk is needed to build one GPU image? I am not sure people need aarch64 GPU images, so let's not build them in this PR, though. Also, this will increase the build time (we have limited amount of self-hosted runners)

I am not sure how much space is exactly needed, but the 14 GB provided for ubuntu-latest weren't enough and resulted in failed builds. I assume probably like 20 GB in total would be sufficient.

I cleaned up the space on our aarch64 runners. Unfortunately, Docker has an issue, where cleaning up doesn't remove everything it should, so had to clean up manually.

Seems to work - the build is green 🎉

mathbunnyru avatar Feb 12 '24 11:02 mathbunnyru

One more thought - I don't know a lot about CUDA versioning, but it might be worth adding the precise CUDA version as a tag prefix too (so we need to add a new tagger for cuda-flavored images). People might want to (or be forced to) stay on 12.1 even when a 12.2 pytorch-cuda is released.

So, the prefix tag will stay cuda12- (nothing to change here), but the final image could look something like jupyter/pytorch-notebook:cuda12-cuda12.1, so people will be able to pin their CUDA version more precisely.

What do you think? I don't consider this to be a blocker for this PR, but it would definitely be nice to add this tag later.

mathbunnyru avatar Feb 12 '24 12:02 mathbunnyru

I had another look at the official PyTorch Dockerfile, the CUDA Dockerfiles, and the NVIDIA documentation for some env variables, so here are a few extra points to consider:

  • The following ENV variables should be added to the Dockerfile according to the linked NVIDIA documentation:
      ENV NVIDIA_VISIBLE_DEVICES all
      ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
  • I don't think the CUDA installation for aarch64 actually worked. The CUDA variant build times are identical to the CPU ones. I checked the official install guide, and when selecting MacOS you get: # CUDA is not available on MacOS, please use default package. The official PyTorch Dockerfile also only installs the CPU version for aarch64. I will exclude aarch64 from this PR.
  • The official PyTorch Dockerfile installs the CUDA dependencies via conda. This adds the CUDA binaries to conventional paths instead of shipping CUDA as a pip package (ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64, ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:$PATH). I am not sure which approach is better. The current version might be nicer in terms of simplicity (and it follows the official install guide).

johanna-reiml-hpi avatar Feb 12 '24 22:02 johanna-reiml-hpi

One more thought - I don't know a lot about cuda versioning, but it might be worth it to add more precise version of cuda as a tag prefix (so, we need to add a new tagger to cuda flavored images). People might want to (or be forced to) stay on 12.1 even when 12.2 pytorch-cuda is released.

So, the prefix tag will stay cuda12- (nothing to change here), but the final image can look something like jupyter/pytorch-notebook:cuda12-cuda12.1, so people will be to more precisely say they cuda version.

What do you think? I don't consider this to be a blocker for this PR, but it would definitely be nice to add this tag later.

Seems okay to add. However, it's also possible that PyTorch might add a --index-url 'https://download.pytorch.org/whl/cu122' alongside the current --index-url 'https://download.pytorch.org/whl/cu121'. In that case, image variants would have to be built for both versions. I think it might be less confusing to do something like this instead:

  • jupyter/pytorch-notebook:cuda12 (latest, so cuda12.2)
  • jupyter/pytorch-notebook:cuda12.2
  • jupyter/pytorch-notebook:cuda12.1

johanna-reiml-hpi avatar Feb 12 '24 22:02 johanna-reiml-hpi

One more thought - I don't know a lot about cuda versioning, but it might be worth it to add more precise version of cuda as a tag prefix (so, we need to add a new tagger to cuda flavored images). People might want to (or be forced to) stay on 12.1 even when 12.2 pytorch-cuda is released. So, the prefix tag will stay cuda12- (nothing to change here), but the final image can look something like jupyter/pytorch-notebook:cuda12-cuda12.1, so people will be to more precisely say they cuda version. What do you think? I don't consider this to be a blocker for this PR, but it would definitely be nice to add this tag later.

Seems okay to add. However, it's also possible that pytorch might add a --index-url 'https://download.pytorch.org/whl/cu122' alongside the current --index-url 'https://download.pytorch.org/whl/cu121'. In that case image variants would have to be built for both versions. I think it might be less confusing to do something like this instead:

  • jupyter/pytorch-notebook:cuda12 (latest, so cuda12.2)
  • jupyter/pytorch-notebook:cuda12.2
  • jupyter/pytorch-notebook:cuda12.1

If you can implement this purely as a tag feature, and not as a whole new image, then it would be great. It might be easier to implement it the way I mentioned, though (adding a new tagger which adds a more explicit CUDA version).

I don't want to have a separate image for each X.Y version of CUDA. We usually only support the latest versions of packages. In the case of CUDA it makes sense to support both cuda11 and cuda12, but having to always support cuda12.Y is too much.

mathbunnyru avatar Feb 12 '24 22:02 mathbunnyru

@consideRatio @manics please, vote here: https://github.com/jupyter/docker-stacks/pull/2091#issuecomment-1934654952

I am ready to merge this before the vote deadline if I receive one more positive vote.

mathbunnyru avatar Feb 14 '24 09:02 mathbunnyru

Vote to drop CUDA ambitions for tensorflow-notebook

If someone manages to enable a CUDA-enabled tensorflow image for our docker stacks in a sane way (installing pip/mamba packages, no manual builds), then it means the ecosystem is mature enough for us to support it. I won't be against merging such a change. In any case, after merging this PR it should be easy to experiment.

I mostly agree with @consideRatio's "Misc PR review" section - @johanna-reiml-hpi please, take a look.

Supported architectures

👍

Variant description

I prefer not to change anything here (I really like that all the cuda11 image tags have a long common prefix like jupyter/pytorch-notebook:cuda11-)

PR Title

👍

Cuda policy

👍

mathbunnyru avatar Feb 21 '24 20:02 mathbunnyru

Ah, I thought it was a suffix, having seen f"{platform}-{variant}" somewhere, and that it should be read as a suffix because it is put last there. But I think I misunderstood this and it's really a high-priority prefix in the tag, so it makes sense to still reference it as a prefix.

I think the description could still be improved, though, as I struggled to piece things together. Something like "Some images have variants; they are published under the same image name but get a prefix added to the tag." would have been helpful to read somewhere.

consideRatio avatar Feb 21 '24 20:02 consideRatio

The code is great, and the vote was also successful, so I am merging this one - I will add a policy and some documentation myself.

mathbunnyru avatar Feb 24 '24 10:02 mathbunnyru

Done: https://github.com/jupyter/docker-stacks/pull/2097/files

mathbunnyru avatar Feb 24 '24 10:02 mathbunnyru

New images are pushed and ready to use 🎉 https://quay.io/repository/jupyter/pytorch-notebook?tab=tags

mathbunnyru avatar Feb 24 '24 11:02 mathbunnyru

@johanna-reiml-hpi thank you for this PR. It's nice to have a GPU-enabled image at least for pytorch, and I think the variant feature might be useful in other circumstances as well.

mathbunnyru avatar Feb 24 '24 20:02 mathbunnyru