
Minor Docker fixes & podman "rootless" container support (see comments)

fat-tire opened this issue 2 years ago

I was intrigued by the Docker support and decided to try it. When it comes to containers, I always prefer Podman running "rootless" rather than Docker running as root, so I made a few changes to support this as well.

This was tested on both Podman and Docker.

Notes:

  • To run w/Podman, just set CONTAINER_ENGINE="podman" (the default is "docker") in build.sh and run.sh; see the sketch after this list. Otherwise, everything should hopefully run as before.
  • For fun, it was also built on a Raspberry Pi 4 w/Debian 11. No, I couldn't generate any images, as the 8 GB of RAM quickly filled up, but at least it built and got the web interface up and running.
  • I do have Podman running at full-speed with cuda on my machine, but I didn't want to dramatically break any of the current docker behavior (and wasn't really sure how cuda was even working as-is). My cuda Dockerfile uses the nvidia/cuda ubuntu image as a base, not the python base image. If there is interest I can clean up the other Dockerfile and provide it. I'm guessing it will work with Docker as well.
  • To support Podman, the last commit f70fb02 pulls out two Docker features that I really wanted to keep but that don't yet have Podman support. These changes shouldn't break Docker's build, but may make subsequent builds less efficient. I hope Podman 4.0 will support at least some of them. See the notes in that commit.
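
For example, the top of build.sh and run.sh ends up looking roughly like this (just a sketch of the idea, not the exact contents of the PR):

#!/bin/bash
# Pick the container engine; "docker" stays the default if nothing is set.
CONTAINER_ENGINE="${CONTAINER_ENGINE:-docker}"

# Everything downstream calls ${CONTAINER_ENGINE} instead of a hard-coded "docker", e.g.:
${CONTAINER_ENGINE} build --tag invokeai .
${CONTAINER_ENGINE} run --rm --interactive --tty invokeai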

Could someone with podman and/or docker test it? Even if y'all don't want ALL the commits here, hopefully some of it will be of value for Docker users too.

Enjoy!

fat-tire avatar Feb 19 '23 04:02 fat-tire

I was intrigued by the Docker support and decided to try it. When it comes to containers, I always prefer Podman running "rootless" rather than Docker running as root, so I made a few changes to support this as well.

Docker can also be used rootless: https://docs.docker.com/engine/security/rootless/

  • I do have Podman running at full-speed with cuda on my machine, but I didn't want to dramatically break any of the current docker behavior (and wasn't really sure how cuda was even working as-is). My cuda Dockerfile uses the nvidia/cuda ubuntu image as a base, not the python base image. If there is interest I can clean up the other Dockerfile and provide it. I'm guessing it will work with Docker as well.

People were already using this image with the cuda runtime: https://invoke-ai.github.io/InvokeAI/installation/040_INSTALL_DOCKER/

  • To support Podman, the last commit f70fb02 pulls out two Docker features that I really wanted to keep but that don't yet have Podman support. These changes shouldn't break Docker's build, but may make subsequent builds less efficient. I hope Podman 4.0 will support at least some of them. See the notes in that commit.

But removing the build cache is really not an option, since it is not only used by our CI/CD. Also, it would be nice to keep the linked copy job and not have to change the default value of one build argument in three stages.

And btw: I always make sure that the built image is compatible with https://www.runpod.io, so maybe your problems could already be solved by pulling the built image from https://hub.docker.com/r/invokeai/invokeai instead of building it locally.
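
For example:

# same image, either engine:
docker pull invokeai/invokeai:latest
podman pull docker.io/invokeai/invokeai:latest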

mauwii avatar Feb 19 '23 06:02 mauwii

Does runpod use podman or docker?

I can try to pull the full image from docker hub and see what happens. The build issues at least should not be a factor. But I think I'll be stuck w/cpu until I can figure out how to use the existing image with the cuda runtime.

In the meantime, I may as well push the requested changes here, and you can decide if you want to use any parts of it. If not, I can always just host a "Podman/CUDA"-specific version for my own use. No big deal.

fat-tire avatar Feb 19 '23 06:02 fat-tire

Does runpod use podman or docker?

I can try to pull the full image from docker hub and see what happens. The build issues at least should not be a factor. But I think I'll be stuck w/cpu until I can figure out how to use the existing image with the cuda runtime.

The latest tag is built for CUDA

mauwii avatar Feb 19 '23 06:02 mauwii

[...] (and wasn't really sure how cuda was even working as-is). My cuda Dockerfile uses the nvidia/cuda ubuntu image as a base, not the python base image [...]

@fat-tire: The torch cuda-enabled distribution bundles the entirety of its CUDA dependencies (cuda, cudnn, cublas, etc.). That is the reason for its ~1.8 GB installed size on Linux (w/ cuda), vs. mere megabytes on Mac (Windows is somewhere in between). So you'll be happy to know that the massive nvidia/cuda image is redundant when using cuda-supporting torch - you can use any minimal image as needed. Hope this saves you ~2 GB of base layer :)
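
A quick way to sanity-check that from inside any container (assuming python3 and the cuda-enabled torch wheel are what's on the path):

python3 -c 'import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())'
# expect something like "1.13.1+cu117 11.7 True" when the GPU is actually reachable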

ebr avatar Feb 20 '23 07:02 ebr

Does runpod use podman or docker?

Runpod uses Kubernetes as far as I can tell (so it's containerd, most likely), but generally any OCI-compliant image should work equally well with docker, podman, or k8s. Not sure of the rootless implications specifically, but could build-time compatibility issues (buildkit syntax support, etc.) be avoided by building with docker and running with podman? Or does that undermine anything for your use case, specifically the need to avoid running docker as root?
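
e.g. something like this should work for the build-with-docker / run-with-podman route (untested off the top of my head, and the build flags are illustrative):

docker build --tag invokeai .
docker save invokeai | podman load
podman run --rm --interactive --tty invokeai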

ebr avatar Feb 20 '23 07:02 ebr

@fat-tire: The torch cuda-enabled distribution bundles the entirety of its CUDA dependencies (cuda, cudnn, cublas, etc.). That is the reason for its ~1.8 GB installed size on Linux (w/ cuda), vs. mere megabytes on Mac (Windows is somewhere in between). So you'll be happy to know that the massive nvidia/cuda image is redundant when using cuda-supporting torch - you can use any minimal image as needed. Hope this saves you ~2 GB of base layer :)

Hey, thanks for the response! I did try building with "cuda" set as the flavor (manually, just to be sure), and did notice a ton of packages coming in, and as you suggest, at first I assumed I had everything I needed to run w/cuda (as the docs said). But no matter what I did and how I started it (again, this is with Podman), it kept coming up "cpu". Going to the container shell, running python, importing torch, and checking if the gpu is available, I constantly got "False". I thought maybe the problem was that the container didn't have access to the nvidia hardware, so I tried adding to the run.sh file all the

     --device /dev/dri \
     --device /dev/input \
     --device /dev/nvidia0 \
     --device /dev/nvidiactl \
     --device /dev/nvidia-modeset \
     --device /dev/nvidia-uvm \
     --device /dev/nvidia-uvm-tools \

stuff, and I played with giving it --privileged permissions, added GPU_FLAGS=all, and manually added --gpus set to all as well. Tried a bunch of combinations. Nothing was working.

As a last resort, I tried the nvidia/cuda repo as a base and boom, it came up. Well, once I also installed the nvidia driver matching my host driver, that is.

As for your suggestion to not worry about building in Podman and just use the prebuilt images -- sure, that would be fine, and that way there's no need to pull out all those things that aren't working in Podman yet. I've yet to try pulling/running the pre-built docker-built container rather than building it myself, but I'll give it a try in the next day or so and report back.

Thx again!

fat-tire avatar Feb 20 '23 20:02 fat-tire

Had some time to do a Podman test with the officially prebaked image (latest tag) on Docker Hub -- here's the command to run it. (Note I'm using ./mounts/invokeai and ./mounts/outputs for /data and /data/outputs just because I like to organize things in local folders rather than as a volume, but this should make no difference really.)

Command to run:

GPU_FLAGS=all podman run --interactive --tty --rm --name=invokeai --hostname=invokeai \
--mount type=bind,source=./mounts/invokeai,target=/data \
--mount type=bind,source=./mounts/outputs,target=/data/outputs \
--publish=9090:9090 --user=appuser:appuser --gpus=all \
--device /dev/dri --device /dev/input --device /dev/nvidia0 \
--device /dev/nvidiactl --device /dev/nvidia-modeset \
--device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools \
--cap-add=sys_nice invokeai/invokeai

Note I'm including explicit access to the nvidia devices just in case it's needed when running rootless. (It's what I use when I'm running the nvidia/cuda image, so I didn't want to make any changes, since I know they provide the access the container needs. For all I know --gpus=all might be enough, but I didn't want to introduce anything different from the known-working command.)

Running this command downloads the image from Docker Hub and runs it, resulting in the (expected) error:

PermissionError: [Errno 13] Permission denied: '/data/models'

This was expected due to the uid/gid ownership issue discussed above.

The workaround was to run this once. It can only be run after the image is downloaded and the files/volumes are created:

#!/bin/bash
CONTAINER_ENGINE="podman"

# Podman only:  set ownership for user 1000:1000 (appuser) the right way
# this fixes PermissionError: [Errno 13] Permission denied: '/data/models'
if [[ ${CONTAINER_ENGINE} == "podman" ]] ; then
   echo Setting ownership for container\'s appuser on /data and /data/outputs
   podman run \
      --mount type=bind,source="$(pwd)/mounts/invokeai",target=/data \
      --user root --entrypoint "/bin/chown" "invokeai" \
      -R 1000:1000 /data
   podman run \
      --mount type=bind,source="$(pwd)"/mounts/outputs,target=/data/outputs \
      --user root --entrypoint "/bin/chown" "invokeai" \
      -R 1000:1000 /data/outputs
fi
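
(An alternative I haven't tried here would be podman unshare, which runs a command inside the rootless user namespace, so the same fix could be done straight from the host without a helper container:)

# run on the host; chowns the bind sources as seen by the rootless container
podman unshare chown -R 1000:1000 ./mounts/invokeai ./mounts/outputs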

Now that the mounted directories have the correct ownership, the above podman run command works, the various models are loaded, and the web server starts.

The problem though--

$ ./run.sh 
Setting ownership for container's appuser on /data and /data/outputs
* Initializing, be patient...
>> Initialization file /data/invokeai.init found. Loading...
>> Internet connectivity is True
>> InvokeAI, version 2.3.1
>> InvokeAI runtime directory is "/data"
>> GFPGAN Initialized
>> CodeFormer Initialized
>> ESRGAN Initialized
>> Using device_type cpu
>> xformers not installed
>> Initializing NSFW checker

CPU, not CUDA.

From the container:

root@invokeai:/usr/src# python3
Python 3.9.16 (main, Feb  9 2023, 05:40:23) 
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
False

Also:

root@invokeai:/usr/src# pip list | grep torch
clip-anytorch               2.5.0
pytorch-lightning           1.7.7
torch                       1.13.1
torch-fidelity              0.3.0
torchdiffeq                 0.2.3
torchmetrics                0.11.1
torchsde                    0.2.5
torchvision                 0.14.1
root@invokeai:/usr/src#  pip list | grep nvidia
nvidia-cublas-cu11          11.10.3.66
nvidia-cuda-nvrtc-cu11      11.7.99
nvidia-cuda-runtime-cu11    11.7.99
nvidia-cudnn-cu11           8.5.0.96

Contrast with my nvidia/cuda image-based container:

root@invokeai:/usr/src# python3
Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True

and

* Initializing, be patient...
>> Initialization file /data/invokeai.init found. Loading...
>> Internet connectivity is True
>> InvokeAI, version 2.3.1
>> InvokeAI runtime directory is "/data"
>> GFPGAN Initialized
>> CodeFormer Initialized
>> ESRGAN Initialized
>> Using device_type cuda

and

root@invokeai:/usr/src# pip list | grep torch
clip-anytorch               2.5.0
pytorch-lightning           1.7.7
torch                       1.13.1+cu117
torch-fidelity              0.3.0
torchdiffeq                 0.2.3
torchmetrics                0.11.1
torchsde                    0.2.5
torchvision                 0.14.1+cu117
# pip list | grep nvidia
root@invokeai:/usr/src# 

Most obviously, torch/torchvision have cu117 in my version, but for some reason this didn't make it into the upstream container (or it wasn't properly d/l'd if it was supposed to be).

Thoughts?

fat-tire avatar Feb 26 '23 21:02 fat-tire

For a moment I thought perhaps I was just using the wrong tag and getting the cpu version as a result, but trying the :main-cuda tag came up "cpu" as well:

>> Using device_type cpu

If I pip uninstall torch torchvision and then pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu117, I can get the right version of torch in there. But it's not there by default.
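
i.e., inside the running container:

pip uninstall -y torch torchvision
pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu117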

root@invokeai:/data# pip list | grep cu117
torch                       1.13.1+cu117
torchaudio                  0.13.1+cu117
torchvision                 0.14.1+cu117

Unfortunately, this didn't work. I also tried installing cu113 (no dice), and installing the cuda package from nvidia directly into the base image per nvidia's instructions.

This didn't work either.

fat-tire avatar Feb 28 '23 02:02 fat-tire

I don't know podman at all, sadly; just looked over some of the docs... but this is very curious indeed if the nvidia/cuda-based images work correctly for you, but not an image with torch +cu117 installed. Because as I mentioned above, torch bundles all of the required cuda dependencies and works out of the box - that's been well tested, albeit only in Docker and Kubernetes.

I see you're mapping quite a lot of devices in one of the above commands. Are you certain you're running the invoke image identically to the known-working nvidia/cuda one?

Curious as to the minimal set of podman arguments required to get the nvidia/cuda image to see the GPU, and whether the same works with the invoke image... does podman run --device nvidia.com/gpu0 work for you? The docs seem to suggest this is the correct way.
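
Something along these lines is what the podman + NVIDIA docs appear to describe (untested on my end, and it assumes the NVIDIA Container Toolkit's nvidia-ctk is installed on the host):

# generate a CDI spec for the host's GPUs, then expose them to a container
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi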

ebr avatar Mar 04 '23 06:03 ebr

Thanks for taking a look. Yeah, I'm sure I am running it the same way, as I copy/pasted the run command from the nvidia/cuda run.sh, minus some QT_FONT_DPI cruft I had that I'm pretty sure isn't needed. I literally added an echo to the front of the run command, then used the output of that with the invokeai prebaked image, so I knew I was running it just as run.sh was running it. Also tried prepending GPU_FLAGS=all. It always comes up "cpu" no matter what I seem to do.

Trying with --device nvidia.com/gpu0 returned an Error: stat nvidia.com/gpu0: no such file or directory failure. I hadn't seen this syntax before, and it didn't seem to work.

Curious why the main-cuda tagged image includes versions 1.13.1 and 0.14.1 of torch/torchvision by default rather than the cu117 versions? Again, after running, then manually removing these versions and re-installing with pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu117, they did appear in pip, but again, no dice getting "cuda" to appear when running invokeai --web right afterwards:

appuser@invokeai:/usr/src$ pip list | grep torch
clip-anytorch               2.5.0
pytorch-lightning           1.7.7
torch                       1.13.1+cu117
torch-fidelity              0.3.0
torchdiffeq                 0.2.3
torchmetrics                0.11.1
torchsde                    0.2.5
torchvision                 0.14.1+cu117
appuser@invokeai:/usr/src$ python
Python 3.9.16 (main, Feb  9 2023, 05:40:23) 
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
False
>>> quit()

fat-tire avatar Mar 04 '23 07:03 fat-tire

Thanks for taking a look. Yeah, I'm sure I am running it the same way, as I copy/pasted the run command from the nvidia/cuda run.sh, minus some QT_FONT_DPI cruft I had that I'm pretty sure isn't needed. I literally added an echo to the front of the run command, then used the output of that with the invokeai prebaked image, so I knew I was running it just as run.sh was running it. Also tried prepending GPU_FLAGS=all. It always comes up "cpu" no matter what I seem to do.

Trying with --device nvidia.com/gpu0 returned an Error: stat nvidia.com/gpu0: no such file or directory failure. I hadn't seen this syntax before, and it didn't seem to work.

Curious why the main-cuda tagged image includes versions 1.13.1 and 0.14.1 of torch/torchvision by default rather than the cu117 versions? Again, after running, then manually removing these versions and re-installing with pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu117, they did appear in pip, but again, no dice getting "cuda" to appear when running invokeai --web right afterwards:

appuser@invokeai:/usr/src$ pip list | grep torch
clip-anytorch               2.5.0
pytorch-lightning           1.7.7
torch                       1.13.1+cu117
torch-fidelity              0.3.0
torchdiffeq                 0.2.3
torchmetrics                0.11.1
torchsde                    0.2.5
torchvision                 0.14.1+cu117
appuser@invokeai:/usr/src$ python
Python 3.9.16 (main, Feb  9 2023, 05:40:23) 
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
False
>>> quit()

I don't know if it's still necessary, but the last time I needed to use graphics acceleration inside a podman container (Fedora Silverblue, to use an old version of invoke), I had to install the nvidia-container-runtime package (which only has a version for RHEL 8.3, but it works on Fedora) on the host system, plus the respective gpu drivers inside the container as well. Also, I had to edit /etc/nvidia-container-runtime/config.toml and set no-cgroups = true.
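
If I remember right, that edit was just flipping the commented-out default, something like (the exact line may differ between package versions):

sudo sed -i 's/^#no-cgroups = false/no-cgroups = true/' /etc/nvidia-container-runtime/config.toml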

I don't have an Nvidia gpu anymore, so I don't have anything to test with, but maybe this is the way to go.

kriptolix avatar Mar 05 '23 01:03 kriptolix

Well, I have some good news... I got it running from the pre-built image on podman/rootless! What's nice is that the image is significantly smaller, even with the nvidia drivers installed, than the previous container I was using, and rebuilding will be VERY fast now.

Steps:

  1. Modify the Dockerfile to start FROM the upstream prebuilt image. Then apt-get install the curl and kmod packages, curl/download and install the Nvidia driver into the container (the container driver version must match the host driver version), and finally clean up the installer.
  2. Create a new build.sh to determine the correct nvidia driver version (since it will change in the future), then build the image with buildah bud, passing in the driver info.
  3. Create a run.sh script which MUST explicitly allow access to the various nvidia devices (or else nvidia-smi gives a "can't connect" error). Before every run it makes sure the mounted volumes/binds have the correct 1000:1000 ownership for the local user, to avoid the PermissionError: [Errno 13] Permission denied: '/data/models' podman errors.

I have all of the above working with the latest "3.0.0+a0" version.

Does anyone want this code... or what should I do with it? Since the real work is done in the prebuilt image, these are all very small, simple files. With a $CONTAINER_ENGINE flag, it could be integrated into the source in this repo... but I can also make a dedicated podman_invokeai repository solely for this purpose (it would be maybe 4 files and a README). I don't know if there are other podman users who would be interested or not.

I didn't have to touch anything in /etc/nvidia-container-runtime/, and I didn't even have to install the nvidia-container-runtime package on the host at all, though I am running the 525 driver, so I think it's included now.

I don't know if it's still necessary, but the last time I needed to use graphics acceleration inside a podman container (Fedora Silverblue, to use an old version of invoke), I had to install the nvidia-container-runtime package (which only has a version for RHEL 8.3, but it works on Fedora) on the host system, plus the respective gpu drivers inside the container as well. Also, I had to edit /etc/nvidia-container-runtime/config.toml and set no-cgroups = true.

I don't have an Nvidia gpu anymore, so I don't have anything to test with, but maybe this is the way to go.

fat-tire avatar Mar 07 '23 05:03 fat-tire

Great to hear you got it working! If there were a way to run this without installing the nvidia driver into the image, that would be ideal, in my opinion. Generally you really want to be using the driver that is already loaded by the kernel. But perhaps that's a hard limitation due to podman's rootless nature - I'm not sure.

I think your work here is valuable for supporting users who wish to run in a rootless container. Is there a way to do this without maintaining a separate Dockerfile and build/run scripts? I'll leave it up to @mauwii to make the call on how to proceed next.

ebr avatar Mar 07 '23 15:03 ebr

I already addressed a lot of changes (see the 11 unresolved conversations) and made clear that I would not want to remove the caching 😅

mauwii avatar Mar 07 '23 19:03 mauwii

Thanks for addressing the changes in the other convos--

You wouldn't have to remove the caching, as podman does now run with the prebaked image (built with caching, the linked copy, etc.). My working Dockerfile builds on top of that by adding the NVIDIA driver. It basically looks like this:

FROM docker.io/invokeai/invokeai:main-cuda

ARG ARCH=x86_64
ARG NVIDIA_VERSION=525.85

USER 0
RUN apt update && apt install -y kmod curl
RUN cd /tmp && curl https://us.download.nvidia.com/XFree86/Linux-${ARCH}/${NVIDIA_VERSION}/NVIDIA-Linux-${ARCH}-${NVIDIA_VERSION}.run -o /tmp/NVIDIA-Linux-${ARCH}-${NVIDIA_VERSION}.run \
       && bash /tmp/NVIDIA-Linux-${ARCH}-${NVIDIA_VERSION}.run --no-kernel-module --no-kernel-module-source --run-nvidia-xconfig --no-backup --no-questions --accept-license --ui=none \
       && rm -f /tmp/NVIDIA-Linux-${ARCH}-${NVIDIA_VERSION}.run \
       && rm -rf /tmp/*
RUN apt remove --purge kmod curl -y && apt-get clean

The build.sh looks something like:

#!/bin/bash

CONTAINER_BUILD="buildah bud"
TAG=latest

if [ -z "$RESOLVE_NVIDIA_VERSION" ]; then
   export NVIDIA_VERSION=`nvidia-smi --query-gpu=driver_version --format=csv,noheader`
else
   export NVIDIA_VERSION="${RESOLVE_NVIDIA_VERSION}"
fi

${CONTAINER_BUILD} -t "invokeai:${TAG}" -t "invokeai" --build-arg ARCH=`uname -m` --build-arg NVIDIA_VERSION="${NVIDIA_VERSION}" .

As you can see, it passes the current NVIDIA driver version and ARCH to the build command. The driver in the container has to match the host's, so this may be an issue for making any generic image for rootless use.

I did try avoiding installing the nvidia driver, and instead tried using only the nvidia-container-runtime package in the Dockerfile. (Notes for automating that installation are here.)

RUN apt update && apt install gpg curl -y
RUN distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
            sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
            tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
RUN apt update && apt install -y nvidia-container-runtime
RUN apt remove --purge kmod curl gpg -y && apt-get clean

But this did NOT work, at least not for me.

The run.sh is pretty much the same, except these --device lines were needed, plus the couple of lines in this PR that verify the user permissions of the mount/bind volumes/directories -- which I guess is a rootless thing, to make sure that the uid & gid of the user running in the container can access the volumes/directories. See the sketch after the device list below.

  --device /dev/dri --device /dev/input --device /dev/nvidia0 \
  --device /dev/nvidiactl --device /dev/nvidia-modeset \
  --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools \
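
Put together, the run.sh is roughly this (a sketch -- "invokeai" is the locally-built tag from the build.sh above, and the permissions helper is the chown script posted earlier in the thread; its name here is illustrative):

#!/bin/bash
# make sure the bind sources are owned by the container's appuser (1000:1000) before every run
./fix-permissions.sh

podman run --interactive --tty --rm --name=invokeai --hostname=invokeai \
   --mount type=bind,source=./mounts/invokeai,target=/data \
   --mount type=bind,source=./mounts/outputs,target=/data/outputs \
   --publish=9090:9090 --user=appuser:appuser --gpus=all \
   --device /dev/dri --device /dev/input --device /dev/nvidia0 \
   --device /dev/nvidiactl --device /dev/nvidia-modeset \
   --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools \
   --cap-add=sys_nice invokeai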

That's basically all I got! Just happy it's working now, and while I don't know what, if anything, above can be integrated, hopefully someone else running rootlessly can find value in it.

fat-tire avatar Mar 08 '23 00:03 fat-tire

Hey, it's been a few weeks and I'm inclined to close this just so as not to clutter up the PR area. Is there anything anyone wants from here? I've got podman running locally w/the method outlined above -- one thing I did add recently is:

  --env TRANSFORMERS_CACHE=/data/.cache \

to the podman run command, as it wasn't set anywhere else.

I've also noticed I get this:

 Server error: [Errno 18] Invalid cross-device link:

when trying to delete an image via the trash can icon in the web UI, apparently because image files can't be moved (renamed) from /data/outputs/ to /data/.Trash-1000/files/ -- Errno 18 (EXDEV) means a rename across two different mounts. I had this problem previously. Dunno if it's podman-only or an issue with the mounts or what.

fat-tire avatar Mar 17 '23 04:03 fat-tire

@fat-tire I'm going to close this PR as outdated - we'll have it for reference if/when implementing Podman support, as discussed in #3587

ebr avatar Jul 28 '23 00:07 ebr