Minor Docker fixes & podman "rootless" container support (see comments)
I was intrigued by the Docker support and decided to try it. When it comes to containers, I always prefer Podman running "rootless" rather than Docker running as root, so made a few changes to support this as well.
This was tested on both Podman and Docker.
Notes:
- To run w/Podman, just set `CONTAINER_ENGINE="podman"` (the default is `"docker"`) in `build.sh` and `run.sh` (see the sketch after this list). Otherwise, everything should hopefully run as before.
- For fun, it was also built on a Raspberry Pi 4 w/Debian 11. No, I couldn't generate any images as the 8GB of RAM quickly filled up, but at least it built and got the web interface up and running.
- I do have Podman running at full speed with cuda on my machine, but I didn't want to dramatically break any of the current docker behavior (and wasn't really sure how cuda was even working as-is). My cuda Dockerfile uses the nvidia/cuda ubuntu image as a base, not the python base image. If there is interest I can clean up the other Dockerfile and provide it. (I'm guessing it will work with Docker as well.)
- To support Podman, the last commit f70fb02 pulls two Docker features that I really wanted to keep but don't yet have Podman support. These changes shouldn't break Docker's build, but may make subsequent builds less-efficient. I hope Podman 4.0 will support at least some of them. See the notes in that commit.
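For anyone curious what the engine switch looks like, here's a minimal sketch (the variable and file names follow this PR, but treat this as illustrative only, not the actual script):

#!/bin/bash
# Minimal sketch of the CONTAINER_ENGINE switch described above (illustrative only).
# Set CONTAINER_ENGINE="podman" to use Podman; anything else falls back to Docker.
CONTAINER_ENGINE="${CONTAINER_ENGINE:-docker}"

if [[ "${CONTAINER_ENGINE}" == "podman" ]]; then
  BUILD_CMD="podman build"
else
  BUILD_CMD="docker build"
fi

# Build the image with whichever engine was selected.
${BUILD_CMD} -t invokeai -f Dockerfile .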
Could someone with podman and/or docker test it? Even if y'all don't want ALL the commits here, hopefully some of it will be of value for Docker users too.
Enjoy!
I was intrigued by the Docker support and decided to try it. When it comes to containers, I always prefer Podman running "rootless" rather than Docker running as root, so made a few changes to support this as well.
Docker can also be used rootless: https://docs.docker.com/engine/security/rootless/
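For reference, the rootless Docker setup in those docs boils down to roughly this (prerequisites like uidmap are omitted here; see the linked page for the full steps):

# Install and start the rootless Docker daemon for the current user (per the linked docs):
dockerd-rootless-setuptool.sh install
# Point the Docker CLI at the rootless daemon's socket:
export DOCKER_HOST=unix://$XDG_RUNTIME_DIR/docker.sock
docker run --rm hello-world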
- I do have Podman running at full speed with cuda on my machine, but I didn't want to dramatically break any of the current docker behavior (and wasn't really sure how cuda was even working as-is). My cuda Dockerfile uses the nvidia/cuda ubuntu image as a base, not the python base image. If there is interest I can clean up the other Dockerfile and provide it. (I'm guessing it will work with Docker as well.)
People were already using this image with the cuda runtime: https://invoke-ai.github.io/InvokeAI/installation/040_INSTALL_DOCKER/
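The documented usage amounts to running the prebuilt image with Docker's GPU support enabled, along these lines (a paraphrase, not the literal commands from that page; the local directory name is just an example):

# Roughly how the documented CUDA-enabled Docker run looks (paraphrased):
docker run --rm -it --gpus=all --publish=9090:9090 \
  --mount type=bind,source="$(pwd)"/data,target=/data \
  invokeai/invokeai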
- To support Podman, the last commit f70fb02 pulls two Docker features that I really wanted to keep but don't yet have Podman support. These changes shouldn't break Docker's build, but may make subsequent builds less-efficient. I hope Podman 4.0 will support at least some of them. See the notes in that commit.
But removing the build cache is really not an option, since it is used by more than just our CI/CD. Also, it would be nice to keep the linked copy job and not have to change default values for one build argument in three stages.
And btw: I always make sure that the built image is compatible with https://www.runpod.io, so maybe your problems could already be solved by pulling the built image from https://hub.docker.com/r/invokeai/invokeai instead of building it locally
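Pulling the prebuilt image instead of building locally is just:

# Pull the prebuilt image from Docker Hub rather than building it yourself:
docker pull invokeai/invokeai:latest
# or, with Podman (fully-qualified registry name):
podman pull docker.io/invokeai/invokeai:latest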
Does runpod use podman or docker?
I can try to pull the full image from Docker Hub and see what happens. The build issues at least should not be a factor. But I think I'll be stuck w/cpu until I can figure out how to use the existing image with the cuda runtime.
In the meantime, I may as well push the requested changes here, and you can decide if you want to use any parts of it. If not, I can always just host a "Podman/CUDA"-specific version for my own use. No big deal.
Does runpod use podman or docker?
I can try to pull the full image from Docker Hub and see what happens. The build issues at least should not be a factor. But I think I'll be stuck w/cpu until I can figure out how to use the existing image with the cuda runtime.
The latest tag is built for CUDA
[...] (and wasn't really sure how cuda was even working as-is). My cuda Dockerfile uses the nvidia/cuda ubuntu image as a base, not the python base image [...]
@fat-tire: The torch cuda-enabled distribution bundles the entirety of its CUDA dependencies (cuda,cudnn, cublas etc.) That is the reason for its ~1.8GB installed size for Linux (w/ cuda), vs mere megabytes for Mac (and Windows is somewhere in between). So you'll be happy to know that the massive nvidia/cuda image is redundant when using cuda-supporting torch - you can use any minimal image as needed. hope this saves you 2GB of base layer :)
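A quick way to check which torch build actually landed in an image, for example:

# Print the installed torch version, the CUDA toolkit it was built against
# (None for CPU-only wheels), and whether a GPU is visible at runtime:
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"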
Does runpod use podman or docker?
Runpod uses Kubernetes as far as I can tell (so it's containerd most likely), but generally any OCI-compliant image should work equally well with either docker, podman, or k8s. Not sure of the rootless implications specifically, but could build-time compatibility issues (buildkit syntax support, etc) be avoided by building with docker and running with podman? Or does that undermine anything for your use case, specifically needing to run docker as root?
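For concreteness, that "build with docker, run with podman" workflow would be something like this (untested sketch):

# Build with Docker (keeping BuildKit cache/syntax features), then hand the image to Podman:
docker build -t invokeai:local .
docker save invokeai:local | podman load
podman run --rm -it invokeai:local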
@fat-tire: The `torch` cuda-enabled distribution bundles the entirety of its CUDA dependencies (cuda, cudnn, cublas etc.) That is the reason for its ~1.8GB installed size for Linux (w/ cuda), vs mere megabytes for Mac (and Windows is somewhere in between). So you'll be happy to know that the massive `nvidia/cuda` image is redundant when using cuda-supporting `torch` - you can use any minimal image as needed. hope this saves you 2GB of base layer :)
Hey, thanks for the response! I did try building with "cuda" set as the flavor (manually, just to be sure), and did notice a ton of packages coming in, and, as you suggest, at first I assumed I had everything I needed to run w/cuda (as the docs said). But no matter what I did and how I started it (again, this is with Podman), it kept coming up "cpu". Going to the container shell, running Python, importing torch and checking if the GPU is available, I constantly got "False". I thought maybe the problem was that the container didn't have access to the nvidia hardware, so I tried adding to the run.sh file all the
--device /dev/dri \
--device /dev/input \
--device /dev/nvidia0 \
--device /dev/nvidiactl \
--device /dev/nvidia-modeset \
--device /dev/nvidia-uvm \
--device /dev/nvidia-uvm-tools \
stuff, and I played with giving it --privileged permissions, added GPU_FLAGS=ALL, and manually added --gpus set to all as well. I tried a bunch of combinations; nothing was working.
As a last resort, I tried the nvidia/cuda repo as a base and boom, it came up. Well, once I also installed the nvidia-driver matching my host driver, that is.
As for your suggestion to not worry re building in Podman and just use the prebuilt images-- sure that would be fine and that way there's no need to pull all those things that aren't working in Docker yet. I've yet to try pulling/running the pre-built docker-built container rather than building it myself but I'll give it a try in the next day or so and report back.
Thx again!
Had some time to do a Podman test with the officially prebaked image (latest tag) on Docker Hub -- here's the command to run it (note I'm using both ./mounts/outputs and ./mounts/invokeai for /data/outputs and /data just because I like to organize things in local folders rather than as a volume, but this should make no difference really).
Command to run:
GPU_FLAGS=all podman run --interactive --tty --rm --name=invokeai --hostname=invokeai \
--mount type=bind,source=./mounts/invokeai,target=/data \
--mount type=bind,source=./mounts/outputs,target=/data/outputs \
--publish=9090:9090 --user=appuser:appuser --gpus=all \
--device /dev/dri --device /dev/input --device /dev/nvidia0 \
--device /dev/nvidiactl --device /dev/nvidia-modeset \
--device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools \
--cap-add=sys_nice invokeai/invokeai
Note I'm including explicit access to the nvidia devices just in case it's needed when running rootless (it's what I use when I'm running the nvidia/cuda image, so I didn't want to make any changes since I know they provide the access the container needs. For all I know --gpus=all might be enough, but I didn't want to introduce anything different from the known-working command).
Running this command downloads the image from docker hub and runs, resulting in the (expected) error:
PermissionError: [Errno 13] Permission denied: '/data/models'
This was expected due to the uid/gid ownership issue discussed above.
The workaround was to run this once. It can only be run after the image is downloaded and the files/volumes are created:
#!/bin/bash
CONTAINER_ENGINE="podman"
# Podman only: set ownership for user 1000:1000 (appuser) the right way
# this fixes PermissionError: [Errno 13] Permission denied: '/data/models'
if [[ ${CONTAINER_ENGINE} == "podman" ]] ; then
    echo Setting ownership for container\'s appuser on /data and /data/outputs
    podman run \
        --mount type=bind,source="$(pwd)/mounts/invokeai",target=/data \
        --user root --entrypoint "/bin/chown" "invokeai" \
        -R 1000:1000 /data
    podman run \
        --mount type=bind,source="$(pwd)"/mounts/outputs,target=/data/outputs \
        --user root --entrypoint "/bin/chown" "invokeai" \
        -R 1000:1000 /data/outputs
fi
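As an aside, a possible alternative that avoids spinning up throwaway containers just to chown (untested here, assuming the directories live on the host) is podman unshare, which runs a command inside the same rootless user namespace podman uses:

# Map into podman's rootless user namespace and chown the bind-mounted dirs there:
podman unshare chown -R 1000:1000 ./mounts/invokeai ./mounts/outputs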
Now that the mounted directories have the correct ownership, the above podman run command works: the various models are loaded and the web server starts.
The problem though--
$ ./run.sh
Setting ownership for container's appuser on /data and /data/outputs
* Initializing, be patient...
>> Initialization file /data/invokeai.init found. Loading...
>> Internet connectivity is True
>> InvokeAI, version 2.3.1
>> InvokeAI runtime directory is "/data"
>> GFPGAN Initialized
>> CodeFormer Initialized
>> ESRGAN Initialized
>> Using device_type cpu
>> xformers not installed
>> Initializing NSFW checker
CPU, not CUDA.
From the container:
root@invokeai:/usr/src# python3
Python 3.9.16 (main, Feb 9 2023, 05:40:23)
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
False
Also:
root@invokeai:/usr/src# pip list | grep torch
clip-anytorch 2.5.0
pytorch-lightning 1.7.7
torch 1.13.1
torch-fidelity 0.3.0
torchdiffeq 0.2.3
torchmetrics 0.11.1
torchsde 0.2.5
torchvision 0.14.1
root@invokeai:/usr/src# pip list | grep nvidia
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
Contrast with my nvidia/cuda image-based container:
root@invokeai:/usr/src# python3
Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
and
* Initializing, be patient...
>> Initialization file /data/invokeai.init found. Loading...
>> Internet connectivity is True
>> InvokeAI, version 2.3.1
>> InvokeAI runtime directory is "/data"
>> GFPGAN Initialized
>> CodeFormer Initialized
>> ESRGAN Initialized
>> Using device_type cuda
and
root@invokeai:/usr/src# pip list | grep torch
clip-anytorch 2.5.0
pytorch-lightning 1.7.7
torch 1.13.1+cu117
torch-fidelity 0.3.0
torchdiffeq 0.2.3
torchmetrics 0.11.1
torchsde 0.2.5
torchvision 0.14.1+cu117
# pip list | grep nvidia
root@invokeai:/usr/src#
Most obviously, torch/torchvision have cu117 in my version, but for some reason this didn't make it into the upstream container (or it wasn't properly downloaded, if it was supposed to be).
Thoughts?
For a moment I thought perhaps I was just using the wrong tag and getting the cpu version as a result, but trying the :main-cuda tag came up "cpu" as well:
>> Using device_type cpu
If I pip uninstall torch torchvision and then pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu117, I can get the right version of torch in there. But it's not there by default.
root@invokeai:/data# pip list | grep cu117
torch 1.13.1+cu117
torchaudio 0.13.1+cu117
torchvision 0.14.1+cu117
Unfortunately, this didn't work. I also tried installing cu113 (no dice) and the cuda package from nvidia directly in the base image per nvidia's instructions. This didn't work either.
I don't know podman at all, sadly - I've just looked over some of the docs... but this is very curious indeed if the nvidia/cuda-based images work correctly for you, but not an image with torch 1.13.1+cu117 installed. Because as I mentioned above, torch bundles all of the required cuda dependencies and works out of the box - that's been well tested, albeit only in Docker and Kubernetes.
I see you're mapping quite a lot of devices in one of the above commands. Are you certain to be running the invoke image identically to the known working nvidia/cuda?
Curious as to the minimal set of podman arguments required to get the nvidia/cuda image to see the GPU, and whether the same works with the invoke image... does podman run --device nvidia.com/gpu0 work for you? The docs seem to suggest this is the correct way.
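If that --device syntax refers to the CDI route, the host-side setup would look roughly like this (per NVIDIA's container toolkit docs; it assumes a recent toolkit version is installed, so treat this as a sketch rather than a verified recipe):

# Generate a CDI spec for the installed GPUs, then expose them to the container by name:
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable docker.io/invokeai/invokeai:latest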
Thanks for taking a look. Yeah, I'm sure I am running it the same way, as I copy/pasted the run command from the nvidia/cuda run.sh, minus some QT_FONT_DPI cruft I had that I'm pretty sure isn't needed. I literally added an echo to the front of the run command, then used the output of that with the invokeai prebaked image, so I knew I was running it just as run.sh was running it. I also tried prepending GPU_FLAGS=all. It always comes up "cpu" no matter what I seem to do.
Trying with --device nvidia.com/gpu0 returned an Error: stat nvidia.com/gpu0: no such file or directory failure. I hadn't seen this syntax before, and it didn't seem to work.
Curious why the main-cuda tagged image includes versions 1.13.1 and 0.14.1 of torch/torchvision by default rather than the cu117 versions? Again, after running, then manually removing these versions and re-installing with pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu117, they did appear in pip, but again, no dice getting "cuda" to appear when running invokeai --web right afterwards:
appuser@invokeai:/usr/src$ pip list | grep torch
clip-anytorch 2.5.0
pytorch-lightning 1.7.7
torch 1.13.1+cu117
torch-fidelity 0.3.0
torchdiffeq 0.2.3
torchmetrics 0.11.1
torchsde 0.2.5
torchvision 0.14.1+cu117
appuser@invokeai:/usr/src$ python
Python 3.9.16 (main, Feb 9 2023, 05:40:23)
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
False
>>> quit()
I don't know if it's still necessary, but the last time I needed to use graphics acceleration inside a podman container (fedora silverblue, to use an old version of invoke) I had to install the nvidia-container-runtime package (which only has a version for RHEL 8.3, but it works on fedora) on the host system, plus the respective gpu drivers inside the container as well. Also, I had to edit /etc/nvidia-container-runtime/config.toml and set no-cgroups = true.
I don't have an Nvidia GPU anymore, so I don't have anything to test with, but maybe this is the way to go.
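For reference, the config change mentioned above is just flipping one key in /etc/nvidia-container-runtime/config.toml on the host, e.g.:

# Enable no-cgroups so the runtime works without root-managed cgroups (rootless podman):
sudo sed -i 's/^#\?no-cgroups = false/no-cgroups = true/' /etc/nvidia-container-runtime/config.toml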
Well, I have some good news... I got it running from the prebuilt image on podman/rootless! What's nice is that the image is significantly smaller, even with the nvidia drivers installed, than the previous container I was using, and rebuilding will be VERY fast now.
Steps:
- Modify the `Dockerfile` to start `FROM` the upstream prebuilt image. Then apt-get install the `curl` and `kmod` packages, then curl/download and install the Nvidia driver into the container. (The container driver version must match the host driver version.) Then clean up the installer.
- Create a new `build.sh` to determine the correct nvidia driver version, since it will change in the future, then build the image with `buildah bud`, passing in the driver info.
- Create a `run.sh` script which MUST explicitly allow access to the various nvidia devices (or else `nvidia-smi` gives a "can't connect" error). Before every run it makes sure the mounted volumes/binds have correct 1000:1000 ownership for the local user, to avoid `PermissionError: [Errno 13] Permission denied: '/data/models'` podman errors.
I have all of the above working with the latest "3.0.0+a0" version.
Does anyone want this code... or what should I do with it? Since the real work is done in the prebuilt image, these are all very small, simple files. With a $CONTAINER_ENGINE flag, it could be integrated into the source in this repo... but I can also make a dedicated podman_invokeai repository solely for this purpose (it would be maybe 4 files and a README). I don't know if there are other podman users who would be interested or not.
I didn't have to touch /etc/nvidia-container-runtime/ anything, I didn't even have to install the nvidia-container-runtime package on the host at all, though I am running the 525 driver, so I think it's included now.
I don't know if it's still necessary, but the last time I needed to use graphics acceleration inside a podman container (fedora silverblue, to use an old version of invoke) I had to install the nvidia-container-runtime package (which only has a version for RHEL 8.3, but it works on fedora) on the host system, plus the respective gpu drivers inside the container as well. Also, I had to edit /etc/nvidia-container-runtime/config.toml and set no-cgroups = true.
I don't have an Nvidia GPU anymore, so I don't have anything to test with, but maybe this is the way to go.
Great to hear you got it working! If there was a way to run this without installing the nvidia driver into the image, that would be ideal, in my opinion. Generally you really want to be using the driver that is already loaded by the kernel. But perhaps that's a hard limitation due to podman's rootless nature - i'm not sure.
I think your work here is valuable for supporting users who wish to run in a rootless container. Is there a way to do this without maintaining a separate Dockerfile and build/run scripts? Will leave it up to @mauwii to make the call on how to proceed next.
I already addressed a lot of changes (see the 11 unresolved conversations) and made clear that I would not want to remove the caching 😅
Thanks for addressing the changes in the other convos--
You wouldn't have to remove the caching, as podman now runs with the prebaked image (built with caching, linked copy, etc.). My working Dockerfile builds on top of that by adding the NVIDIA drivers. It basically looks like this:
FROM docker.io/invokeai/invokeai:main-cuda
ARG ARCH=x86_64
ARG NVIDIA_VERSION=525.85
USER 0
RUN apt update && apt install -y kmod curl
RUN cd /tmp && curl https://us.download.nvidia.com/XFree86/Linux-${ARCH}/${NVIDIA_VERSION}/NVIDIA-Linux-${ARCH}-${NVIDIA_VERSION}.run -o /tmp/NVIDIA-Linux-${ARCH}-${NVIDIA_VERSION}.run \
&& bash /tmp/NVIDIA-Linux-${ARCH}-${NVIDIA_VERSION}.run --no-kernel-module --no-kernel-module-source --run-nvidia-xconfig --no-backup --no-questions --accept-license --ui=none \
&& rm -f /tmp/NVIDIA-Linux-${ARCH}-${NVIDIA_VERSION}.run \
&& rm -rf /tmp/*
RUN apt remove --purge kmod curl -y && apt-get clean
The build.sh looks something like:
#!/bin/bash
CONTAINER_BUILD="buildah bud"
TAG=latest
if [ -z "$RESOLVE_NVIDIA_VERSION" ]; then
    export NVIDIA_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader)
else
    export NVIDIA_VERSION="${RESOLVE_NVIDIA_VERSION}"
fi
# build from the Dockerfile in the current directory
${CONTAINER_BUILD} -t "invokeai:${TAG}" -t "invokeai" --build-arg ARCH=$(uname -m) --build-arg NVIDIA_VERSION="${NVIDIA_VERSION}" .
As you can see, it passes the current NVIDIA driver version and ARCH to the build command. The container has to match the host, so this may be an issue for making any generic image for rootless.
I did try avoiding installing the nvidia-driver and instead tried using only the nvidia-container-runtime package in the Dockerfile. (Notes for automating that installation are here.)
RUN apt update && apt install gpg curl -y
RUN distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
RUN apt update && apt install -y nvidia-container-runtime
RUN apt remove --purge kmod curl gpg -y && apt-get clean
But this did NOT work in my testing, at least not for me.
The run.sh is pretty much the same, except these --device lines were needed, plus the couple of lines in this PR that verify the user permissions of the mount/bind volumes/directories, which I guess is a rootless thing to make sure that the uid & gid of the user running in the container can access the volumes/directories.
--device /dev/dri --device /dev/input --device /dev/nvidia0 \
--device /dev/nvidiactl --device /dev/nvidia-modeset \
--device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools \
That's basically all I got! Just happy it's working now, and, while I don't know what if anything above can be integrated, hopefully someone else running rootlessly can find value in it.
Hey it's been a few weeks and I'm inclined to close this just to not clutter up the PR area. Is there anything anyone wants from here? I've got podman running locally w/the method outlined above-- one thing I did add recently is:
--env TRANSFORMERS_CACHE=/data/.cache \
to the podman run command, as it wasn't set anywhere else.
I've also noticed I get this:
Server error: [Errno 18] Invalid cross-device link:
When trying to delete an image via the trash can icon in the web UI, because image files apparently can't be moved from /data/outputs/ to /data/.Trash-1000/files/. I had this problem previously. Dunno if it's podman-only or an issue with the mounts or what.
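For what it's worth, Errno 18 (EXDEV) comes from a rename() that crosses filesystems, so my guess is the separate /data and /data/outputs binds are the culprit: the trash folder lives on one mount and the image on the other. A hypothetical workaround (untested) is to bind only the parent directory so everything stays on a single filesystem:

# Hypothetical: single bind for /data so /data/outputs and /data/.Trash-1000
# end up on the same filesystem (drop the separate outputs mount):
GPU_FLAGS=all podman run --rm --name=invokeai --publish=9090:9090 \
  --mount type=bind,source=./mounts/invokeai,target=/data \
  invokeai/invokeai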
@fat-tire I'm going to close this PR as outdated - we'll have it for reference if/when implementing Podman support, as discussed in #3587