Optimized Docker build with support for external working directory

ebr opened this pull request 3 years ago • 53 comments

What this does

  • Adds a new Dockerfile.cloud intended for use with remote deployments. Extensively tested on Linux (x86_64 / CUDA), but not macOS or Windows. (Hence a new Dockerfile.cloud, to avoid surprising users of the current Docker setup.)
  • The Docker image does not bake in the models and is 4 layers deep (2.5GB compressed, mostly PyTorch). It utilizes Docker's build-time cache and multi-stage builds; BuildKit is required for caching (DOCKER_BUILDKIT=1). See the build sketch below.
  • The runtime directory (INVOKEAI_ROOT, which contains models, config, etc.) is expected to be mounted into the container, allowing for seamless upgrades with no data loss.
  • Adds GitHub Actions for automated image building and pushing. No special privileges or secrets are required for this. If this is merged, it will continuously build & push a ghcr.io/invokeai image. GitHub Actions and package storage are free for open-source projects. Because no models are bundled, this is compliant with existing licensing and may be freely publicised and distributed.
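
Roughly, the build boils down to the following (a sketch only; the image tag is illustrative, and BuildKit is what enables the RUN --mount cache layers):

# BuildKit is required for the RUN --mount=type=cache layers to be cached
cd docker-build
DOCKER_BUILDKIT=1 docker build -t local/invokeai:latest -f Dockerfile.cloud ..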

Use this on Runpod

Try this template: https://runpod.io/gsc?template=vm19ukkycf&ref=mk65wpsa (should be self-explanatory - see README :laughing:)

At a high-level:

  • run the pod with an interactive shell to see the runtime directory;
  • stop the pod and run again, this time with the web UI.

Testing/usage locally (Linux only right now!):

The PR includes a Makefile for easy building/running/demo purposes. If desirable, it can easily be rewritten as a shell script or docker-compose file.

  • cd docker-build
  • make build
  • make configure (the usual configuration flow will be executed, including the prompt for HF token)
  • make cli or make web (roughly what these do is sketched after this list)
  • access the web UI on http://localhost:9090
  • examine the ~/invokeai directory which will be populated with the expected INVOKEAI_ROOT contents
  • the location of the INVOKEAI_ROOT may be changed by setting the env var as usual
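
Under the hood, make web amounts to something like the following (a sketch only - the exact flags live in the Makefile and may differ):

# Mount the host runtime dir as INVOKEAI_ROOT so models, config and outputs
# live outside the container and survive image upgrades
docker run --rm -it --gpus all \
    -v ~/invokeai:/mnt/invokeai \
    -e INVOKEAI_ROOT=/mnt/invokeai \
    -p 9090:9090 \
    local/invokeai:latest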

Caveats

Some files in the runtime dir (e.g. outputs) may end up owned by the root user. A fix for this is upcoming; in the meantime, sudo chown -R $(id -u):$(id -g) ~/invokeai can be used to fix ownership.
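
One possible way to avoid the root-owned files altogether (a suggestion only, not wired into the Makefile here) is to run the container as the host user:

# Run as the invoking user so files created under INVOKEAI_ROOT are not root-owned
docker run --user $(id -u):$(id -g) ... local/invokeai:latest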

ebr avatar Nov 24 '22 05:11 ebr

thank you very much for this.

i am a docker noob, is it expected that this fails to run on mac?

 => ERROR [builder 5/5] RUN --mount=type=cache,target=/root/.cache/pip     cp installer/py3.10-linux-x86_64-cuda-r  2.8s
------
 > [builder 5/5] RUN --mount=type=cache,target=/root/.cache/pip     cp installer/py3.10-linux-x86_64-cuda-reqs.txt requirements.txt &&     python3 -m venv /invokeai/.venv &&    pip install --extra-index-url https://download.pytorch.org/whl/cu116         torch==1.12.0+cu116         torchvision==0.13.0+cu116 &&    pip install -r requirements.txt &&    pip install -e .:
#11 1.982 Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu116
#11 2.723 ERROR: Could not find a version that satisfies the requirement torch==1.12.0+cu116 (from versions: 1.10.2, 1.11.0, 1.12.0, 1.12.1, 1.13.0)
#11 2.723 ERROR: No matching distribution found for torch==1.12.0+cu116
------
executor failed running [/bin/sh -c cp installer/py3.10-linux-x86_64-cuda-reqs.txt requirements.txt &&     python3 -m venv ${VIRTUAL_ENV} &&    pip install --extra-index-url https://download.pytorch.org/whl/cu116         torch==1.12.0+cu116         torchvision==0.13.0+cu116 &&    pip install -r requirements.txt &&    pip install -e .]: exit code: 1
make: *** [build] Error 1

damian0815 avatar Nov 24 '22 16:11 damian0815

same here on MacBook Air M1

docker build -t local/invokeai:latest -f Dockerfile.cloud ..
[+] Building 82.0s (12/13)                                                                                                                                                          
 => [internal] load build definition from Dockerfile.cloud                                                                                                                     0.0s
 => => transferring dockerfile: 1.49kB                                                                                                                                         0.0s
 => [internal] load .dockerignore                                                                                                                                              0.0s
 => => transferring context: 513B                                                                                                                                              0.0s
 => [internal] load metadata for docker.io/library/ubuntu:22.04                                                                                                                5.3s
 => [auth] library/ubuntu:pull token for registry-1.docker.io                                                                                                                  0.0s
 => [internal] load build context                                                                                                                                              0.2s
 => => transferring context: 11.49MB                                                                                                                                           0.2s
 => [builder 1/5] FROM docker.io/library/ubuntu:22.04@sha256:4b1d0c4a2d2aaf63b37111f34eb9fa89fa1bf53dd6e4ca954d47caebca4005c2                                                  0.0s
 => [runtime 2/4] RUN apt update && apt install -y     git     curl     ncdu     iotop     bzip2     libglib2.0-0     libgl1-mesa-glx     python3-venv     python3-pip     &  60.1s
 => [builder 2/5] RUN --mount=type=cache,target=/var/cache/apt     apt update && apt install -y     libglib2.0-0     libgl1-mesa-glx     python3-venv     python3-pip         70.3s
 => [runtime 3/4] WORKDIR /invokeai                                                                                                                                            0.0s
 => [builder 3/5] WORKDIR /invokeai                                                                                                                                            0.0s
 => [builder 4/5] COPY . .                                                                                                                                                     0.1s
 => ERROR [builder 5/5] RUN --mount=type=cache,target=/root/.cache/pip     cp installer/py3.10-linux-x86_64-cuda-reqs.txt requirements.txt &&     python3 -m venv /invokeai/.  6.2s
------                                                                                                                                                                              
 > [builder 5/5] RUN --mount=type=cache,target=/root/.cache/pip     cp installer/py3.10-linux-x86_64-cuda-reqs.txt requirements.txt &&     python3 -m venv /invokeai/.venv &&    pip install --extra-index-url https://download.pytorch.org/whl/cu116         torch==1.12.0+cu116         torchvision==0.13.0+cu116 &&    pip install -r requirements.txt &&    pip install -e .:                                                                                                                                                                           
#12 1.939 Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu116                                                                                       
#12 6.123 ERROR: Could not find a version that satisfies the requirement torch==1.12.0+cu116 (from versions: 1.10.2, 1.11.0, 1.12.0, 1.12.1, 1.13.0)                                
#12 6.124 ERROR: No matching distribution found for torch==1.12.0+cu116                                                                                                             
------                                                                                                                                                                              
executor failed running [/bin/sh -c cp installer/py3.10-linux-x86_64-cuda-reqs.txt requirements.txt &&     python3 -m venv ${VIRTUAL_ENV} &&    pip install --extra-index-url https://download.pytorch.org/whl/cu116         torch==1.12.0+cu116         torchvision==0.13.0+cu116 &&    pip install -r requirements.txt &&    pip install -e .]: exit code: 1
make: *** [build] Error 1

mauwii avatar Nov 24 '22 17:11 mauwii

@damian0815 sorry if I already asked this, do you have M1 as well?

mauwii avatar Nov 24 '22 17:11 mauwii

@damian0815 sorry if I already asked this, do you have M1 as well?

yes i do!

damian0815 avatar Nov 24 '22 17:11 damian0815

Then I guess it would be buildable by e.g. doing export DOCKER_DEFAULT_PLATFORM=linux/amd64 before executing make (or inline: DOCKER_DEFAULT_PLATFORM=linux/amd64 make), but it would still be far from a usable container (the other one, from my Dockerfile, is IMHO also unusable at 10s/it :D )

mauwii avatar Nov 24 '22 18:11 mauwii

@ebr unfortunately this fails on runpod - which is to be expected, in fact:

2022-11-24T18:05:08.066546044Z 
2022-11-24T18:05:08.066549603Z Welcome to InvokeAI. This script will help download the Stable Diffusion weight files
2022-11-24T18:05:08.066553593Z and other large models that are needed for text to image generation. At any point you may interrupt
2022-11-24T18:05:08.066557663Z this program and resume later.
2022-11-24T18:05:08.066561453Z 
2022-11-24T18:05:08.066565033Z ** INITIALIZING INVOKEAI RUNTIME DIRECTORY **
2022-11-24T18:05:08.066568873Z Select a directory in which to install InvokeAI's models and configuration files [/root/invokeai]: 
2022-11-24T18:05:08.066572953Z A problem occurred during download.
2022-11-24T18:05:08.066576713Z The error was: "EOF when reading a line"
2022-11-24T18:05:22.525579453Z * Initializing, be patient...
2022-11-24T18:05:22.525611853Z >> Initialization file /root/.invokeai found. Loading...
2022-11-24T18:05:22.525618103Z >> InvokeAI runtime directory is "/invokeai"
2022-11-24T18:05:22.525622723Z ## NOT FOUND: GFPGAN model not found at /invokeai/models/gfpgan/GFPGANv1.4.pth
2022-11-24T18:05:22.525626983Z >> GFPGAN Disabled
2022-11-24T18:05:22.525631023Z ## NOT FOUND: CodeFormer model not found at /invokeai/models/codeformer/codeformer.pth
2022-11-24T18:05:22.525635143Z >> CodeFormer Disabled
2022-11-24T18:05:22.525639303Z >> ESRGAN Initialized
2022-11-24T18:05:22.525656382Z 
2022-11-24T18:05:22.525660922Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2022-11-24T18:05:22.525665452Z    You appear to have a missing or misconfigured model file(s).                   
2022-11-24T18:05:22.525669622Z    The script will now exit and run configure_invokeai.py to help fix the problem.
2022-11-24T18:05:22.525673672Z    After reconfiguration is done, please relaunch invoke.py.                      
2022-11-24T18:05:22.525677652Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2022-11-24T18:05:22.525681522Z configure_invokeai is launching....
2022-11-24T18:05:22.525685442Z 
2022-11-24T18:05:22.525689152Z Loading Python libraries...
2022-11-24T18:05:22.525692992Z 
2022-11-24T18:05:22.525696592Z Welcome to InvokeAI. This script will help download the Stable Diffusion weight files
2022-11-24T18:05:22.525700742Z and other large models that are needed for text to image generation. At any point you may interrupt
2022-11-24T18:05:22.525704782Z this program and resume later.
2022-11-24T18:05:22.525708562Z 
2022-11-24T18:05:22.525712112Z ** INITIALIZING INVOKEAI RUNTIME DIRECTORY **
2022-11-24T18:05:22.525716062Z Select a directory in which to install InvokeAI's models and configuration files [/root/invokeai]: 
2022-11-24T18:05:22.525720062Z A problem occurred during download.
2022-11-24T18:05:22.525723852Z The error was: "EOF when reading a line"

on runpod there's no way to provide pre-populated persistent storage, and if the docker image itself doesn't actually launch an openssh server, there's no way to connect to it to do that after the image is launched.

and at the moment there's no way to configure InvokeAI from the web server alone. this is, i think, something we need to add, but it is not yet on the roadmap.

damian0815 avatar Nov 24 '22 18:11 damian0815

normally you should do the setup of the local volume via a sidecar container (when working in K8s, which runpod sounds like)

mauwii avatar Nov 24 '22 18:11 mauwii

@ebr to make this usable for invoke as-is you'd need to apt install openssh-server when building the image and then run something like this as the run script for the image:

#!/bin/bash

echo "pod started"

# if runpod was given a public key via $PUBLIC_KEY, start sshd so the pod
# can be reached over ssh; otherwise skip it
if [[ -n "$PUBLIC_KEY" ]]
then
    mkdir -p ~/.ssh
    chmod 700 ~/.ssh
    echo "$PUBLIC_KEY" >> ~/.ssh/authorized_keys
    chmod 600 ~/.ssh/authorized_keys
    service ssh start
    echo "ssh server listening for connections"
else
    echo "no \$PUBLIC_KEY set, not running ssh server"
fi

# keep the web UI alive: restart it whenever it exits
while true; do
    conda activate # or whatever needs to be done to activate the env
    python scripts/invoke.py --web --host 0.0.0.0 --root "$ROOT"
    echo "invoke.py exited with exit code $?, sleeping for 10 seconds then trying again"
    sleep 10
done

damian0815 avatar Nov 24 '22 18:11 damian0815

normally you should do the setup of the local volume via a sidecar container (when working in K8s, which runpod sounds like)

besides that, you could mount a blob storage or network storage which has been preloaded (but I like the sidecar way more, since I like automation ;P )

mauwii avatar Nov 24 '22 18:11 mauwii

normally you should do the setup of the local volume via a sidecar container (when working in K8s, which runpod sounds like)

i do not think runpod exposes any such thing as a "sidecar container"

Screenshot 2022-11-24 at 19 16 49

the "docker command" field is for the command that runs inside the docker image, eg

Screenshot 2022-11-24 at 19 17 09

damian0815 avatar Nov 24 '22 18:11 damian0815

normally you should do the setup of the local volume via a sidecar container (when working in K8s, which runpod sounds like)

besides that, you could mount a blob storage or network storage which has been preloaded (but I like the sidecar way more, since I like automation ;P )

is network storage feasible given the amount of data involved? the /workspace they give you is actual physical media tied to the machine on which the image is instantiated. when you kick off a pod you actually get tied permanently to a particular machine (which can become a problem at peak times, when you may have no GPUs available on that machine and can no longer resume a stopped instance), but it means that once you've populated your 10GB of model files or whatever, subsequent image startups are instant.

i mean, i have images i've kept mounted (but stopped) for over a month now, just because they have like 15 GB of model files on them and i don't want to sit around re-downloading all of that again

damian0815 avatar Nov 24 '22 18:11 damian0815

normally you should do the setup of the local volume via a sidecar container (when working in K8s, which runpod sounds like)

i do not think runpod exposes any such thing as a "sidecar container"

a description of what sidecar containers are: https://www.containiq.com/post/kubernetes-sidecar-container

I do not know runpod; I have only used Azure Kubernetes Service and minikube, where a pod is the instance of a running container, while the deployment itself is much bigger than a pod configuration and is usually done via one or more yaml files (manifests)

mauwii avatar Nov 24 '22 18:11 mauwii

normally you should do the setup of the local volume via a sidecar container (when working in K8s, which runpod sounds like)

besides that, you could mount a blob storage or network storage which has been preloaded (but I like the sidecar way more, since I like automation ;P )

is network storage feasible given the amount of data involved? the /workspace they give you is actual physical media tied to the machine on which the image is instantiated. when you kick off a pod you actually get tied permanently to a particular machine (which can become a problem at peak times, when you may have no GPUs available on that machine and can no longer resume a stopped instance), but it means that once you've populated your 10GB of model files or whatever, subsequent image startups are instant.

i mean, i have images i've kept mounted (but stopped) for over a month now, just because they have like 15 GB of model files on them and i don't want to sit around re-downloading all of that again

depending on where and how you want to host the container, a network storage can of course be a valid option xD (I don't see the difference between using a blob storage, a network storage, or any other kind of persistent storage that needs to be mounted into the container....)

mauwii avatar Nov 24 '22 18:11 mauwii

the screenshot above is the entirety of the configuration options i have access to using the runpod web ui

damian0815 avatar Nov 24 '22 19:11 damian0815

like I said, this is only a pod configuration. I think you did not look into the description of what a sidecar container is, so TL;DR: it is another pod which can e.g. run before the "main pod" runs, and which could then e.g. run the configure_invokeai.py --yes command to create the necessary models on a persistent storage.

mauwii avatar Nov 24 '22 19:11 mauwii

<math lady.gif>

well it turns out that i can in fact start the instance, stop it, change the docker image, and start it again. is that what you mean? i don't think i have any other way of sharing persistent storage between two different docker images.

so you're suggesting i should make another dockerfile which builds a python environment with the dependencies for configure_invokeai.py (which are basically the same as those for InvokeAI as a whole, except pytorch can probably run on CPU), which runs configure_invokeai.py to download the models. then i stop it, switch out the dockerfile to @ebr's, then start it again?

damian0815 avatar Nov 24 '22 20:11 damian0815

@ebr in fact it turns out i would like to be able to do this:

Screen Shot 2022-11-24 at 21 45 09

but it seems your dockerfile ignores this and runs invoke.py instead. i read about this somewhere, something about setting ENTRYPOINT vs setting CMD ..?
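
from what i read, the gist is roughly this (a sketch only - paths are illustrative, not the actual Dockerfile.cloud contents):

# CMD alone is just a default: runpod's "docker command" field replaces it
# wholesale, so e.g. "sleep infinity" would actually run
CMD ["python3", "scripts/invoke.py", "--web"]

# ENTRYPOINT always runs; the "docker command" field only replaces CMD and is
# appended to the entrypoint as arguments - so "sleep infinity" just becomes
# bogus arguments to invoke.py
ENTRYPOINT ["python3", "scripts/invoke.py"]
CMD ["--web"]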

damian0815 avatar Nov 24 '22 20:11 damian0815

What I suggested is to read https://www.containiq.com/post/kubernetes-sidecar-container ;P

mauwii avatar Nov 24 '22 20:11 mauwii

What I suggested is to read https://www.containiq.com/post/kubernetes-sidecar-container ;P

i did but it all went over my head. hence math-lady

damian0815 avatar Nov 24 '22 21:11 damian0815

@mauwii @damian0815 thank you both for the review and testing! Let me see if I can cover all comments:

  • I would not expect it to work on Mac in the current state. The current Dockerfile explicitly goes for the Linux/CUDA install only, but I think it should be possible to make it flexible enough. I'll try to come up with some suggestions, since I don't have an M1 to test on myself.
  • understood about the openssh issue on Runpod. I think there's a way to work around this. The Dockerfile provides a multi-stage build; we can add another stage that bundles the openssh server and sets up the volume, and the actual "worker" container then uses the volume. This might work, I will try it (see the sketch after this list). @damian0815 your script is a helpful start, thanks. You're on the right track re: sharing a volume between 2 containers; if that's indeed possible on Runpod, then it should work. And yes, ENTRYPOINT needs to be changed.
  • "container tied to a specific machine" - that sounds problematic... How did you manage to keep multiple containers in a stopped state if the machines might go away? Do you use spot machines or only on-demand?
  • on K8s it works quite differently. I've come up with a way to rehydrate local node storage from S3 using a separate service, and mount it as a hostPath volume into the pods. A sidecar approach works, but is inefficient. And yeah, network storage (especially Azure blob) is way too slow - it needs to be local. But anyway, that K8s code is nowhere near ready to be shared, so let's leave it for later! (If that sounds like gibberish, don't worry about it, we'll get to it another time :D)
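
A rough sketch of what such an ssh stage could look like (stage and file names here are hypothetical, not actual Dockerfile.cloud contents):

# hypothetical extra stage on top of the existing multi-stage build
FROM runtime AS ssh-runtime
RUN apt update && apt install -y openssh-server
# a startup script along the lines of @damian0815's suggestion above
COPY docker-build/entry-with-ssh.sh /usr/local/bin/entry-with-ssh.sh
ENTRYPOINT ["/usr/local/bin/entry-with-ssh.sh"]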

I'll sink my teeth into Runpod now and report back.

ebr avatar Nov 25 '22 04:11 ebr

Well, after the K8s trainings I attended, I cannot say that sidecars are inefficient. Of course they can be, but when used cleverly they can be very efficient and save you a lot of work ;)

for example, to check if updated code is available and pull it if so, or to create backups on the fly... many reasons to use sidecars :P

mauwii avatar Nov 25 '22 04:11 mauwii

I only mean they are inefficient for this specific application (maintaining an external, shared model cache that needs to exist beyond the lifecycle of a pod), because if you have multiple pods using the same template, then the sidecars will experience race conditions and unnecessarily thrash both remote and local storage. This is much better handled by a daemonset. But yes, sidecars certainly have many great uses! In case of invokeai running on k8s, a sidecar might be useful for periodically syncing an instance's output results to S3, given each instance is running in its own pod. Depends on the design. I think with the upcoming changes to the API this will need some re-thinking. But this is quite offtopic for the current PR (I'd love to continue the convo elsewhere/another time though!)

ebr avatar Nov 25 '22 05:11 ebr

So in my current container I mount a volume to /data, where all of the models and outputs are stored; I don't see a reason why this volume should not be shareable between more pods. You can of course also mount a storage which is preloaded with the models and not use the sidecar, but the advantage of the sidecar updating the models when invokeai gets another update is pretty obvious to me 🙈

mauwii avatar Nov 25 '22 05:11 mauwii

so, (again, offtopic :wink: but i can't help being baited by k8s talk): consider a case where multiple application pods are running on a multi-GPU node, and each such pod includes the sidecar. Yes, all pods will have access to the hostPath volume, but all these sidecars will 1) try to sync and trip over each other, and 2) waste compute cycles doing so. This design benefits much better from using a DaemonSet where a single "cache manager" pod runs per node, has a lifecycle independent of the application pods, and doesn't require modifications to the application's deployment. Your approach works, but can be problematic for sharing the model cache across deployments. It's fine for having a dedicated external cache per pod, which is I think the use case you're describing. :slightly_smiling_face:
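
To make that concrete, the per-node "cache manager" could be a DaemonSet along these lines (purely illustrative and not part of this PR - the image, bucket and paths are made up):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: model-cache-manager        # hypothetical
spec:
  selector:
    matchLabels:
      app: model-cache-manager
  template:
    metadata:
      labels:
        app: model-cache-manager
    spec:
      containers:
        - name: sync
          image: amazon/aws-cli    # e.g. to rehydrate the cache from S3
          command: ["sh", "-c", "while true; do aws s3 sync s3://my-models /cache; sleep 300; done"]
          volumeMounts:
            - name: cache
              mountPath: /cache
      volumes:
        - name: cache
          hostPath:
            path: /var/lib/invokeai-models    # the same hostPath the application pods mount
            type: DirectoryOrCreate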

ebr avatar Nov 25 '22 05:11 ebr

Why would I include the sidecar in the pod - the sidecar is a separate pod, called via a cronjob?! And why imagine another application than invoke-ai? But I agree that it depends on the configuration of your manifest whether the sidecar is effective or inefficient. I could e.g. even have a sidecar with a completely different base image than the main pod, so I do not understand most of these arguments 🤔 And a DaemonSet is helpful when I have something that needs to run a container on every node, while a deployed manifest could be limited to e.g. only aarch64 nodes (or SSD nodes); maybe you have only one of them, while the others are amd64 (or HDD nodes), and then a DaemonSet would be pretty unnecessary. And even if your application scaled up to run 5 pods to be able to answer all requests, the sidecar would only need to run in one single pod, not 5... But of course it always depends on the use case what makes sense and what doesn't.

since we are here in an invoke-ai PR, I like to think about the use case for a sidecar used with the invokeai manifest ;P

and no, I mean having one storage for persistence, maybe per node, but not per pod.

mauwii avatar Nov 25 '22 05:11 mauwii

a sidecar is, by definition, a container that runs inside the same pod next to the "primary" container :) (but there's nothing that inherently makes a container a "sidecar" or "primary" - they are equal from the pod design standpoint). If you're talking about a cronjob, then no, it's not a sidecar (plus you'd need to deal with some anti-affinity rules to ensure that only one pod of the cronjob runs on each node, to avoid conflicts).

We can perhaps discuss this elsewhere. If you have any k8s questions, I'll be glad to answer them :)

it might be easier to discuss once I push my k8s manifests for deploying a complete solution. But I'm thinking of packaging them as a Helm chart first, for flexibility.

(and yes, by "application" I only mean InvokeAI; just used to talking in general terms when designing infrastructure)

ebr avatar Nov 25 '22 05:11 ebr

image

https://www.containiq.com/post/kubernetes-sidecar-container

Share the same pod? How can one pod have more than one entrypoint? No main pods? I wonder why you name your pods in the manifest, then ....

Well, nevermind ....

mauwii avatar Nov 25 '22 06:11 mauwii

Share the same pod? How can one pod have more than one entrypoint?

In brief: a pod is comprised of one or more containers. Each container has its own entrypoint, args, environment, etc. Containers share the network namespace, and they may cross-mount volumes that are defined in the pod spec. Multiple containers run in a pod; usually one is designated as the main workload container, and the others are "sidecars".
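
As a concrete example of the pattern (purely illustrative - the image names, bucket and paths are made up):

apiVersion: v1
kind: Pod
metadata:
  name: invokeai    # hypothetical
spec:
  volumes:
    - name: invokeai-root
      emptyDir: {}
  containers:
    # the "main" workload container
    - name: invokeai
      image: local/invokeai:latest
      env:
        - name: INVOKEAI_ROOT
          value: /invokeai
      volumeMounts:
        - name: invokeai-root
          mountPath: /invokeai
    # the "sidecar": its own image, entrypoint and env, sharing the pod's
    # network namespace and volumes - here e.g. syncing outputs to S3
    - name: output-sync
      image: amazon/aws-cli
      command: ["sh", "-c", "while true; do aws s3 sync /invokeai/outputs s3://my-bucket/outputs; sleep 60; done"]
      volumeMounts:
        - name: invokeai-root
          mountPath: /invokeai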

Anyway, like I said, I don't think this is the place to argue about this. With all due respect, please feel free to check my LinkedIn if you're not yet fully convinced I know what I'm talking about, and I'd be more than happy to offer you a tutorial elsewhere if you have any questions at all about k8s stuff :smile:

...

With that out of the way: I just added one commit, and the latest image works wonderfully on Runpod (including initializing and using external storage mounted at /workspace, as @damian0815 suggested). I'll do a writeup tomorrow because it's late and I'm wiped :sleeping:

ebr avatar Nov 25 '22 06:11 ebr

this sounds rad @ebr and i can't wait to try it

re: "tied to a specific mean" i think i remember that they said that, architecturally, they reserve the right to uplift a /workspace folder from one machine and plonk it down in another. if they have indeed done this on one of my stopped instances i have not noticed.

damian0815 avatar Nov 25 '22 11:11 damian0815

@damian0815 I also updated the PR description, but: https://runpod.io/gsc?template=vm19ukkycf&ref=mk65wpsa - give this template a try (see README). I put it through some paces, and it's basically a 2-click process now :laughing: I think if the config script can be made fully non-interactive (see https://github.com/invoke-ai/InvokeAI/issues/1536), then this can even be made into a one-click endeavour. Also, I see great possibilities in rehydrating the /workspace directly from S3 or other cloud object stores they support. Anyhow, let me know what you think!

uplift a /workspace folder from one machine and plonk it down in another

okay, that makes total sense, and is actually great, I was a bit concerned about that.

ebr avatar Nov 25 '22 13:11 ebr