Avoid re-pushing large image files on `cog push`
Since code and weights are stored in the same Docker layer, making small changes can result in really long push times when model weights are in the hundreds of megabytes or gigabytes.
One potential solution could be a `build.model_files` option in `cog.yaml` that saves those files in a separate layer before the code is added.
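For illustration, the option might look something like this (a sketch only — `build.model_files` is hypothetical, not something cog supports today):

```yaml
build:
  python_version: "3.8"
  python_packages:
    - torch==1.8.1
  # Hypothetical: files matched here would be copied into their own
  # Docker layer before the rest of the source directory
  model_files:
    - "weights/*.pth"
predict: "predict.py:Predictor"
```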
Build times are also affected by large weights files. Any change in the code or dependencies requires all files in the working directory (including giant ones) to be re-added to the image.
Perhaps we should automatically add any files >1GB to the image before installing system dependencies and Python packages? That should make both builds and pushes faster since the layer containing the large weights file will be cached both locally and remotely.
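The generated Dockerfile would then order its layers roughly like this (a hand-written sketch, not cog's actual output; file names are illustrative):

```dockerfile
FROM python:3.8
# Large weight file copied first, so its layer is cached across code changes
COPY weights.pth /src/weights.pth
# System dependencies and Python packages next
RUN pip install torch==1.8.1
# Frequently-changing code last. Note the catch-all COPY would need to
# somehow exclude weights.pth again, which is the complication discussed below
COPY . /src
WORKDIR /src
```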
Semi-related: https://github.com/replicate/cog/issues/209
I think specifying the model files in the configuration works well and could maybe even allow for some other functionality, such as a "fork" of a project that only changes its `build.model_files` but uses the same `Predictor`.
I also agree that a simple fix in the meantime would be to just detect the bigger files and add those automatically. Could also do lookup via suffix (e.g. `.pt`, `.pkl`, `.pth`) or check the date-modified attribute on the file.
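A suffix-based lookup could be as simple as this (a sketch; the suffix list and size cutoff are arbitrary):

```bash
# List likely weight files by extension, keeping only the large ones
find . -type f \( -name '*.pt' -o -name '*.pkl' -o -name '*.pth' \) -size +100M
```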
I like the idea of having a pre/post build stage for model copying (or just detecting `ckpt`/etc, like above). Specifying `build.model_files` also seems fairly ergonomic and friendly for new people.
In terms of implementation, Docker itself doesn't deal very well with symlinks, otherwise we could just make and break symlinks when we needed to build the images. (I tried to do `ln -s ~/.cache/huggingface ./models`, but it seems to be ignored.)
Thoughts on configuration vs convention? Picking a directory name like `models` and using some of the built-in Dockerfile functionality, like `COPY [^models]/*`, seems like a low-effort solution, but it might also break in cases where people have their own implementations with a `models` directory.
A potential quick fix discussed with @anotherjesse IRL last week: when generating the Dockerfile, find all the files larger than, say, 100MB and add those in a layer before `COPY . /src`. That'll mean the last layer is small and will be the only thing that changes if you change source code.
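As a sketch, the Dockerfile generator could emit something like this (the 100MB threshold and paths are illustrative):

```bash
# Emit one COPY line per large file, then the catch-all COPY for the code
find . -type f -size +100M | while read -r f; do
  f="${f#./}"   # strip the leading "./" from find's output
  echo "COPY $f /src/$f"
done
echo "COPY . /src"
```

As the next comment points out, though, this naive version runs straight into `COPY`'s lack of an exclude option.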
In discussion with @andreasjansson this morning, it appears the lack of `exclude` capabilities in Dockerfile `COPY` will complicate the "quick fix" (@bfirsh, unless you know of any magic we can sprinkle on):

```dockerfile
COPY big-file /src/big-file
COPY . /src
```

This will result in `big-file` being included in both `COPY` layers.
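One quick way to confirm the duplication (assuming the image is tagged `final`):

```bash
# Each COPY produces its own layer; both will report roughly the size of big-file
docker history --format '{{.Size}}\t{{.CreatedBy}}' final:latest
```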
We discussed ideas around building in multiple stages (and switching `.dockerignore` between the first and second build):
`Dockerfile`:

```dockerfile
ARG BASE_IMAGE
FROM $BASE_IMAGE
COPY . /src
```
`build.sh`:

```bash
#!/bin/bash
set -o errexit
set -o xtrace

echo "Build the base + weights"
BASE_IMAGE="alpine:3.7"

echo "ignore all the small files, copy big files"
find . -type f -size -10M > .dockerignore
docker build --build-arg BASE_IMAGE=$BASE_IMAGE -t base .
BASE_ID=$(docker inspect base:latest --format='{{index .Id}}')

echo "ignore all the big files, copy small files"
find . -type f -size +10M > .dockerignore
docker build --build-arg BASE_IMAGE=$BASE_ID -t final .
FINAL_ID=$(docker inspect final:latest --format='{{index .Id}}')

echo "Final image: $FINAL_ID"
```
- docker build 1
  - `.dockerignore`: all small files
  - run `docker build`
- docker build 2
  - `.dockerignore`: all large files
  - run `docker build` with build 1's image as the base
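One caveat with this sketch: it clobbers any `.dockerignore` the project already has. A small guard could preserve it (a sketch using a bash exit trap):

```bash
# Back up a pre-existing .dockerignore and restore it when the script exits
if [ -f .dockerignore ]; then cp .dockerignore .dockerignore.bak; fi
trap 'if [ -f .dockerignore.bak ]; then mv .dockerignore.bak .dockerignore; else rm -f .dockerignore; fi' EXIT
```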
Running it results in 1GB uploaded to the first docker build and 26kB to the second:
```
$ ./build.sh
+ echo 'Build the base + weights'
Build the base + weights
+ BASE_IMAGE=alpine:3.7
+ find . -type f -size -10M
+ docker build --build-arg BASE_IMAGE=alpine:3.7 -t base .
Sending build context to Docker daemon 1.024GB
Step 1/3 : ARG BASE_IMAGE
Step 2/3 : FROM $BASE_IMAGE
 ---> 6d1ef012b567
Step 3/3 : COPY . /src
 ---> 8f98949d7815
Successfully built 8f98949d7815
Successfully tagged base:latest
++ docker inspect base:latest '--format={{index .Id}}'
+ BASE_ID=sha256:8f98949d781581c33b61b513e6885a9adc70b7570264ec8b8b70b654f4690ff7
+ find . -type f -size +10M
+ docker build --build-arg BASE_IMAGE=sha256:8f98949d781581c33b61b513e6885a9adc70b7570264ec8b8b70b654f4690ff7 -t final .
Sending build context to Docker daemon 26.11kB
Step 1/3 : ARG BASE_IMAGE
Step 2/3 : FROM $BASE_IMAGE
 ---> 8f98949d7815
Step 3/3 : COPY . /src
 ---> 698b74501826
Successfully built 698b74501826
Successfully tagged final:latest
++ docker inspect final:latest '--format={{index .Id}}'
+ FINAL_ID=sha256:698b7450182635aa75892684277e47897bcb6b7de5c464f356be5d2b83b07d9b
+ echo 'Final image: sha256:698b7450182635aa75892684277e47897bcb6b7de5c464f356be5d2b83b07d9b'
Final image: sha256:698b7450182635aa75892684277e47897bcb6b7de5c464f356be5d2b83b07d9b
```
Re-running seems to work with cache!
```
$ ./build.sh
+ echo 'Build the base + weights'
Build the base + weights
+ BASE_IMAGE=alpine:3.7
+ find . -type f -size -10M
+ docker build --build-arg BASE_IMAGE=alpine:3.7 -t base .
Sending build context to Docker daemon 1.024GB
Step 1/3 : ARG BASE_IMAGE
Step 2/3 : FROM $BASE_IMAGE
 ---> 6d1ef012b567
Step 3/3 : COPY . /src
 ---> Using cache
 ---> 8f98949d7815
Successfully built 8f98949d7815
Successfully tagged base:latest
++ docker inspect base:latest '--format={{index .Id}}'
+ BASE_ID=sha256:8f98949d781581c33b61b513e6885a9adc70b7570264ec8b8b70b654f4690ff7
+ find . -type f -size +10M
+ docker build --build-arg BASE_IMAGE=sha256:8f98949d781581c33b61b513e6885a9adc70b7570264ec8b8b70b654f4690ff7 -t final .
Sending build context to Docker daemon 26.11kB
Step 1/3 : ARG BASE_IMAGE
Step 2/3 : FROM $BASE_IMAGE
 ---> 8f98949d7815
Step 3/3 : COPY . /src
 ---> Using cache
 ---> 698b74501826
Successfully built 698b74501826
Successfully tagged final:latest
++ docker inspect final:latest '--format={{index .Id}}'
+ FINAL_ID=sha256:698b7450182635aa75892684277e47897bcb6b7de5c464f356be5d2b83b07d9b
+ echo 'Final image: sha256:698b7450182635aa75892684277e47897bcb6b7de5c464f356be5d2b83b07d9b'
Final image: sha256:698b7450182635aa75892684277e47897bcb6b7de5c464f356be5d2b83b07d9b
```
Making a change to our small file:

```
$ date >> small
```
If it uses the cache on the first docker build, this might be a solution:
```
$ ./build.sh
+ echo 'Build the base + weights'
Build the base + weights
+ BASE_IMAGE=alpine:3.7
+ find . -type f -size -10M
+ docker build --build-arg BASE_IMAGE=alpine:3.7 -t base .
Sending build context to Docker daemon 1.024GB
Step 1/3 : ARG BASE_IMAGE
Step 2/3 : FROM $BASE_IMAGE
 ---> 6d1ef012b567
Step 3/3 : COPY . /src
 ---> Using cache
 ---> b657bb157e9e
Successfully built b657bb157e9e
Successfully tagged base:latest
++ docker inspect base:latest '--format={{index .Id}}'
+ BASE_ID=sha256:b657bb157e9e138ef48786a414600e5c6d5980077b3c1ad11159c04d27eb73c2
+ find . -type f -size +10M
+ docker build --build-arg BASE_IMAGE=sha256:b657bb157e9e138ef48786a414600e5c6d5980077b3c1ad11159c04d27eb73c2 -t final .
Sending build context to Docker daemon 26.11kB
Step 1/3 : ARG BASE_IMAGE
Step 2/3 : FROM $BASE_IMAGE
 ---> b657bb157e9e
Step 3/3 : COPY . /src
 ---> 9b0b60ed5f24
Successfully built 9b0b60ed5f24
Successfully tagged final:latest
++ docker inspect final:latest '--format={{index .Id}}'
+ FINAL_ID=sha256:9b0b60ed5f247609e36520f4262deead2376492b414596f39733c6a3b7476e4b
+ echo 'Final image: sha256:9b0b60ed5f247609e36520f4262deead2376492b414596f39733c6a3b7476e4b'
Final image: sha256:9b0b60ed5f247609e36520f4262deead2376492b414596f39733c6a3b7476e4b
```
Yay, it used the cache:

First run:

```
Successfully built b657bb157e9e  # base image + weights
```

Second run, after changing `small`:

```
Successfully built b657bb157e9e  # base image + weights
```
Forgive me for bumping this, but is there a plan for this? I assume this would also reduce some burden on Replicate infra too.
Similar case here, and so a similar problem. I sometimes use Replicate to build quick ideas, because I don't have access to a local GPU that can run the kind of stuff I'm building. But it's annoying to have to wait something like 10 minutes on each code change when I can't really test those changes locally. Some `build.model_files` option would have worked for me.