Avoid re-pushing large image files on `cog push`

Open andreasjansson opened this issue 2 years ago • 3 comments

Since code and weights are stored in the same Docker layer, making small changes can result in really long push times when model weights are in the hundreds of megabytes or gigabytes.

One potential solution could be a build.model_files option in cog.yaml that saves those files in a separate layer before the code is added.

andreasjansson avatar Oct 22 '21 11:10 andreasjansson

Build times are also affected by large weights files. Any change in the code or dependencies requires all files in the working directory (including giant ones) to be re-added to the image.

Perhaps we should automatically add any files >1GB to the image before installing system dependencies and Python packages? That should make both builds and pushes faster since the layer containing the large weights file will be cached both locally and remotely.
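The detection step could be sketched roughly as below. This is only an illustration, not how cog works: list_large_files is a hypothetical helper, and the demo uses a 1k threshold on a throwaway directory to stand in for the 1GB cutoff suggested above.

```shell
#!/bin/bash
# Sketch only: list_large_files is a hypothetical helper, not part of cog.

# Print files under DIR that are larger than THRESHOLD.
# THRESHOLD uses find's -size units, e.g. 1G, 100M or 10k.
list_large_files() {
  local dir="$1" threshold="$2"
  (cd "$dir" && find . -type f -size "+${threshold}" | sort)
}

# Demo on a throwaway directory, with a 1k threshold standing in for 1G.
demo_dir=$(mktemp -d)
head -c 4096 /dev/zero > "$demo_dir/weights.bin"
echo "print('hello')" > "$demo_dir/predict.py"
list_large_files "$demo_dir" 1k
rm -rf "$demo_dir"
```

Against a real project the call would be something like list_large_files . 1G, and the resulting paths are the ones that would need to go into an early, rarely changing layer.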

andreasjansson avatar May 09 '22 07:05 andreasjansson

Semi-related: https://github.com/replicate/cog/issues/209

zeke avatar May 09 '22 17:05 zeke

I think specifying the model files in the configuration works well and could maybe even allow for some other functionality such as a "fork" of a project that only changes its build.model_files but uses the same Predictor.

I also agree that a simple fix in the meantime would be to just detect the bigger files and add those automatically. We could also do a lookup by suffix (e.g. .pt, .pkl, .pth) or check the date-modified attribute on the file.
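A suffix lookup might look something like the sketch below. list_weight_files is a hypothetical helper, and the suffix list (.pt, .pth, .pkl) is an assumption that would need adjusting per project.

```shell
#!/bin/bash
# Sketch only: list_weight_files is a hypothetical helper, not part of cog.

# Print files under DIR whose names end in a common weights suffix.
# The suffix list is an assumption; extend it as needed.
list_weight_files() {
  local dir="$1"
  (cd "$dir" && find . -type f \
    \( -name '*.pt' -o -name '*.pth' -o -name '*.pkl' \) | sort)
}

# Demo on a throwaway directory.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/checkpoints"
touch "$demo_dir/model.pt" "$demo_dir/checkpoints/last.pth" "$demo_dir/predict.py"
list_weight_files "$demo_dir"
rm -rf "$demo_dir"
```

Suffix matching is cheap, but it misses weights stored under other names, so it probably works best combined with a size check.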

afiaka87 avatar May 13 '22 05:05 afiaka87

I like the idea of having a pre/post build stage for model copying (or just detecting ckpt/etc, like above). Specifying build.model_files also seems fairly ergonomic and friendly for new people.

In terms of implementation, Docker itself doesn't deal very well with symlinks, otherwise we could just make and break symlinks when we needed to build the images. (I tried to do ln -s ~/.cache/huggingface ./models, but it seems to be ignored.)

Thoughts on configuration vs convention? Picking a directory name like 'models' and using some of the built-in Dockerfile functionality, like COPY [^models]/*, seems like a low-effort solution, but it might also break in cases where people have their own implementations with a 'models' directory.
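One caveat on the pattern itself: as far as I know, Dockerfile COPY wildcards follow Go's filepath.Match rules, where [^models] is a character class matching a single character other than m, o, d, e, l or s, not "any directory that isn't named models". Shell globs treat bracket expressions in a comparable way (spelled [!...] in POSIX sh), so a small sketch can show the behaviour. matches_negated_class is a made-up name for illustration.

```shell
#!/bin/bash
# Sketch only: shows how a negated bracket expression matches names.

# Succeeds if NAME matches the one-character negated class [!models].
matches_negated_class() {
  case "$1" in
    [!models]) return 0 ;;
    *) return 1 ;;
  esac
}

for name in models data x m; do
  if matches_negated_class "$name"; then
    echo "$name: matched"
  else
    echo "$name: not matched"
  fi
done
```

Only the single-character name x matches here, so a convention-based exclude would likely need .dockerignore or an explicit file list rather than a COPY pattern.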

JosephCatrambone avatar Feb 01 '23 17:02 JosephCatrambone

A potential quick fix discussed with @anotherjesse IRL last week: when generating the Dockerfile, find all the files larger than, say, 100MB and add those in a layer before COPY . /src.

That'll mean the last layer is small and will be the only thing that changes if you change source code.

bfirsh avatar Feb 02 '23 03:02 bfirsh

In discussion with @andreasjansson this morning, it appears the lack of exclude capabilities in Dockerfile COPY will complicate the "quick fix" (@bfirsh, unless you know of any magic we can sprinkle on).

COPY big-file /src/big-file
COPY . /src

will result in big-file being included in both COPY layers.

We discussed ideas around building with multiple stages (and switching .dockerignore between the first and second layer):

Dockerfile

ARG BASE_IMAGE
FROM $BASE_IMAGE

COPY . /src

build.sh

#!/bin/bash

set -o errexit
set -o xtrace

echo "Build the base + weights"

BASE_IMAGE="alpine:3.7"

echo "ignore all the small files, copy big files"
find . -type f -size -10M > .dockerignore
docker build --build-arg BASE_IMAGE=$BASE_IMAGE -t base .

BASE_ID=$(docker inspect base:latest --format='{{index .Id}}')

echo "ignore all the big files, copy small files"
find . -type f -size +10M > .dockerignore
docker build --build-arg BASE_IMAGE=$BASE_ID -t final .
FINAL_ID=$(docker inspect final:latest --format='{{index .Id}}')

echo "Final image: $FINAL_ID"

  • docker build 1
    • .dockerignore: all small files
    • run docker build on the original base image
  • docker build 2
    • .dockerignore: all large files
    • run docker build with the image from build 1 as the base

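One detail worth checking in the two find calls: at least with GNU find, sizes are rounded up to whole units before comparing, so -size -10M only matches files of 9 MiB or less. A file between 9 and 10 MiB would then land in neither list and be copied by both builds. A sketch that sidesteps this by using complementary predicates follows; write_ignore_lists and the output file names are hypothetical.

```shell
#!/bin/bash
# Sketch only: write_ignore_lists is a hypothetical helper, not part of cog.

# Split the files under DIR into two lists at THRESHOLD (find -size units).
# The lists are written to OUT, which should live outside DIR.
write_ignore_lists() {
  local dir="$1" threshold="$2" out="$3"
  # At or below the threshold: ignore these when building the weights image.
  (cd "$dir" && find . -type f ! -size "+${threshold}" | sort) > "$out/ignore-small.txt"
  # Above the threshold: ignore these when building the code image.
  (cd "$dir" && find . -type f -size "+${threshold}" | sort) > "$out/ignore-large.txt"
}

# Demo with a 10k threshold standing in for 10M.
src_dir=$(mktemp -d)
out_dir=$(mktemp -d)
echo "small" > "$src_dir/small"
head -c 10240 /dev/zero > "$src_dir/edge"   # exactly at the threshold
head -c 20480 /dev/zero > "$src_dir/big"
write_ignore_lists "$src_dir" 10k "$out_dir"
cat "$out_dir/ignore-small.txt" "$out_dir/ignore-large.txt"
rm -rf "$src_dir" "$out_dir"
```

Because the second predicate is the exact negation of the first, every file ends up in exactly one list, whichever way the find implementation rounds.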
Running it results in about 1 GB being sent to the first docker build and 26 kB to the second:

$ ./build.sh 
+ echo 'Build the base + weights'
Build the base + weights
+ BASE_IMAGE=alpine:3.7
+ find . -type f -size -10M
+ docker build --build-arg BASE_IMAGE=alpine:3.7 -t base .
Sending build context to Docker daemon  1.024GB
Step 1/3 : ARG BASE_IMAGE
Step 2/3 : FROM $BASE_IMAGE
 ---> 6d1ef012b567
Step 3/3 : COPY . /src
 ---> 8f98949d7815
Successfully built 8f98949d7815
Successfully tagged base:latest
++ docker inspect base:latest '--format={{index .Id}}'
+ BASE_ID=sha256:8f98949d781581c33b61b513e6885a9adc70b7570264ec8b8b70b654f4690ff7
+ find . -type f -size +10M
+ docker build --build-arg BASE_IMAGE=sha256:8f98949d781581c33b61b513e6885a9adc70b7570264ec8b8b70b654f4690ff7 -t final .
Sending build context to Docker daemon  26.11kB
Step 1/3 : ARG BASE_IMAGE
Step 2/3 : FROM $BASE_IMAGE
 ---> 8f98949d7815
Step 3/3 : COPY . /src
 ---> 698b74501826
Successfully built 698b74501826
Successfully tagged final:latest
++ docker inspect final:latest '--format={{index .Id}}'
+ FINAL_ID=sha256:698b7450182635aa75892684277e47897bcb6b7de5c464f356be5d2b83b07d9b
+ echo 'Final image: sha256:698b7450182635aa75892684277e47897bcb6b7de5c464f356be5d2b83b07d9b'
Final image: sha256:698b7450182635aa75892684277e47897bcb6b7de5c464f356be5d2b83b07d9b

Re-running seems to work with cache!

$ ./build.sh 
+ echo 'Build the base + weights'
Build the base + weights
+ BASE_IMAGE=alpine:3.7
+ find . -type f -size -10M
+ docker build --build-arg BASE_IMAGE=alpine:3.7 -t base .
Sending build context to Docker daemon  1.024GB
Step 1/3 : ARG BASE_IMAGE
Step 2/3 : FROM $BASE_IMAGE
 ---> 6d1ef012b567
Step 3/3 : COPY . /src
 ---> Using cache
 ---> 8f98949d7815
Successfully built 8f98949d7815
Successfully tagged base:latest
++ docker inspect base:latest '--format={{index .Id}}'
+ BASE_ID=sha256:8f98949d781581c33b61b513e6885a9adc70b7570264ec8b8b70b654f4690ff7
+ find . -type f -size +10M
+ docker build --build-arg BASE_IMAGE=sha256:8f98949d781581c33b61b513e6885a9adc70b7570264ec8b8b70b654f4690ff7 -t final .
Sending build context to Docker daemon  26.11kB
Step 1/3 : ARG BASE_IMAGE
Step 2/3 : FROM $BASE_IMAGE
 ---> 8f98949d7815
Step 3/3 : COPY . /src
 ---> Using cache
 ---> 698b74501826
Successfully built 698b74501826
Successfully tagged final:latest
++ docker inspect final:latest '--format={{index .Id}}'
+ FINAL_ID=sha256:698b7450182635aa75892684277e47897bcb6b7de5c464f356be5d2b83b07d9b
+ echo 'Final image: sha256:698b7450182635aa75892684277e47897bcb6b7de5c464f356be5d2b83b07d9b'
Final image: sha256:698b7450182635aa75892684277e47897bcb6b7de5c464f356be5d2b83b07d9b

Making a change to our small file:

$ date >> small 

If it uses the cache on the first docker build, this might be a solution:

$ ./build.sh 
+ echo 'Build the base + weights'
Build the base + weights
+ BASE_IMAGE=alpine:3.7
+ find . -type f -size -10M
+ docker build --build-arg BASE_IMAGE=alpine:3.7 -t base .
Sending build context to Docker daemon  1.024GB
Step 1/3 : ARG BASE_IMAGE
Step 2/3 : FROM $BASE_IMAGE
 ---> 6d1ef012b567
Step 3/3 : COPY . /src
 ---> Using cache
 ---> b657bb157e9e
Successfully built b657bb157e9e
Successfully tagged base:latest
++ docker inspect base:latest '--format={{index .Id}}'
+ BASE_ID=sha256:b657bb157e9e138ef48786a414600e5c6d5980077b3c1ad11159c04d27eb73c2
+ find . -type f -size +10M
+ docker build --build-arg BASE_IMAGE=sha256:b657bb157e9e138ef48786a414600e5c6d5980077b3c1ad11159c04d27eb73c2 -t final .
Sending build context to Docker daemon  26.11kB
Step 1/3 : ARG BASE_IMAGE
Step 2/3 : FROM $BASE_IMAGE
 ---> b657bb157e9e
Step 3/3 : COPY . /src
 ---> 9b0b60ed5f24
Successfully built 9b0b60ed5f24
Successfully tagged final:latest
++ docker inspect final:latest '--format={{index .Id}}'
+ FINAL_ID=sha256:9b0b60ed5f247609e36520f4262deead2376492b414596f39733c6a3b7476e4b
+ echo 'Final image: sha256:9b0b60ed5f247609e36520f4262deead2376492b414596f39733c6a3b7476e4b'
Final image: sha256:9b0b60ed5f247609e36520f4262deead2376492b414596f39733c6a3b7476e4b

Yay, it used the cache:

First run:

Successfully built b657bb157e9e. # base image + weights

Second run, after changing small:

Successfully built b657bb157e9e. # base image + weights

anotherjesse avatar Feb 06 '23 16:02 anotherjesse

Forgive me for bumping this, but is there a plan for it?

I assume this would also reduce some of the burden on Replicate's infra.

asadm avatar Mar 19 '23 18:03 asadm

Similar case here, and so a similar problem.

I sometimes use Replicate to build quick ideas, because I don't have access to a local GPU that can run the kind of stuff I'm building.

But it's annoying to have to wait for like 10 minutes on each code change while I cannot really test those changes locally.

Some build.model_files thing would have worked for me.

kirillrogovoy avatar May 09 '23 07:05 kirillrogovoy