buildkit icon indicating copy to clipboard operation
buildkit copied to clipboard

Wrong dockerfile used when building two images in the same context folder

Open MichaelKim0407 opened this issue 4 years ago • 32 comments

When building different images with different dockerfiles in the same context, the wrong dockerfile will be used (probably due to caching problems).

I created a minimal reproduction here. Please see README in that repo.

MichaelKim0407 avatar Feb 19 '20 01:02 MichaelKim0407

Interesting. Buildkit uses same method as rsync, based on file metadata, for file transfers. Because you do fast git clone in your test script the files (often) have the same timestamp (with the rest of the metadata) that is causing this.

Eg. you can change your test script to

#!/bin/bash

rm -rf .test
git clone .git .test
cd .test

mkdir foo
rsync a/Dockerfile foo/
rsync b/Dockerfile foo/
cat foo/Dockerfile

for the same behavior.

Need to think about this.

tonistiigi avatar Feb 19 '20 06:02 tonistiigi

Thanks for the explanation. So essentially, the underlying copying method BuildKit is using (i.e. rsync) thinks they are the same file because it only checked for timestamp and file size?

For context, git clone is a realistic scenario because I discovered this problem in CI builds.

In this case I think it's fair to ask why BuildKit is trying to copy apparently different files into the same destination?

I played around the test script a little bit, and it seems that if I change the build path to an absolute path (/path/to/.test/ instead of .), the error will go away. But if I change both commands to use absolute paths, the error reappears.

I have no idea of BuildKit's implementation, but it makes me speculate:

  • BuildKit is rsyncing the dockerfile and ignore file into a temporary folder before building
  • The path of the temporary folder is deterministic, so subsequent builds can reuse assets
  • The path calculation takes the literal value of build path into consideration

If my speculations are true, I would make the following suggestions:

  • This has nothing to do with the bug itself, but wouldn't it make sense to always calculate the temporary path from absolute paths?
  • When rsyncing the dockerfile and ignore file, check absolute_path(--file) instead, because if the value is different, we know for sure they are different files, and rsyncing them into the same destination doesn't make sense.

Of course these are my speculations, as I haven't looked into BuildKit's source code yet.

Although I'm not a Go developer, I'll be happy to contribute if you can confirm these and think the changes will be simple enough. Please let me know how you want to proceed!

MichaelKim0407 avatar Feb 19 '20 17:02 MichaelKim0407

I played around the test script a little bit, and it seems that if I change the build path to an absolute path (/path/to/.test/ instead of .), the error will go away. But if I change both commands to use absolute paths, the error reappears.

Yes, there are cases where cli side will copy the Dockerfile before sending. This would trigger new timestamp and this issue doesn't appear.

Although this seems like a simple fix it's not really a good idea to always force a copy like this on the cli side as it would make Dockerfile an exceptional case. What if build needs another file (dockerignore, or something new later), it would hit the same issue.

We could make it possible for the frontend to choose what algorithm to use for specific sources. Eg. dockerfile is always small so wouldn't be wasteful to always copy it, or do a full crypto checksum over data.

The transfer logic itself lives in https://github.com/tonistiigi/fsutil repository.

tonistiigi avatar Feb 21 '20 00:02 tonistiigi

I'm not quite following.. Are files always sent to the same folder, no matter what is being built? Or are they sent to a temporary folder that has some re-use mechanism?

MichaelKim0407 avatar Feb 21 '20 01:02 MichaelKim0407

They are sent to folders when they can be reused after a build has completed. Frontend can control the index mechanism for different files, eg. Dockerfiles go to difference place than build context. There is also separation based on workdir for different project not to collide and make rsync meaningless (don't remember if docker cli implements this).

tonistiigi avatar Feb 21 '20 02:02 tonistiigi

Just to be clear, frontend is the client (docker cli), and backend is the builder (BuildKit)?

And frontend decides where files go to, and backend decides how to copy the files?

(Sorry I'm not a docker developer.. I'm not sure what index mechanism means here)

I know BuildKit is experimental but I'm using BuildKit specifically for the Dockerfile.dockerignore behavior which I find very useful for building multiple images in the same project. All I know about docker builds is that files are being sent somewhere, so that builds don't actually run with the source files.

I would say from an end user perspective that, while making things fast is nice, reliability is more important for building images. I don't expect my image to break because the build process somehow used wrong files.

Thanks for the explanation, though!

MichaelKim0407 avatar Feb 21 '20 02:02 MichaelKim0407

And frontend decides where files go to, and backend decides how to copy the files?

Frontend is a high-level builder component. Eg. dockerfiles are built with Dockerfile frontend, buildpacks with buildpack frontend. Buildkit core is lower-level API shared by frontends. Frontends are daemon side and may run in containers.

tonistiigi avatar Feb 21 '20 02:02 tonistiigi

Ok, makes sense. Thank you for your patience.

Although I'd very much love to help, it appears that I don't know the process well enough to be able to contribute anything meaningful. So I'll stop bothering you with questions..

Right now my workaround (for anyone else who runs into this) is using whitespaces and comments to make sure that different files don't have the same size

MichaelKim0407 avatar Feb 21 '20 16:02 MichaelKim0407

This hit me aswell in a CI environment. At least I think this was the reason. I've changed a file in repository that's ignored in .dockerignore and then I was building many images with docker-compose. The result was an image that was named service-a but used the dockerfile of service-b. That can be really dangerous as it might not be obvious and not produce errors until the image is started and does strange/unexpected things.

WolfspiritM avatar Oct 01 '20 15:10 WolfspiritM

@tonistiigi As a workaround this issue we've been doing docker builder prune in between builds where Dockerfiles have same size and timestamp (but different contents). However this slows down builds and eliminates caching.

Can you think of a nicer way to work around this? I suppose I could force the Dockerfiles to have all different sizes and timestamps but would there be an easier work around?

russellcardullo avatar Nov 20 '20 23:11 russellcardullo

Hitting this one as well. Becomes specially severe when using monolithic repositories with many applications where the likelihood of this happening increases exponentially :(

nocive avatar Nov 23 '20 17:11 nocive

Can you think of a nicer way to work around this? I suppose I could force the Dockerfiles to have all different sizes and timestamps but would there be an easier work around?

Would also love to hear about cleaner ways to work around this for the time being. I've tried virtually everything, including using contextual .dockerignore file for each app with something like:

**/Dockerfile
!/this/app/Dockerfile

But that also doesn't seem to do be taken into account when the copy of the cached dockerfile happens.

Ensuring each Dockerfile has a unique filesize is error prone and does not guarantee the issue can't occur anyway when the dockerfiles are changed by multiple developers switching in between different branches.

nocive avatar Nov 26 '20 08:11 nocive

For anyone interested, a more elegant way of working around the issue for the time being is to pass your dockerfile contents via stdin with cat Dockerfile | docker build -f - ., in which case the cache issue will not occur.

Unfortunately, such a solution is not suitable if you have no way to intercept your docker build wrapper (eg: docker-compose).

cc @russellcardullo

nocive avatar Dec 03 '20 14:12 nocive

To be honest, as i also found that workaround of streamlining the Dockerfile in moby/moby#41743 , i also found a way to hijack compose with this pseudo code, to factorise in the meantime

service=mysuperservice
# with yq3
buildargs=$(docker-compose config\
    |docker run -i --rm mikefarah/yq:3 yq r - services.${service}.build.args -j\
    |docker run -i --rm imega/jq ".|to_entries[]|\" --build-arg \(.key)=\(.value)\"" -j)
# with latest yq
buildargs=$(docker-compose config\
    |docker run -i --rm mikefarah/yq e ".services.${service}.build.args" - -j\
    |docker run -i --rm imega/jq ".|to_entries[]|\" --build-arg \(.key)=\(.value)\"" -j)

cat Dockerfile | docker build -t mysuperimage $buildargs -f - .

kiorky avatar Dec 03 '20 14:12 kiorky

I hit this on day two of a new project, trying to setup builds for multiple different linux distros for testing. I have a setup something like so:

ubuntu1604/Dockerfile
ubuntu1804/Dockerfile
ubuntu2004/Dockerfile

all of which are identical except for the FROM line, which (because it varies by the tag on Canonical's images) are all the same length.

I suspect this is a pretty common pattern for those who are using containers to do multi-platform testing.

We could make it possible for the frontend to choose what algorithm to use for specific sources. Eg. dockerfile is always small so wouldn't be wasteful to always copy it, or do a full crypto checksum over data.

I would welcome an option like this. Something to the effect of --syncmode=checksum. The raw data going into an image is often very small, and hashing it is easily worth the cost. I would argue for checksum as the default and timestamp + size (call it --syncmode=stat?) as an option for those who want to avoid hashing very large files.

sirosen avatar Dec 22 '20 21:12 sirosen

I find this bug highly annoying, because it messes up the cache in a very unobvious way and is hard to debug. Timestamp are obviously messed up by git, but there is not much to do in a CI environment. Any chance this could by resolved?

petrzjunior avatar Jan 06 '21 08:01 petrzjunior

@petrzjunior it doesn't seem this is getting the attention it deserves.

I would say that more than annoying, this is deeply problematic as it can lead to building the completely wrong application artifact. Would also argue this isn't such an uncommon use case or unlikely to happen if you're dealing with multi-app repositories since one can easily end up with Dockerfiles with the same filesize even if they have different contents and, like you stated, timestamps are messed up by git which also leads to all files having the same timestamp.

What we ended up doing in some cases was marking problematic files (those that share the same build context) with a well known string (eg: BUILDKIT_ISSUE_1368), iterate over all the found files and give them a unique timestamp, ie:

files="$(git grep -l BUILDKIT_ISSUE_1368 '**/Dockerfile')"
d=$(date +%s)
i=0
for file in $files; do
    i=$(( i + 1 ))
    #echo "patching $file"
    touch -d @$(( d + i )) "$file"
done

nocive avatar Jan 20 '21 13:01 nocive

I stumble upon this bug today, didn't realize that the build was actually done with the wrong Dockerfile (and wrong base image - wrong arch) until I pull and run it. Here is my repo gh action run with the problem here and here

Agree with @petrzjunior that this is annoying and could easily overlooked at, since the build was a success. +1 on @sirosen idea of option to use checksum, @nocive idea is nice one too.

Regards,

Martinus

martinussuherman avatar Apr 06 '21 11:04 martinussuherman

Seems COMPOSE_DOCKER_CLI_BUILD=1 is now the default behavior for docker-compose, which makes you wonder how many more people will start seeing this. cc @shin- @aiordache

nocive avatar Apr 09 '21 08:04 nocive

In my opinion this can even result in a security problem if an internal image and an external image are build at the same time resulting in the data from the internal image being published to the external place (as a simple example two containers hosting internal and external documentation with two seperate Dockerfiles that are build at the same time). I moved away from buildkit again cause of this issue cause I need to be sure that the images are actually the ones specified via Dockerfile.

WolfspiritM avatar Apr 09 '21 08:04 WolfspiritM

Thanks for fixing this! Highly appreciated.

Can we already benefit from this fix or is there some follow-up needed downstream in docker?

Experience report: We have a monorepo where many Docker images are built with the full monorepo as context and very similar Dockerfiles, so we were running into this from time to time. Most of our tests run before building the Docker images so they do not catch this. It is caught by a smoke test, but at that point in time, the wrong Docker images are already pushed to various registries, so we are in a pretty bad situation which currently requires manual intervention. Unfortunately this has led us loose some trust in Docker.

Toxaris avatar Mar 24 '22 08:03 Toxaris

@Toxaris I believe this should be fixed in docker engine 20.10.14, since it updates buildx to 0.8.1 (buildx 0.8.0 ships buildkit fdecd0ae108b according to release notes https://github.com/docker/buildx/releases/tag/v0.8.0).

Not sure about earlier docker versions but it doesn't seem like it, since 20.10.13 still ships buildx 0.6.1, which I believe does not include that fix yet.

nocive avatar Mar 31 '22 08:03 nocive

I'm still able to reproduce this bug on Docker 20.10.16 with buildx 0.8.2 using @MichaelKim0407's https://github.com/MichaelKim0407/buildkit-wrong-dockerfile:

ubuntu@ip-10-12-49-184:/tmp/buildkit-wrong-dockerfile$ docker info | egrep '(Server Version:|buildx:)'
  buildx: Docker Buildx (Docker Inc., v0.8.2-docker)
 Server Version: 20.10.16
ubuntu@ip-10-12-49-184:/tmp/buildkit-wrong-dockerfile$ ls -l {a,b}/Dockerfile
-rw-rw-r-- 1 ubuntu ubuntu 58 May 13 22:35 a/Dockerfile
-rw-rw-r-- 1 ubuntu ubuntu 58 May 13 22:35 b/Dockerfile
ubuntu@ip-10-12-49-184:/tmp/buildkit-wrong-dockerfile$ diff -u {a,b}/Dockerfile
--- a/Dockerfile        2022-05-13 22:35:49.293369884 +0000
+++ b/Dockerfile        2022-05-13 22:35:49.293369884 +0000
@@ -1,4 +1,4 @@
 FROM busybox
 WORKDIR /src
-COPY ./a/test.txt ./
+COPY ./b/test.txt ./
 CMD find .
ubuntu@ip-10-12-49-184:/tmp/buildkit-wrong-dockerfile$ DOCKER_BUILDKIT=1 docker build -f a/Dockerfile .
[+] Building 0.3s (8/8) FINISHED
 => [internal] load build definition from Dockerfile                                                                            0.0s
 => => transferring dockerfile: 110B                                                                                            0.0s
 => [internal] load .dockerignore                                                                                               0.0s
 => => transferring context: 2B                                                                                                 0.0s
 => [internal] load metadata for docker.io/library/busybox:latest                                                               0.0s
 => [1/3] FROM docker.io/library/busybox                                                                                        0.0s
 => [internal] load build context                                                                                               0.0s
 => => transferring context: 66B                                                                                                0.0s
 => CACHED [2/3] WORKDIR /src                                                                                                   0.0s
 => CACHED [3/3] COPY ./a/test.txt ./                                                                                           0.0s
 => exporting to image                                                                                                          0.0s
 => => exporting layers                                                                                                         0.0s
 => => writing image sha256:c1a136a7b84e3f3d73b69888b9c4cbf11b59f5228bdc84b0ea8d8d47b4f99926                                    0.0s
ubuntu@ip-10-12-49-184:/tmp/buildkit-wrong-dockerfile$ DOCKER_BUILDKIT=1 docker build -f b/Dockerfile .
[+] Building 0.1s (7/7) FINISHED
 => [internal] load build definition from Dockerfile                                                                            0.0s
 => => transferring dockerfile: 111B                                                                                            0.0s
 => [internal] load .dockerignore                                                                                               0.0s
 => => transferring context: 2B                                                                                                 0.0s
 => [internal] load metadata for docker.io/library/busybox:latest                                                               0.0s
 => [1/3] FROM docker.io/library/busybox                                                                                        0.0s
 => [internal] load build context                                                                                               0.0s
 => => transferring context: 2B                                                                                                 0.0s
 => CACHED [2/3] WORKDIR /src                                                                                                   0.0s
 => ERROR [3/3] COPY ./a/test.txt ./                                                                                            0.0s
------
 > [3/3] COPY ./a/test.txt ./:
------
failed to compute cache key: "/a/test.txt" not found: not found

This is a real bummer in our situation where we are building a number of similar images out of a git repo (many Dockerfiles w/ same timestamp) where the one element of the Dockerfile that is different is say a ENV RUBY_VERSION 2.7.6 vs. ENV RUBY_VERSION 3.1.2 so you get the wrong version installed. 😬

idleyoungman avatar May 13 '22 22:05 idleyoungman

@tonistiigi isn't this supposed to be fixed? Could you please provide some info here?

nocive avatar May 16 '22 08:05 nocive

I can confirm: this is not fixed in latest Docker Desktop for Mac (Docker Desktop 4.8.2 (79419), engine version 20.10.14).

docker builder prune -f works as a workaround

podlesh avatar May 31 '22 09:05 podlesh

Please reopen /cc @thaJeztah @tonistiigi

nocive avatar Jun 02 '22 11:06 nocive

The issue tracker in this repository tracks issues to be fixed in the master/main branch of this repository (which contains the linked fix); looking at the linked PR, https://github.com/moby/buildkit/pull/2081, it looks like the version of BuildKit used in Docker 20.10 does not (yet) have that fix, but perhaps @tonistiigi has insight if it's safe to backport (or best left for the upcoming 22 release of docker engine)

thaJeztah avatar Jun 02 '22 11:06 thaJeztah

@thaJeztah tyvm for your reply!

I was actually convinced the patch had already made it into one of the 20.10.x versions, that's why I was asking for the issue to be reopened.

if it's safe to backport

I believe you've asked that question at least twice in the past and got no answer.

The lack of importance given to this issue is frustrating as it's quite a severe problem, even if the probability of being affected by it is somewhat low. It was originally reported in February 2020 and more than 2 years later there's still no mainstream docker version that includes the fix.

Can we maybe have more eyeballs here and get some additional insights as to when will this actually ship with docker?

nocive avatar Jun 03 '22 09:06 nocive

ping

nocive avatar Jul 13 '22 08:07 nocive

ran into this as well in our build process... what an odd one to track down

mpietras avatar Aug 31 '22 20:08 mpietras