buildkit
empty layers cause images built from cache to have different image config
The issue can be reproduced using the following repo: https://github.com/amrmahdi/buildkit-cache-issue
Further discussion from Slack: https://dockercommunity.slack.com/archives/C7S7A40MP/p1658360288913549?thread_ts=1656592512.773189&cid=C7S7A40MP
Re-transcription below, just in case, since Slack message history is limited:
Tyler Kalbach
Hello all, I've been having a strange issue with cache invalidation when using docker on our Jenkins worker. Our Jenkins worker is an EC2 instance with docker 20.10 installed. When building the docker images I am using the DOCKER_BUILDKIT=1, --build-arg BUILDKIT_INLINE_CACHE=1, and --cache-from flags. This will work perfectly for 30+ builds, but then randomly the cache will get invalidated when the files being copied have not changed (same git commit) and the base image has not changed. I have checked every file in the context to make sure something from the context wasn't causing the invalidation, and have confirmed that only relevant, unchanged files are in the context. The only other thing I can think of that could be causing the issue is that the docker build operation is being run on our Jenkins workers, and two or more "docker build" commands can run simultaneously. Could that be the issue, or is there something else I could be missing?
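(For context, a rough sketch of the kind of invocation described above; the image name and tag are placeholders, not the actual pipeline's:)

# BUILDKIT_INLINE_CACHE=1 embeds cache metadata into the pushed image so a
# later build, possibly on another worker, can reuse it via --cache-from.
DOCKER_BUILDKIT=1 docker build \
  --build-arg BUILDKIT_INLINE_CACHE=1 \
  --cache-from registry.example.com/team/app:latest \
  -t registry.example.com/team/app:latest \
  .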
jed
Does it look quite random? I wonder if it might have something to do with a more up-to-date base image having been pulled, depending on how often you see the problem
Tyler Kalbach
It does appear to be really random. I thought it could be something with the base image as well but layers before the copy command were cached correctly, and I confirmed the base image sha was identical between a previous build which successfully used the cache and the subsequent build that did not use the cache.
jed
ah ok :slightly_smiling_face: what does the COPY command look like? i wonder if something is in the context occasionally, like maybe a log file or something from an earlier stage?
Tyler Kalbach
# syntax=docker/dockerfile:1.4
FROM docker-prod.artifactory.aws.athenahealth.com/athenahealth/dsml/athena-datasci_ai-python-base:master
RUN poetry config virtualenvs.path /opt/venvs
COPY --link shared /opt/repo/builds/shared
ARG BUILD_DIR
COPY --link $BUILD_DIR/pyproject.toml $BUILD_DIR/poetry.lock /opt/repo/builds/$BUILD_DIR/
I've seen the first COPY get invalidated, as well as the first be cached successfully and the second get invalidated. Entirely random. (edited)
Tyler Kalbach
BUILD_DIR remains the same for each build.
jed
hm, i wonder if this could be a problem with the --link option
jed
it is quite new, so it's possible it could be a bug
Tyler Kalbach
I was experiencing the issue before and after adding the --link option. I just added the link option to see if it would help.
Does the cache hashing care about the directory's full path? For example, let's say I'm in directory /tmp/hash-1/ and I build from this directory, then I move the files to /tmp/hash-2/ and build the same image from this directory, will the cache get invalidated? Never mind, just tested that and it doesn't matter. (edited)
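(A quick sketch of the test described above; the tag is a placeholder, and the second build's steps should be reported as CACHED:)

# build from the original directory, copy the context elsewhere, rebuild
DOCKER_BUILDKIT=1 docker build -t cache-path-test /tmp/hash-1/
cp -r /tmp/hash-1/ /tmp/hash-2/
DOCKER_BUILDKIT=1 docker build -t cache-path-test /tmp/hash-2/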
Tyler Kalbach
For another data point: once the cache gets randomly invalidated, the next few builds have the same issue.
Tyler Kalbach
@jed If an image is fully built from cache, is there a scenario where the repoDigest of the built image would change from the previously built version of the image that the cache was taken from? My expectation would be that if the image was built entirely from the previous cached image, then the repoDigest should remain unchanged, but I am not seeing that in some of my builds, while in others it does hold.
Tyler Kalbach
I am using ECR for storing the images, so if this is unexpected behavior I can reach out to AWS to see if they have any processes in place that could be the culprit
jed
I'd expect that the image digests don't change - at the very least the layer digests shouldn't be changing
Tyler Kalbach
thanks; I'm definitely seeing the repoDigest change even though the image was entirely built from cache
Amr Mahdi
I'm also seeing something similar happen, but for me it is consistent. I understand why the cache is invalidated but don't understand what led to it. Here is what happens: all the layers are pulled from cache, and I can confirm that all the layer digests are exactly the same as in the original built image. However, the config digest is different .... And when I compare the config from the original built image and the rebuilt image using the cache, the only difference is that the created entry of a few of the layers under the history entry is different. Something like the following image; haven't been able to repro it locally unfortunately :( What is interesting though is that this only happens between build 0 and 1; subsequent builds will produce identical results as build 1. @jed does it ring any bell?
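(A hedged sketch of how such a comparison can be made; the tool and image references are illustrative, not what the pipeline actually uses:)

# fetch the image config of both builds and diff them; in the case described
# above only the history[].created timestamps should differ
skopeo inspect --config docker://registry.example.com/app:build0 > config0.json
skopeo inspect --config docker://registry.example.com/app:build1 > config1.json
diff config0.json config1.json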
jed
Aha, you learn something new every day :slightly_smiling_face: I think this is expected. https://github.com/moby/buildkit/blob/master/source/containerimage/pull.go#L343 indicates that we pull the cache key for image sources from the config object in the manifest - which, since it contains the diff id hashes for each layer, also changes based on layer changes. If the created entry is different, I would guess that the base image has been rebuilt when it didn't need to be; likely the base image's caching setup is slightly off? What base image are you using? (I think it makes sense that you don't see this after the build where it changed, since after that the newer changed image would be in the buildkit cache.) (edited)
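(Continuing the sketch above: the config is content-addressed, so any byte difference, including a history timestamp, gives a different config digest even when the layer diff IDs match. Illustrative commands on the two saved config files:)

sha256sum config0.json config1.json               # different digests
jq '.rootfs.diff_ids' config0.json config1.json   # identical layer diff IDs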
Amr Mahdi
The base image has been rebuilt, yes, but it was fully rebuilt from the cache and the image digest in build 0 and build 1 matches exactly, so I do not see how this would cause the leaf image to change, when even the leaf image's layers are built from cache
Amr Mahdi
Also, the base image is an image that we build in the CI pipeline
Amr Mahdi
@tonistiigi any guidance on how to debug this? Even if it requires adding custom debug code
tonistiigi
is there a reproducer?
tonistiigi
there was an issue with handling empty layer records in the history array when cache was imported. we removed the empty-layer-removal feature to fix this in v0.10
Amr Mahdi
Unfortunately it happens only in a complex CI pipeline with large images (4 GB+), so I'm unable to create a simple repro. I'm actually running with the latest buildkit.
Amr Mahdi
@tonistiigi I was able to create a simple repro. You can find the details here https://github.com/amrmahdi/buildkit-cache-issue
tonistiigi
The case seems to be 2 empty layers after each other
tonistiigi
history array on first image
"created": "2022-07-20T23:33:05.302279427Z", "created_by": "RUN /bin/sh -c mkdir /src # buildkit", "comment": "buildkit.dockerfile.v0" }, { "created": "2022-07-20T23:33:05.311102635Z", "created_by": "COPY /src/ /src/ # buildkit", "comment": "buildkit.dockerfile.v0" }, { "created": "2022-07-20T23:33:05.31756351Z", "created_by": "WORKDIR /code", "comment": "buildkit.dockerfile.v0" }
tonistiigi
second
{ "created": "2022-07-20T23:33:05.302279427Z", "created_by": "RUN /bin/sh -c mkdir /src # buildkit", "comment": "buildkit.dockerfile.v0" }, { "created": "2022-07-20T23:33:05.31756351Z", "created_by": "COPY /src/ /src/ # buildkit", "comment": "buildkit.dockerfile.v0" }, { "created": "2022-07-20T23:33:05.31756351Z", "created_by": "WORKDIR /code", "comment": "buildkit.dockerfile.v0" }
tonistiigi
Open up an issue with these details. we can take a look at whether anything can be done for it, but it isn't a high priority because it is kind of a double error from the user, as empty layers are pointless to start with.
tonistiigi
what is happening (without looking at the details in the code) is that buildkit realizes that these empty layers have equal state and the empty blob for your WORKDIR command is imported for the COPY command. because they are identical, only one blob is exported and therefore the timestamp carries over as well (edited)
Amr Mahdi
I understand the Dockerfile is meaningless; I tried to repro this from a complex graph of Dockerfiles dependent on each other. And this happens to an intermediate Dockerfile, which causes the caches of all Dockerfiles depending on it to be invalidated.
tonistiigi
If you can modify the repro to not have these equal layers then it would be something different
tonistiigi
you can check that by just looking at the rootfs array in the image
tonistiigi
current
"rootfs": { "type": "layers", "diff_ids": [ "sha256:c7412c2a678786efc84fe4bd0b1a79ecca47457b0eb4d4bceb1f79d6b4f75695", "sha256:81211b8de33f8d415c90bd3ebed173494fd16af55d5e6822ddf6c88c755391ea", "sha256:e96f169af1b3dfd640674cca738e3adca0539c4f06dd4f7d5b5d3bfe3e4f6fd7", "sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef", "sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef" ] },
Amr Mahdi
I understand that. But as an author of the Dockerfile, I'd not expect that an empty layer would ruin my cache ...
tonistiigi
it doesn’t “ruin” your cache. even in your example the layers all get imported correctly. the only thing that goes missing is that identical empty layers will also have identical timestamps in the history after import (edited)
tonistiigi
it is still a bug
Amr Mahdi
When this image is rebuilt and has a different digest, the cache of any image building from it will be invalidated.
tonistiigi
not with buildkit
tonistiigi
buildkit tracks all the layers separately and all your layers are correct
Amr Mahdi
Any tips on how to avoid creating two such empty layers when authoring Dockerfiles? In some cases it may not be obvious at all.
tonistiigi
an empty or duplicated layer doesn’t provide anything to the image. basically it has to be a mistake in a Dockerfile
Amr Mahdi
Opened https://github.com/moby/buildkit/issues/2973
I'm not entirely sure if this is caused by the empty layer, the duplicate layer, or both acting together.
For the cache layer, the two layers are considered identical, so they get the same timestamp on import.
Something similar can happen with just two layers with the same digest being imported at different times. The equality check for a cache layer is based on the content digest and doesn't take the creation timestamp into account.