
empty layers cause images built from cache to have different image config

Open amrmahdi opened this issue 2 years ago • 2 comments

The issue can be reproduced using the following repo: https://github.com/amrmahdi/buildkit-cache-issue

Further discussion from Slack: https://dockercommunity.slack.com/archives/C7S7A40MP/p1658360288913549?thread_ts=1656592512.773189&cid=C7S7A40MP

amrmahdi avatar Jul 21 '22 01:07 amrmahdi

Further discussion from Slack: https://dockercommunity.slack.com/archives/C7S7A40MP/p1658360288913549?thread_ts=1656592512.773189&cid=C7S7A40MP

Re-transcription below, just in case, since Slack message retention is limited:


Tyler Kalbach

Hello all, I've been having a strange issue with cache invalidation when using Docker on our Jenkins worker. The Jenkins worker is an EC2 instance with Docker 20.10 installed. When building the Docker images I am using DOCKER_BUILDKIT=1, --build-arg BUILDKIT_INLINE_CACHE=1, and --cache-from flags. This works perfectly for 30+ builds, but then the cache randomly gets invalidated even though the files being copied have not changed (same git commit) and the base image has not changed. I have gone through every file in the context to make sure nothing there was causing the invalidation and have confirmed that only relevant, unchanged files are in the context. The only other thing I can think of that could be causing the issue is that the docker build operation runs on our Jenkins workers and two or more "docker build" commands can run simultaneously. Could that be the issue, or is there something else that I could be missing?
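
For reference, a build invocation of the shape described above presumably looks roughly like this (image and cache references are placeholders, not taken from the original messages):

# Enable BuildKit for the classic docker CLI
export DOCKER_BUILDKIT=1

# Embed inline cache metadata in the built image and reuse cache
# from a previously pushed image
docker build \
  --build-arg BUILDKIT_INLINE_CACHE=1 \
  --cache-from REGISTRY/REPO:cache \
  -t REGISTRY/REPO:latest \
  .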

jed

Does it look quite random? I wonder if it might have something to do with a more up-to-date base image having been pulled, depending on how often you see the problem

Tyler Kalbach

It does appear to be really random. I thought it could be something with the base image as well, but the layers before the COPY command were cached correctly, and I confirmed that the base image SHA was identical between a previous build that successfully used the cache and the subsequent build that did not.

jed

ah ok :slightly_smiling_face: what does the COPY command look like? i wonder if something is in the context occasionally, like maybe a log file or something from an earlier stage?

Tyler Kalbach

# syntax=docker/dockerfile:1.4
FROM docker-prod.artifactory.aws.athenahealth.com/athenahealth/dsml/athena-datasci_ai-python-base:master

RUN poetry config virtualenvs.path /opt/venvs

COPY --link shared /opt/repo/builds/shared

ARG BUILD_DIR
COPY --link $BUILD_DIR/pyproject.toml $BUILD_DIR/poetry.lock /opt/repo/builds/$BUILD_DIR/

I've seen the first COPY get invalidated, as well as the first being cached successfully while the second gets invalidated. Entirely random. (edited)

Tyler Kalbach

BUILD_DIR remains the same for each build.

jed

hm, i wonder if this could be a problem with the --link option

jed

it is quite new, so it's possible it could be a bug

Tyler Kalbach

I was experiencing the issue before and after adding the --link option. I just added the link option to see if it would help.

Tyler Kalbach

Does the cache hashing care about the directory's full path? For example, let's say I'm in directory /tmp/hash-1/ and I build from this directory, then I move the files to /tmp/hash-2/ and build the same image from there; will the cache get invalidated? Never mind, just tested that and it doesn't matter. (edited)

Tyler Kalbach

For another data point. Once the cache gets randomly invalidated the next few builds have the same issue.

Tyler Kalbach

@jed If an image is built fully from cache, is there a scenario where the repoDigest of the resulting image would differ from the previously built version of the image that the cache came from? My expectation would be that if the image was built entirely from the previous cached image, the repoDigest should remain unchanged, but I am not seeing that in some of my builds, while in others this is the pattern.

Tyler Kalbach

I am using ECR for storing the images, so if this is unexpected behavior I can reach out to AWS to see if they have any processes in place that could be the culprit.

jed

I'd expect that the image digests don't change - at the very least the layer digests shouldn't be changing

Tyler Kalbach

thanks; I'm definitely seeing that the repoDigest changes even though the image was built entirely from cache.
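
A quick way to check the two things being compared here (a sketch; IMAGE is a placeholder for a locally present image reference):

# RepoDigests: the registry manifest digest(s) the local image is known by
docker image inspect --format '{{json .RepoDigests}}' IMAGE

# Layer diff IDs from the image config; on a fully cached rebuild these
# should match the previous build exactly
docker image inspect --format '{{json .RootFS.Layers}}' IMAGE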

Amr Mahdi

I'm also seeing something similar happen, but for me it is consistent. I understand why the cache is invalidated, but I don't understand what led to it. Here is what happens: all the layers are pulled from cache, and I can confirm that all the layer digests are exactly the same as in the originally built image. However, the config digest is different. When I compare the config of the originally built image with that of the image rebuilt from cache, the only difference is that the created entry of a few of the layers under history is different (see the attached screenshot). I haven't been able to repro it locally, unfortunately :( What is interesting is that this only happens between build 0 and build 1; subsequent builds produce results identical to build 1. @jed does that ring any bells?
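
A rough way to confirm this kind of difference, assuming the crane and jq tools are available (IMG0 and IMG1 are placeholders for the build-0 and build-1 image references):

# The layer diff IDs should be identical between the two builds
diff <(crane config IMG0 | jq .rootfs.diff_ids) \
     <(crane config IMG1 | jq .rootfs.diff_ids)

# The history arrays are where the differing "created" timestamps show up
diff <(crane config IMG0 | jq .history) \
     <(crane config IMG1 | jq .history)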

jed

Aha, you learn something new every day :slightly_smiling_face: I think this is expected. https://github.com/moby/buildkit/blob/master/source/containerimage/pull.go#L343 indicates that we pull the cache key for image sources from the config object referenced by the manifest, which, since it contains the diff ID hashes for each layer, also changes when layers change. If the created entry is different, I would guess that the base image has been rebuilt when it didn't need to be; likely the base image's caching setup is slightly off? What base image are you using? (I think it also makes sense that you don't see this after the build where this changed, since from then on the newer, changed image would be in the buildkit cache.) (edited)
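
To make the mechanism concrete: the manifest references the config blob by its digest, which is just a hash of the config bytes, so a change to any created entry in the history changes that digest. A small sketch, again assuming crane and jq are available (IMG is a placeholder):

# Config digest as referenced by the manifest
crane manifest IMG | jq -r .config.digest

# Hashing the config blob should give the same value, assuming crane
# writes the blob bytes through unmodified
crane config IMG | sha256sum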

Amr Mahdi

The base image has been rebuilt, yes, but it was fully rebuilt from the cache, and its image digest matches exactly between build 0 and build 1, so I do not see how this would cause the leaf image to change, when even the leaf image's layers are built from cache.

Amr Mahdi

Also, the base image is an image that we build in the CI pipeline.

Amr Mahdi

@tonistiigi any guidance on how to debug this? Even if it requires adding custom debug code.

tonistiigi

is there a reproducer?

tonistiigi

there was an issue with handling empty layer records in the history array when cache was imported. we removed the empty-layer removal feature to fix this in v0.10

Amr Mahdi

Unfortunately it happens only in a complex CI pipeline with large (4GB+) images, so I'm unable to create a simple repro. I'm actually running the latest buildkit.

Amr Mahdi

@tonistiigi I was able to create a simple repro. You can find the details here https://github.com/amrmahdi/buildkit-cache-issue

tonistiigi

The case seems to be 2 empty layers after each other

tonistiigi

history array on first image

      "created": "2022-07-20T23:33:05.302279427Z",
      "created_by": "RUN /bin/sh -c mkdir /src # buildkit",
      "comment": "buildkit.dockerfile.v0"
    },
    {
      "created": "2022-07-20T23:33:05.311102635Z",
      "created_by": "COPY /src/ /src/ # buildkit",
      "comment": "buildkit.dockerfile.v0"
    },
    {
      "created": "2022-07-20T23:33:05.31756351Z",
      "created_by": "WORKDIR /code",
      "comment": "buildkit.dockerfile.v0"
    }

tonistiigi

second

    {
      "created": "2022-07-20T23:33:05.302279427Z",
      "created_by": "RUN /bin/sh -c mkdir /src # buildkit",
      "comment": "buildkit.dockerfile.v0"
    },
    {
      "created": "2022-07-20T23:33:05.31756351Z",
      "created_by": "COPY /src/ /src/ # buildkit",
      "comment": "buildkit.dockerfile.v0"
    },
    {
      "created": "2022-07-20T23:33:05.31756351Z",
      "created_by": "WORKDIR /code",
      "comment": "buildkit.dockerfile.v0"
    }

tonistiigi

Open up an issue with these details. we can take a look at whether anything can be done about it, but it isn't a high priority because it is kind of a double error from the user, as empty layers are pointless to start with.

tonistiigi

what is happening (without looking at the details in the code) is that buildkit realizes that these empty layers have equal state, so the empty blob for your WORKDIR command is imported for the COPY command. because they are identical, only one blob is exported, and therefore the timestamp carries over as well (edited)
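
A minimal sketch of the kind of setup being discussed, reconstructed from the created_by entries in the history above (the base image, registry, and tags are placeholders; the linked repro repository is the authoritative reproduction):

export DOCKER_BUILDKIT=1

# Dockerfile reconstructed from the history entries above; in the repro the
# last two steps end up as identical empty layers
mkdir -p src
cat > Dockerfile <<'EOF'
FROM alpine:3.16
RUN mkdir /src
COPY /src/ /src/
WORKDIR /code
EOF

# First build, pushed with inline cache metadata
docker build --build-arg BUILDKIT_INLINE_CACHE=1 -t REGISTRY/repro:build0 .
docker push REGISTRY/repro:build0

# Rebuild on a clean builder (e.g. after docker builder prune) so the cache is
# imported from the registry, then compare the two image configs: the layer
# digests match, but some history timestamps differ
docker build --build-arg BUILDKIT_INLINE_CACHE=1 \
  --cache-from REGISTRY/repro:build0 -t REGISTRY/repro:build1 .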

Amr Mahdi

I understand the Dockerfile itself is meaningless; I derived this repro from a complex graph of Dockerfiles that depend on each other. This happens to an intermediate Dockerfile, which causes the caches of all Dockerfiles depending on it to be invalidated.

tonistiigi

If you can modify the repro to not have these equal layers then it would be something different

tonistiigi

you can check that by just looking at the rootfs array in the image

tonistiigi

current

  "rootfs": {
    "type": "layers",
    "diff_ids": [
      "sha256:c7412c2a678786efc84fe4bd0b1a79ecca47457b0eb4d4bceb1f79d6b4f75695",
      "sha256:81211b8de33f8d415c90bd3ebed173494fd16af55d5e6822ddf6c88c755391ea",
      "sha256:e96f169af1b3dfd640674cca738e3adca0539c4f06dd4f7d5b5d3bfe3e4f6fd7",
      "sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef",
      "sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef"
    ]
  },

Amr Mahdi

I understand that. But as the author of the Dockerfile, I would not expect an empty layer to ruin my cache ...

tonistiigi

it doesn't "ruin" your cache. even in your example the layers all get imported correctly. the only thing that goes wrong is that identical empty layers also end up with identical timestamps in the history after import

tonistiigi

it is still a bug

Amr Mahdi

When this image is rebuilt and has a different digest, the cache of any image that builds from it will be invalidated.

tonistiigi

not with buildkit

tonistiigi

buildkit tracks all the layers separately and all your layers are correct

Amr Mahdi

Any tips on how to avoid creating two such empty layers when authoring Dockerfiles? In some cases it may not be obvious at all.

tonistiigi

an empty or duplicated layer doesn't provide anything to the image. basically it has to be a mistake in a Dockerfile

Amr Mahdi

Opened https://github.com/moby/buildkit/issues/2973

crazy-max avatar Jul 23 '22 01:07 crazy-max

I'm not entirely sure whether this is caused by the empty layer, the duplicate layer, or both acting together.

For the cache layer, the two layers are considered identical, so they get the same timestamp on import.

Something similar can happen when two layers with the same digest are imported at different times. The equality check for a cache layer is based on the content digest and doesn't take the creation timestamp into account.

tonistiigi avatar Jul 26 '22 18:07 tonistiigi