[v0.13] fail to export image when using rewrite-timestamp=true

Open emalihin opened this issue 1 year ago • 15 comments

Hello,

I've been running BuildKit 0.13.0-beta1 for a while with the docker-container driver to get ECR registry cache integration and timestamp rewriting for reproducible builds. This worked well for a couple of months.

Today I tried upgrading BuildKit to the stable 0.13.0 and 0.13.1 releases with the same docker-container setup, and started encountering these errors:

...
#18 [10/10] COPY test.conf /etc/test.conf
#18 CACHED
#19 exporting to image
#19 exporting layers done
#19 rewriting layers with source-date-epoch 1 (1970-01-01 00:00:01 +0000 UTC)
------
 > exporting to image:
------
ERROR: failed to solve: content digest sha256:7b722c1070cdf5188f1f9e43b8413157f8dfb2b4fe84db3c03cb492379a42fcc: not found

For some reason, jobs that use this cached layer fail to export it. Once I delete the cache from ECR, the first build passes and subsequent builds fail.
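
For reference, deleting the poisoned cache tag from ECR can be done with the AWS CLI along these lines; the repository, region, and tag names here are illustrative, mirroring the command below:

aws ecr batch-delete-image \
  --region us-east-2 \
  --repository-name image-name \
  --image-ids imageTag=branch-name-cache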

Everything works again once I revert to using BuildKit 0.13.0-beta1, with new and existing caches.

This is the command I use:

docker buildx build \
--build-arg SOURCE_DATE_EPOCH=1 \
--builder buildkit-6f9724f17cbb483a5f1e \
--provenance=false \
--sbom=false \
--cache-from type=registry,ref=0000000000.dkr.ecr.us-east-2.amazonaws.com/image-name:branch-name-cache \
--cache-from type=registry,ref=0000000000.dkr.ecr.us-east-2.amazonaws.com/image-name:main-cache \
--cache-to mode=max,image-manifest=true,oci-mediatypes=true,type=registry,ref=0000000000.dkr.ecr.us-east-2.amazonaws.com/image-name:branch-name \
--output type=registry,rewrite-timestamp=true \
--tag 0000000000.dkr.ecr.us-east-2.amazonaws.com/image-name:tmp.4521fd86 .

Edit: similar to this issue, but I'm not doing COPY --link

emalihin avatar Mar 21 '24 16:03 emalihin

Please attach a (minimal and yet self-contained) reproducer

AkihiroSuda avatar Mar 21 '24 20:03 AkihiroSuda

Admittedly I cannot reproduce this in a minimal self-contained environment. Closing the issue until I figure out what's up.

emalihin avatar Mar 25 '24 11:03 emalihin

We have a very similar issue with v0.13.1, using this command:

buildctl --debug build --progress plain --frontend dockerfile.v0
              --local context=. --local dockerfile=clients/web/
              --export-cache mode=max,image-manifest=true,oci-mediatypes=true,type=registry,ref=XXXXXX.dkr.ecr.eu-west-1.amazonaws.com/YYYYY:TAG-v0.13.1,ignore-error=true,compression=estargz,rewrite-timestamp=true
              --import-cache type=registry,unpack=true,ref=XXXXXX.dkr.ecr.eu-west-1.amazonaws.com/YYYYY:latest-cache-v0.13.1
              --import-cache type=registry,unpack=true,ref=XXXXXX.dkr.ecr.eu-west-1.amazonaws.com/YYYYY:TAG-v0.13.1
              --opt build-arg:ARCH=amd64
              --opt build-arg:REGISTRY=internal-registry.ZZZZZ.net:5000/library/
              --opt build-arg:SOURCE_DATE_EPOCH=1707129707
              --opt target=deploy
              --opt filename=./Storybook.Dockerfile
              -o type=image,"name=XXXXXX.dkr.ecr.eu-west-1.amazonaws.com/YYYYY:change-c06b70bf98322eb59b6f58d86ea67ba4,XXXXXX.dkr.ecr.eu-west-1.amazonaws.com/YYYYY:dev-TAG,XXXXXX.dkr.ecr.eu-west-1.amazonaws.com/YYYYY:dev-TAG",push=true,compression=estargz,rewrite-timestamp=true

Stack trace:

google.golang.org/grpc.getChainUnaryInvoker.func1
13:05:20  	/src/vendor/google.golang.org/grpc/clientconn.go:519
13:05:20  go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryClientInterceptor.func1
13:05:20  	/src/vendor/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc/interceptor.go:110
13:05:20  github.com/moby/buildkit/client.New.filterInterceptor.func5
13:05:20  	/src/client/client.go:387
13:05:20  google.golang.org/grpc.DialContext.chainUnaryClientInterceptors.func3
13:05:20  	/src/vendor/google.golang.org/grpc/clientconn.go:507
13:05:20  google.golang.org/grpc.(*ClientConn).Invoke
13:05:20  	/src/vendor/google.golang.org/grpc/call.go:35
13:05:20  github.com/moby/buildkit/api/services/control.(*controlClient).Solve
13:05:20  	/src/api/services/control/control.pb.go:2234
13:05:20  github.com/moby/buildkit/client.(*Client).solve.func2
13:05:20  	/src/client/solve.go:274
13:05:20  golang.org/x/sync/errgroup.(*Group).Go.func1
13:05:20  	/src/vendor/golang.org/x/sync/errgroup/errgroup.go:75
13:05:20  runtime.goexit
13:05:20  	/usr/local/go/src/runtime/asm_amd64.s:1650
13:05:20  
13:05:20  github.com/moby/buildkit/client.(*Client).solve.func2
13:05:20  	/src/client/solve.go:290
13:05:20  golang.org/x/sync/errgroup.(*Group).Go.func1
13:05:20  	/src/vendor/golang.org/x/sync/errgroup/errgroup.go:75
13:05:20  runtime.goexit
13:05:20  	/usr/local/go/src/runtime/asm_amd64.s:1650

@emalihin did you find a solution for your problem?

massimeddu-sj avatar Mar 27 '24 13:03 massimeddu-sj

I've rolled back to 0.13.0-beta1 for now. Are both of your --import-cache refs already populated? I might have missed this step when trying to reproduce in an isolated env.

emalihin avatar Mar 27 '24 13:03 emalihin

> I've rolled back to 0.13.0-beta1 for now. Are both of your --import-cache refs already populated? I might have missed this step when trying to reproduce in an isolated env.

I see. In this specific example, latest-cache-v0.13.1 is populated while TAG-v0.13.1 is not. But I was also able to reproduce the issue with both caches populated.
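
As a side note, a cache ref can be pre-populated with a cache-only run that skips the image output entirely, roughly like this (registry, path, and tag are placeholders taken from the command above):

buildctl build --frontend dockerfile.v0 \
  --local context=. --local dockerfile=clients/web/ \
  --export-cache mode=max,image-manifest=true,oci-mediatypes=true,type=registry,ref=XXXXXX.dkr.ecr.eu-west-1.amazonaws.com/YYYYY:TAG-v0.13.1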

Actually, I noticed this started happening after upgrading from moby/buildkit:v0.13.0-beta3 to v0.13.0 (and I also see it with moby/buildkit:v0.13.1), so I'm going to revert to v0.13.0-beta3 and see if that solves the issue.

My guess is that the cache somehow gets "poisoned": once a cache hits the issue, I keep seeing it until I manually delete that cache. I also build multiple images in parallel, and some images hit the issue while others do not, but I can't find any pattern in the behavior.

massimeddu-sj avatar Mar 27 '24 13:03 massimeddu-sj

Might be a regression in https://github.com/moby/buildkit/pull/4663 (v0.13.0-rc2)?

AkihiroSuda avatar Mar 27 '24 13:03 AkihiroSuda

> My guess is that the cache somehow gets "poisoned": once a cache hits the issue, I keep seeing it until I manually delete that cache.

In my testing of 0.13.0 and 0.13.1, the first cache write worked, but the first and all subsequent cache reads failed. However, when I went back to 0.13.0-beta1, cache reads worked, even with the caches written by 0.13.0 and 0.13.1.

emalihin avatar Mar 27 '24 13:03 emalihin

> Might be a regression in #4663 (v0.13.0-rc2)?

I'm testing v0.13.0-rc1 and currently I'm not seeing the issue (but I've previously noticed it can take a while to show up, so I'm not 100% sure that rc1 doesn't have this issue).

> My guess is that the cache somehow gets "poisoned": once a cache hits the issue, I keep seeing it until I manually delete that cache.

> In my testing of 0.13.0 and 0.13.1, the first cache write worked, but the first and all subsequent cache reads failed. However, when I went back to 0.13.0-beta1, cache reads worked, even with the caches written by 0.13.0 and 0.13.1.

I didn't check that. Our CI generates different cache tags if we change the buildkit version, so I'm not able to test this scenario.

massimeddu-sj avatar Mar 27 '24 15:03 massimeddu-sj

This PR may fix your issue too

  • https://github.com/moby/buildkit/pull/4804

AkihiroSuda avatar Mar 28 '24 01:03 AkihiroSuda

> This PR may fix your issue too

Happy to test it as soon as a build is available. Thank you.
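
One way to try an unreleased build with the docker-container driver is to point a fresh builder at a specific buildkitd image; the builder name and image tag below are just examples:

docker buildx create --name buildkit-test \
  --driver docker-container \
  --driver-opt image=moby/buildkit:master
docker buildx inspect buildkit-test --bootstrap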

massimeddu-sj avatar Mar 28 '24 10:03 massimeddu-sj

I found a way to repro it on 0.13.2. Dangling (unreferenced) stages appear to cause problems when using rewrite-timestamp=true.

Example Dockerfile:

FROM debian:bookworm as base

FROM debian:bookworm-slim

RUN echo "foo"

Might need to reopen this or file a new issue?

Manbeardo avatar May 11 '24 01:05 Manbeardo

@Manbeardo

Thanks for reporting, but I can't repro the issue with BuildKit v0.13.2 on Ubuntu 24.04

sudo buildctl build --frontend dockerfile.v0 --local dockerfile=. --local context=. --output type=oci,name=build-0,dest=/tmp/build-0.tar,dir=false,rewrite-timestamp=true --no-cache

AkihiroSuda avatar May 11 '24 16:05 AkihiroSuda

Here's a script to more precisely repro what I've seen (using a docker-container buildx driver at v0.13.2):

#!/usr/bin/env bash
set -xeuo pipefail

cd "$(mktemp -d)"

cat >Dockerfile <<'EOF'
FROM debian:bookworm-slim AS base
FROM debian:bookworm
EOF

OUTPUT_IMAGE="<IMAGE FROM PRIVATE REGISTRY>"
CACHE_IMAGE="<DIFFERENT IMAGE FROM PRIVATE REGISTRY>"

docker buildx build \
    --pull \
    --output "name=$OUTPUT_IMAGE,push=true,rewrite-timestamp=true,type=image" \
    --cache-to "ref=$CACHE_IMAGE,image-manifest=true,mode=max,oci-mediatypes=true,type=registry" \
    --platform linux/arm64/v8,linux/amd64 \
    .

That yields this error:

------
 > exporting to image:
------
ERROR: failed to solve: failed to push [REDACTED PRIVATE IMAGE NAME]: content digest sha256:60bdaf986dbe787297bb85c9f6a28d13ea7b9608b95206ef7ce6cdea50cd5505: not found

This can probably be reduced to a more minimal repro, but this was the first config that reproduced the error as I gradually added configuration from my full build pipeline. My build pipeline had a 100% failure rate on my AWS CI builders, but appears to fail inconsistently (race condition?) when I build on my MacBook.

Either way, removing the unused stage from the Dockerfile eliminates the error.
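
Concretely, the workaround amounts to dropping the dangling stage, so the repro Dockerfile above reduces to:

FROM debian:bookworm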

Manbeardo avatar May 13 '24 18:05 Manbeardo

Could that be related to https://github.com/containerd/containerd/pull/10187 @AkihiroSuda?

thaJeztah avatar May 13 '24 18:05 thaJeztah

Minimal repro:

sudo buildctl build --frontend dockerfile.v0 --local dockerfile=. --local context=. --no-cache --output type=image,name=localhost:5000/foo,push=true,rewrite-timestamp=true

(Needs push=true && rewrite-timestamp=true)
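
For anyone reproducing this locally, a throwaway registry at localhost:5000 can be started with something like:

docker run -d -p 5000:5000 --name registry registry:2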

I also see an occasional panic in the daemon log

WARN[2024-05-15T02:44:50+09:00] rewrite-timestamp is specified, but no source-date-epoch was found  spanID=cbbd84754ca443c5 traceID=a347fa5861b3af65b3508699ef6f1797
panic: send on closed channel

goroutine 671 [running]:
github.com/containerd/containerd/remotes/docker.(*pushWriter).setPipe(...)
        /home/suda/gopath/src/github.com/moby/buildkit/vendor/github.com/containerd/containerd/remotes/docker/pusher.go:364
github.com/containerd/containerd/remotes/docker.dockerPusher.push.func1()
        /home/suda/gopath/src/github.com/moby/buildkit/vendor/github.com/containerd/containerd/remotes/docker/pusher.go:286 +0xcc
github.com/containerd/containerd/remotes/docker.(*request).do(0xc000d5b5f0, {0x295a620, 0xc000581410})
        /home/suda/gopath/src/github.com/moby/buildkit/vendor/github.com/containerd/containerd/remotes/docker/resolver.go:556 +0x162
github.com/containerd/containerd/remotes/docker.(*request).doWithRetries(0xc000d5b5f0, {0x295a620, 0xc000581410}, {0x0, 0x0, 0x0})
        /home/suda/gopath/src/github.com/moby/buildkit/vendor/github.com/containerd/containerd/remotes/docker/resolver.go:600 +0x47
github.com/containerd/containerd/remotes/docker.dockerPusher.push.func2()
        /home/suda/gopath/src/github.com/moby/buildkit/vendor/github.com/containerd/containerd/remotes/docker/pusher.go:292 +0x57
created by github.com/containerd/containerd/remotes/docker.dockerPusher.push in goroutine 787
        /home/suda/gopath/src/github.com/moby/buildkit/vendor/github.com/containerd/containerd/remotes/docker/pusher.go:291 +0x2150

AkihiroSuda avatar May 14 '24 17:05 AkihiroSuda

Any movement or changes to this @AkihiroSuda?

thompson-shaun avatar May 30 '24 19:05 thompson-shaun

> Any movement or changes to this @AkihiroSuda?

Not yet, PR is welcome

AkihiroSuda avatar May 31 '24 02:05 AkihiroSuda

PR:

  • #5008

AkihiroSuda avatar Jun 09 '24 17:06 AkihiroSuda