[v0.13] fail to export image when using rewrite-timestamp=true
Hello,
I've been running BuildKit 0.13.0-beta1 for a while as a docker-container driver to get ECR registry cache integration and timestamp rewriting for reproducible builds. This worked well for a couple of months.
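For reference, the builder was created as a docker-container driver roughly like this (a sketch; the pinned image tag reflects the version mentioned above, and the builder name matches the one used in the build command below):
# Create a docker-container builder pinned to a specific BuildKit image.
docker buildx create \
  --name buildkit-6f9724f17cbb483a5f1e \
  --driver docker-container \
  --driver-opt image=moby/buildkit:v0.13.0-beta1 \
  --bootstrap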
Today I tried upgrading BuildKit to the stable 0.13.0 and 0.13.1 releases with the same docker-container setup, and started encountering these errors:
...
#18 [10/10] COPY test.conf /etc/test.conf
#18 CACHED
#19 exporting to image
#19 exporting layers done
#19 rewriting layers with source-date-epoch 1 (1970-01-01 00:00:01 +0000 UTC)
------
> exporting to image:
------
ERROR: failed to solve: content digest sha256:7b722c1070cdf5188f1f9e43b8413157f8dfb2b4fe84db3c03cb492379a42fcc: not found
For some reason, jobs that use this cached layer fail to export it. Once I delete the cache from ECR, the first build passes, and subsequent builds fail.
Everything works again once I revert to using BuildKit 0.13.0-beta1, with new and existing caches.
This is the command I use:
docker buildx build \
--build-arg SOURCE_DATE_EPOCH=1 \
--builder buildkit-6f9724f17cbb483a5f1e \
--provenance=false \
--sbom=false \
--cache-from type=registry,ref=0000000000.dkr.ecr.us-east-2.amazonaws.com/image-name:branch-name-cache \
--cache-from type=registry,ref=0000000000.dkr.ecr.us-east-2.amazonaws.com/image-name:main-cache \
--cache-to mode=max,image-manifest=true,oci-mediatypes=true,type=registry,ref=0000000000.dkr.ecr.us-east-2.amazonaws.com/image-name:branch-name \
--output type=registry,rewrite-timestamp=true \
--tag 0000000000.dkr.ecr.us-east-2.amazonaws.com/image-name:tmp.4521fd86 .
Edit: similar to this issue, but I'm not doing COPY --link
Please attach a (minimal and yet self-contained) reproducer
Admittedly I cannot reproduce this in a minimal self-contained environment. Closing the issue until I figure out what's up.
We have a very similar issue with v0.13.1, with this command:
buildctl --debug build --progress plain --frontend dockerfile.v0 \
--local context=. --local dockerfile=clients/web/ \
--export-cache mode=max,image-manifest=true,oci-mediatypes=true,type=registry,ref=XXXXXX.dkr.ecr.eu-west-1.amazonaws.com/YYYYY:TAG-v0.13.1,ignore-error=true,compression=estargz,rewrite-timestamp=true \
--import-cache type=registry,unpack=true,ref=XXXXXX.dkr.ecr.eu-west-1.amazonaws.com/YYYYY:latest-cache-v0.13.1 \
--import-cache type=registry,unpack=true,ref=XXXXXX.dkr.ecr.eu-west-1.amazonaws.com/YYYYY:TAG-v0.13.1 \
--opt build-arg:ARCH=amd64 \
--opt build-arg:REGISTRY=internal-registry.ZZZZZ.net:5000/library/ \
--opt build-arg:SOURCE_DATE_EPOCH=1707129707 \
--opt target=deploy \
--opt filename=./Storybook.Dockerfile \
-o type=image,"name=XXXXXX.dkr.ecr.eu-west-1.amazonaws.com/YYYYY:change-c06b70bf98322eb59b6f58d86ea67ba4,XXXXXX.dkr.ecr.eu-west-1.amazonaws.com/YYYYY:dev-TAG,XXXXXX.dkr.ecr.eu-west-1.amazonaws.com/YYYYY:dev-TAG",push=true,compression=estargz,rewrite-timestamp=true
Stack trace:
google.golang.org/grpc.getChainUnaryInvoker.func1
13:05:20 /src/vendor/google.golang.org/grpc/clientconn.go:519
13:05:20 go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryClientInterceptor.func1
13:05:20 /src/vendor/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc/interceptor.go:110
13:05:20 github.com/moby/buildkit/client.New.filterInterceptor.func5
13:05:20 /src/client/client.go:387
13:05:20 google.golang.org/grpc.DialContext.chainUnaryClientInterceptors.func3
13:05:20 /src/vendor/google.golang.org/grpc/clientconn.go:507
13:05:20 google.golang.org/grpc.(*ClientConn).Invoke
13:05:20 /src/vendor/google.golang.org/grpc/call.go:35
13:05:20 github.com/moby/buildkit/api/services/control.(*controlClient).Solve
13:05:20 /src/api/services/control/control.pb.go:2234
13:05:20 github.com/moby/buildkit/client.(*Client).solve.func2
13:05:20 /src/client/solve.go:274
13:05:20 golang.org/x/sync/errgroup.(*Group).Go.func1
13:05:20 /src/vendor/golang.org/x/sync/errgroup/errgroup.go:75
13:05:20 runtime.goexit
13:05:20 /usr/local/go/src/runtime/asm_amd64.s:1650
13:05:20
13:05:20 github.com/moby/buildkit/client.(*Client).solve.func2
13:05:20 /src/client/solve.go:290
13:05:20 golang.org/x/sync/errgroup.(*Group).Go.func1
13:05:20 /src/vendor/golang.org/x/sync/errgroup/errgroup.go:75
13:05:20 runtime.goexit
13:05:20 /usr/local/go/src/runtime/asm_amd64.s:1650
@emalihin did you find a solution for your problem?
I've rolled back to 0.13.0-beta1 for now. Are both of your --import-cache refs already populated? I might have missed this step trying to reproduce in an isolated env.
I've rolled back to 0.13.0-beta1 for now. Are both of your --import-cache refs already populated? I might have missed this step trying to reproduce in an isolated env.
I see. In this specific example latest-cache-v0.13.1 is populated while TAG-v0.13.1 is not. But I was able to reproduce the issue also with both caches populated.
Actually, I noticed this started happening after upgrading from moby/buildkit:v0.13.0-beta3 to v0.13.0 (and I also see it with moby/buildkit:v0.13.1), so I'm going to revert to v0.13.0-beta3 and see if that solves the issue.
My guess is that somehow the cache is "poisoned": once a cache hits the issue, I continue to see it happening until I manually delete the cache. I also build multiple images in parallel, and some images have the issue while others don't, but I'm not able to find any pattern in the behavior.
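For what it's worth, a sketch of how I clear a suspect cache tag by hand (repository and tag names are the XXXXXX/YYYYY placeholders from the command above; adjust the region to match):
# Delete the suspect cache tag from ECR so the next build starts from a clean cache.
aws ecr batch-delete-image \
  --region eu-west-1 \
  --repository-name YYYYY \
  --image-ids imageTag=TAG-v0.13.1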
Might be a regression in https://github.com/moby/buildkit/pull/4663 (v0.13.0-rc2) ?
My guess is that somehow the cache is "poisoned": once a cache hits the issue, I continue to see it happening until I manually delete the cache.
In my testing of 0.13.0 and 0.13.1 the first cache write worked, but the first and subsequent cache reads failed. However, when I went back to 0.13.0-beta1, cache reads worked, even with the caches written by 0.13.0 and 0.13.1.
Might be a regression in #4663 (v0.13.0-rc2) ?
I'm testing v0.13.0-rc1 and currently I'm not seeing the issue (but previously I noticed it could take a while to appear, so I'm not 100% sure that rc1 doesn't have this issue).
My guess is that somehow the cache is "poisoned": once a cache hits the issue, I continue to see it happening until I manually delete the cache.
In my testing of 0.13.0 and 0.13.1 the first cache write worked, but the first and subsequent cache reads failed. However, when I went back to 0.13.0-beta1, cache reads worked, even with the caches written by 0.13.0 and 0.13.1.
I didn't check that. Our CI generates different cache tags if we change the buildkit version, so I'm not able to test this scenario.
This PR may fix your issue too
- https://github.com/moby/buildkit/pull/4804
This PR may fix your issue too
Happy to test it as soon as a build is available. Thank you.
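A sketch of how I plan to test it once an image is published (the moby/buildkit:master tag here is an assumption; I'd substitute whatever tag carries the fix and rerun the buildctl command above against that daemon):
# Run a buildkitd container from the candidate image and point buildctl at it.
docker run -d --name buildkitd-test --privileged moby/buildkit:master
buildctl --addr docker-container://buildkitd-test debug workers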
I found a way to repro it on 0.13.2. Dangling (unreferenced) stages appear to cause problems when using rewrite-timestamp=true.
Example Dockerfile:
FROM debian:bookworm as base
FROM debian:bookworm-slim
RUN echo "foo"
Might need to reopen this or file a new issue?
@Manbeardo
Thanks for reporting, but I can't repro the issue with BuildKit v0.13.2 on Ubuntu 24.04:
sudo buildctl build --frontend dockerfile.v0 --local dockerfile=. --local context=. --output type=oci,name=build-0,dest=/tmp/build-0.tar,dir=false,rewrite-timestamp=true --no-cache
Here's a script to more precisely repro what I've seen (using a docker-container buildx driver at v0.13.2):
#!/usr/bin/env bash
set -xeuo pipefail
cd "$(mktemp -d)"
cat >Dockerfile <<'EOF'
FROM debian:bookworm-slim AS base
FROM debian:bookworm
EOF
OUTPUT_IMAGE="<IMAGE FROM PRIVATE REGISTRY>"
CACHE_IMAGE="<DIFFERENT IMAGE FROM PRIVATE REGISTRY>"
docker buildx build \
--pull \
--output "name=$OUTPUT_IMAGE,push=true,rewrite-timestamp=true,type=image" \
--cache-to "ref=$CACHE_IMAGE,image-manifest=true,mode=max,oci-mediatypes=true,type=registry" \
--platform linux/arm64/v8,linux/amd64 \
.
That yields this error:
------
> exporting to image:
------
ERROR: failed to solve: failed to push [REDACTED PRIVATE IMAGE NAME]: content digest sha256:60bdaf986dbe787297bb85c9f6a28d13ea7b9608b95206ef7ce6cdea50cd5505: not found
This can probably be reduced further, but this was the first configuration that reproduced the error as I gradually added options from my full build pipeline. My build pipeline fails 100% of the time on my AWS CI builders, but appears to fail inconsistently (race condition?) when I build on my MacBook.
Either way, removing the unused stage from the Dockerfile eliminates the error.
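As a stopgap, this is a sketch of the same repro script with the dangling stage dropped (same placeholder image name), i.e. the workaround described above:
#!/usr/bin/env bash
set -xeuo pipefail
cd "$(mktemp -d)"
# Same Dockerfile as the repro, minus the unreferenced "base" stage.
cat >Dockerfile <<'EOF'
FROM debian:bookworm
EOF
OUTPUT_IMAGE="<IMAGE FROM PRIVATE REGISTRY>"
docker buildx build \
  --pull \
  --output "name=$OUTPUT_IMAGE,push=true,rewrite-timestamp=true,type=image" \
  .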
Could that be related to https://github.com/containerd/containerd/pull/10187, @AkihiroSuda?
Minimal repro:
sudo buildctl build --frontend dockerfile.v0 --local dockerfile=. --local context=. --no-cache --output type=image,name=localhost:5000/foo,push=true,rewrite-timestamp=true
(Needs push=true && rewrite-timestamp=true)
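For completeness, the repro assumes a registry listening on localhost:5000; a throwaway one can be started with the standard registry image (the container name is arbitrary):
# Disposable local registry to receive the push=true output above.
docker run -d --rm --name repro-registry -p 5000:5000 registry:2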
I also see an occasional panic in the daemon log
WARN[2024-05-15T02:44:50+09:00] rewrite-timestamp is specified, but no source-date-epoch was found spanID=cbbd84754ca443c5 traceID=a347fa5861b3af65b3508699ef6f1797
panic: send on closed channel
goroutine 671 [running]:
github.com/containerd/containerd/remotes/docker.(*pushWriter).setPipe(...)
/home/suda/gopath/src/github.com/moby/buildkit/vendor/github.com/containerd/containerd/remotes/docker/pusher.go:364
github.com/containerd/containerd/remotes/docker.dockerPusher.push.func1()
/home/suda/gopath/src/github.com/moby/buildkit/vendor/github.com/containerd/containerd/remotes/docker/pusher.go:286 +0xcc
github.com/containerd/containerd/remotes/docker.(*request).do(0xc000d5b5f0, {0x295a620, 0xc000581410})
/home/suda/gopath/src/github.com/moby/buildkit/vendor/github.com/containerd/containerd/remotes/docker/resolver.go:556 +0x162
github.com/containerd/containerd/remotes/docker.(*request).doWithRetries(0xc000d5b5f0, {0x295a620, 0xc000581410}, {0x0, 0x0, 0x0})
/home/suda/gopath/src/github.com/moby/buildkit/vendor/github.com/containerd/containerd/remotes/docker/resolver.go:600 +0x47
github.com/containerd/containerd/remotes/docker.dockerPusher.push.func2()
/home/suda/gopath/src/github.com/moby/buildkit/vendor/github.com/containerd/containerd/remotes/docker/pusher.go:292 +0x57
created by github.com/containerd/containerd/remotes/docker.dockerPusher.push in goroutine 787
/home/suda/gopath/src/github.com/moby/buildkit/vendor/github.com/containerd/containerd/remotes/docker/pusher.go:291 +0x2150
Any movement or changes to this @AkihiroSuda?
Any movement or changes to this @AkihiroSuda?
Not yet, PR is welcome
PR:
- #5008