
Builds stuck in "preparing build cache for export" stage

Open razzmatazz opened this issue 6 months ago • 20 comments

Contributing guidelines and issue reporting guide

Well-formed report checklist

  • [x] I have checked that the documentation does not mention anything about my problem
  • [x] I have checked that there are no open or closed issues related to my problem
  • [x] I have provided version/information about my environment and done my best to provide a reproducer

Description of bug

I am seeing buildkit fail to leave the "preparing build cache for export" stage when building a large, multi-stage image. It does pass from time to time (after ~1-2 hours), but mostly it seems to be stuck in the checkLoops/removeLoops functions, with CPU pegged at 100%.

The docker buildx build invocation specifies a registry cache with two --cache-from (type=registry) flags (same repo, two different tags, if that makes a difference) and a single --cache-to (type=registry).
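For illustration, an invocation of that shape might look roughly like this; the registry, repository, and tag names are placeholders, not my real values:

# Two registry cache sources (same repo, different tags) and one registry cache export.
docker buildx build \
  --cache-from type=registry,ref=registry.example.com/app:cache-main \
  --cache-from type=registry,ref=registry.example.com/app:cache-extra \
  --cache-to type=registry,ref=registry.example.com/app:cache-main \
  -t registry.example.com/app:latest --push .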

pprof output:

(pprof) top50 -cum
Showing nodes accounting for 118.43s, 98.06% of 120.77s total
Dropped 139 nodes (cum <= 0.60s)
      flat  flat%   sum%        cum   cum%
         0     0%     0%    119.92s 99.30%  github.com/moby/buildkit/cache/remotecache/v1.(*CacheChains).Marshal
         0     0%     0%    119.92s 99.30%  github.com/moby/buildkit/cache/remotecache/v1.(*CacheChains).normalize
     9.64s  7.98%  7.98%    119.92s 99.30%  github.com/moby/buildkit/cache/remotecache/v1.(*normalizeState).checkLoops
         0     0%  7.98%    119.92s 99.30%  github.com/moby/buildkit/cache/remotecache/v1.(*normalizeState).removeLoops
         0     0%  7.98%    119.90s 99.28%  github.com/moby/buildkit/cache/remotecache.(*contentCacheExporter).Finalize
         0     0%  7.98%    119.70s 99.11%  github.com/moby/buildkit/solver/llbsolver.runCacheExporters.func1.1
         0     0%  7.98%    119.20s 98.70%  github.com/moby/buildkit/solver/llbsolver.inBuilderContext.func1
         0     0%  7.98%    117.89s 97.62%  github.com/moby/buildkit/solver.(*Job).InContext
         0     0%  7.98%    116.29s 96.29%  github.com/moby/buildkit/solver/llbsolver.inBuilderContext
         0     0%  7.98%    114.14s 94.51%  github.com/moby/buildkit/solver/llbsolver.runCacheExporters.func1
         0     0%  7.98%    111.42s 92.26%  golang.org/x/sync/errgroup.(*Group).Go.func1
    22.34s 18.50% 26.48%     33.73s 27.93%  runtime.mapiternext
     6.20s  5.13% 31.61%     32.42s 26.84%  runtime.mapiterinit
    14.93s 12.36% 43.98%     28.40s 23.52%  runtime.mapaccess2_faststr
    17.66s 14.62% 58.60%     17.66s 14.62%  aeshashbody
     0.61s  0.51% 59.10%     14.30s 11.84%  github.com/moby/buildkit/cache/remotecache/v1.(*normalizeState).checkLoops.func1
     7.89s  6.53% 65.64%     13.69s 11.34%  runtime.mapdelete_faststr
     5.69s  4.71% 70.35%     11.06s  9.16%  runtime.mapassign_faststr
     9.67s  8.01% 78.36%      9.67s  8.01%  runtime.add (inline)
     5.69s  4.71% 83.07%      9.03s  7.48%  runtime.mapaccess2_fast64
     2.53s  2.09% 85.16%      6.27s  5.19%  runtime.(*bmap).overflow (inline)
     2.39s  1.98% 87.14%      6.04s  5.00%  runtime.rand
     3.51s  2.91% 90.05%      3.51s  2.91%  runtime.isEmpty (inline)
     0.22s  0.18% 90.23%      3.35s  2.77%  internal/chacha8rand.(*State).Refill
     3.13s  2.59% 92.82%      3.13s  2.59%  internal/chacha8rand.block
     1.94s  1.61% 94.43%      1.94s  1.61%  runtime.memhash64
     1.43s  1.18% 95.61%      1.43s  1.18%  runtime.strhash
     0.88s  0.73% 96.34%      0.88s  0.73%  runtime.tophash (inline)
     0.71s  0.59% 96.93%      0.71s  0.59%  internal/abi.(*Type).Pointers (inline)
     0.68s  0.56% 97.49%      0.68s  0.56%  runtime.duffzero
     0.06s  0.05% 97.54%      0.66s  0.55%  runtime.bucketMask (inline)
     0.63s  0.52% 98.06%      0.63s  0.52%  runtime.bucketShift (inline)
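(For anyone wanting to grab a similar profile: with buildkitd's gRPC debugAddress enabled, as in the config below, something along these lines should work; the host and port are assumptions about your setup.)

# Fetch a 60-second CPU profile from buildkitd's pprof endpoint and print the top entries.
go tool pprof -top 'http://localhost:6060/debug/pprof/profile?seconds=60'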

Reproduction

It may be difficult to reproduce and I cannot share the Dockerfile as it is private, but the issue does appear from time to time, and I believe #2009 is related.

Version information

Running buildkitd in docker-container mode, v0.21.1 (the current moby/buildkit:buildx-stable-1).

~$ docker buildx version && docker buildx inspect
github.com/docker/buildx v0.24.0 d0e5e86
Name:          gha-runner-vm-builder
Driver:        docker-container
Last Activity: 2025-06-03 14:52:14 +0000 UTC

Nodes:
Name:                  gha-runner-vm-builder0
Endpoint:              unix:///var/run/docker.sock
Driver Options:        network="host"
Status:                running
BuildKit daemon flags: --allow-insecure-entitlement=network.host
BuildKit version:      v0.21.1
Platforms:             linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/386
Labels:
 org.mobyproject.buildkit.worker.executor:         oci
 org.mobyproject.buildkit.worker.hostname:         mayhem-gha-runner
 org.mobyproject.buildkit.worker.network:          host
 org.mobyproject.buildkit.worker.oci.process-mode: sandbox
 org.mobyproject.buildkit.worker.selinux.enabled:  false
 org.mobyproject.buildkit.worker.snapshotter:      overlayfs
File#buildkitd.toml:
 > debug = true
 >
 > [grpc]
 >   debugAddress = "0.0.0.0:6060"
 >
 > [worker]
 >
 >   [worker.oci]
 >     gc = false
 >

and

~$ docker version && docker info
Client: Docker Engine - Community
 Version:           28.2.2
 API version:       1.50
 Go version:        go1.24.3
 Git commit:        e6534b4
 Built:             Fri May 30 12:07:27 2025
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          28.2.2
  API version:      1.50 (minimum version 1.24)
  Go version:       go1.24.3
  Git commit:       45873be
  Built:            Fri May 30 12:07:27 2025
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.27
  GitCommit:        05044ec0a9a75232cad458027ca83437aae3f4da
 runc:
  Version:          1.2.5
  GitCommit:        v1.2.5-0-g59923ef
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
Client: Docker Engine - Community
 Version:    28.2.2
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.24.0
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.36.2
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 1
  Running: 1
  Paused: 0
  Stopped: 0
 Images: 108
 Server Version: 28.2.2
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 CDI spec directories:
  /etc/cdi
  /var/run/cdi
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 05044ec0a9a75232cad458027ca83437aae3f4da
 runc version: v1.2.5-0-g59923ef
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
 Kernel Version: 6.8.0-60-generic
 Operating System: Ubuntu 24.04.2 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 48
 Total Memory: 47.03GiB
 Name: clint-vm-1
 ID: f696c190-9a0a-4598-b3a1-98f47405a8f0
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  ::1/128
  127.0.0.0/8
 Live Restore Enabled: false

razzmatazz avatar Jun 03 '25 15:06 razzmatazz

There is a very similar trace reported here: https://github.com/earthly/earthly/issues/1187#issuecomment-992601007

razzmatazz avatar Jun 03 '25 15:06 razzmatazz

Do you have a reproducer?

tonistiigi avatar Jun 04 '25 17:06 tonistiigi

Sadly no, the issue is intermittent and goes away after killing the cache, which I had to do to unblock production CI/CD :(

razzmatazz avatar Jun 05 '25 11:06 razzmatazz

Hi, I'm having the exact same issue. It only happens when I'm trying to push the cache to a registry. This is the --print output for my image:

      "cache-from": [
        {
          "ref": "XXX.dkr.ecr.eu-west-1.amazonaws.com/<IMAGE>:cache",
          "type": "registry"
        }
      ],
      "cache-to": [
        {
          "mode": "max",
          "ref": "XXX.dkr.ecr.eu-west-1.amazonaws.com/<IMAGE>:cache",
          "type": "registry"
        }
      ],
      "output": [
        {
          "type": "registry"
        }
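(For reference, a resolved definition like the snippet above can be printed without running the build; the target name here is just a placeholder.)

# Print the resolved cache-from/cache-to/output configuration for a bake/compose target.
docker buildx bake --print test_service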
Client: Docker Engine - Community
 Version:           28.2.2
 API version:       1.50
 Go version:        go1.24.3
 Git commit:        e6534b4
 Built:             Fri May 30 12:07:28 2025
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          28.2.2
  API version:      1.50 (minimum version 1.24)
  Go version:       go1.24.3
  Git commit:       45873be
  Built:            Fri May 30 12:07:28 2025
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.27
  GitCommit:        05044ec0a9a75232cad458027ca83437aae3f4da
 runc:
  Version:          1.2.5
  GitCommit:        v1.2.5-0-g59923ef
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.24.0
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.36.2
    Path:     /home/user/.docker/cli-plugins/docker-compose

It's a large image with a few multi-stage steps, and the cache is around 3GB. I was able to push it once to the registry and it didn't take long, but now it always gets stuck at "preparing build cache for export"; yesterday I left it running for ~5 hours and it still didn't finish.

Also, even after I cancel the command, the buildkitd process keeps running forever, taking ~40% CPU.

israelglar avatar Jun 06 '25 10:06 israelglar

Hello, I believe I am seeing this as well. I tried updating to 0.23.1 but am still seeing the same issue.

joshzarrabi-picnic avatar Jun 23 '25 22:06 joshzarrabi-picnic

I'm seeing this issue with raw buildctl and buildkitd in 0.23.1 (both rootless and non-rootless mode), using multiple Docker Hub image imports/exports such as:

  • --export-cache type=registry,ref=org/repo:tag,mode=max,compression=zstd,force-compression=true,compression-level=3,oci-mediatypes=true,ignore-error=true

I don't have specific steps to reproduce yet, but it's happening extremely consistently now.
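To make that concrete, a full buildctl invocation of roughly that shape would be something like the following; the context paths and the org/repo:tag refs are placeholders rather than my actual values:

buildctl build \
  --frontend dockerfile.v0 \
  --local context=. --local dockerfile=. \
  --import-cache type=registry,ref=org/repo:tag \
  --export-cache type=registry,ref=org/repo:tag,mode=max,compression=zstd,force-compression=true,compression-level=3,oci-mediatypes=true,ignore-error=true \
  --output type=image,name=org/repo:latest,push=true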

korbin avatar Jun 24 '25 16:06 korbin

Deleting the build cache from the registry seems to recover it for me, only for it to regress after some time. It seems like some sort of cache data corruption issue in these data structures, where cyclic(?) references are introduced accidentally and the validation logic then gets stuck iterating over those loops.

razzmatazz avatar Jun 25 '25 05:06 razzmatazz

@razzmatazz yeah, the only thing that really fixes it for me is totally deleting my Jenkins workers and starting from scratch.

joshzarrabi-picnic avatar Jun 25 '25 14:06 joshzarrabi-picnic

Rolling back to 0.22.x seems to have fixed this problem for us.

korbin avatar Jun 27 '25 04:06 korbin

Rolling back to 0.22.x seems to have fixed this problem for us.

Thanks, we were plagued by this issue all last week, and the moment we reverted back to 0.22.x (both server & client) everything settled down for us.

We tried various things: purging the cache in the registry, wiping/resizing our local cache volumes, tuning garbage collection policies, ignoring cache errors, etc., but nothing really made the problem go away permanently.

hrivera-ntap avatar Jul 02 '25 14:07 hrivera-ntap

If you can reproduce this and think it is a regression, can you bisect the issue to find the breaking point?

https://github.com/moby/buildkit/blob/master/.github/issue_reporting_guide.md#regressions

tonistiigi avatar Jul 02 '25 21:07 tonistiigi

Rolling back to 0.22.x seems to have fixed this problem for us.

Actually, I have been seeing this (intermittently) for about a year or two now, so I am not sure whether this is a recent regression.

razzmatazz avatar Jul 04 '25 10:07 razzmatazz

I'm not 100% sure, but it seems I'm experiencing the same issue.

My use case: In a Docker-in-Docker environment, I prepare and publish 100+ custom Kibana Docker images.

I use:

  BUILDX_VER="v0.25.0"
  BUILDKIT_VER="v0.23.2"

I create a builder like this:

docker buildx create \
    --name        "$BUILDER_NAME" \
    --driver      docker-container \
    --driver-opt  "image=moby/buildkit:$BUILDKIT_VER" \
    --buildkitd-flags "--debug --trace" \
    --platform    linux/amd64,linux/arm64 \
    --use --bootstrap

The image is built and pushed like this:

+ docker buildx build --no-cache --platform linux/arm64,linux/amd64 --push --build-arg KBN_VERSION=8.14.1 --build-arg ROR_PLUGIN_PATH=builds/readonlyrest_kbn_universal-1.64.2_es8.14.1.zip -f Dockerfile -t beshultd/kibana-readonlyrest:8.14.1-ror-1.64.2 -t beshultd/kibana-readonlyrest:8.14.1-ror-latest .
#0 building with "ror_kbn_builder_1751882513" instance using docker-container driver

What do I experience? After releasing several images, the build gets stuck, seemingly in the exporting image phase. See the attached log.

log0001.txt

Previously: With old versions of Buildx and Buildkit, I had this: https://github.com/moby/buildkit/issues/5784#issuecomment-2906722423

Workaround: Every 5 published versions, I remove the builder and its container and create a new one.
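That workaround boils down to something like the following (using the variables from above; as far as I can tell, docker buildx rm also removes the builder's container):

# Drop the old builder and its container...
docker buildx rm "$BUILDER_NAME"
# ...then recreate it as shown earlier.
docker buildx create \
    --name        "$BUILDER_NAME" \
    --driver      docker-container \
    --driver-opt  "image=moby/buildkit:$BUILDKIT_VER" \
    --buildkitd-flags "--debug --trace" \
    --platform    linux/amd64,linux/arm64 \
    --use --bootstrap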

Version information:

 docker buildx version && docker buildx inspect
github.com/docker/buildx v0.25.0 faaea65da4ba0e58a13cd9cadcb950c51cf3b3c9
Name:          ror_kbn_builder_1751884090
Driver:        docker-container
Last Activity: 2025-07-07 10:28:14 +0000 UTC

Nodes:
Name:                  ror_kbn_builder_17518840900
Endpoint:              unix:///var/run/docker.sock
Driver Options:        image="moby/buildkit:v0.23.2"
Status:                running
BuildKit daemon flags: --debug --trace --allow-insecure-entitlement=network.host
BuildKit version:      v0.23.2
Platforms:             linux/amd64*, linux/arm64*, linux/arm/v7, linux/arm/v6
Features:
 Automatically load images to the Docker Engine image store: false
 Cache export:                                               true
 Direct push:                                                true
 Docker exporter:                                            true
 Multi-platform build:                                       true
 OCI exporter:                                               true
Labels:
 org.mobyproject.buildkit.worker.executor:         oci
 org.mobyproject.buildkit.worker.hostname:         280ecd186224
 org.mobyproject.buildkit.worker.network:          host
 org.mobyproject.buildkit.worker.oci.process-mode: sandbox
 org.mobyproject.buildkit.worker.selinux.enabled:  false
 org.mobyproject.buildkit.worker.snapshotter:      overlayfs
GC Policy rule#0:
 All:            false
 Filters:        type==source.local,type==exec.cachemount,type==source.git.checkout
 Keep Duration:  48h0m0s
 Max Used Space: 488.3MiB
GC Policy rule#1:
 All:            false
 Keep Duration:  1440h0m0s
 Reserved Space: 7.451GiB
 Max Used Space: 55.88GiB
 Min Free Space: 13.97GiB
GC Policy rule#2:
 All:            false
 Reserved Space: 7.451GiB
 Max Used Space: 55.88GiB
 Min Free Space: 13.97GiB
GC Policy rule#3:
 All:            true
 Reserved Space: 7.451GiB
 Max Used Space: 55.88GiB
 Min Free Space: 13.97GiB
docker version && docker info
Client:
 Version:           26.1.4
 API version:       1.45
 Go version:        go1.21.11
 Git commit:        5650f9b
 Built:             Wed Jun  5 11:27:58 2024
 OS/Arch:           linux/arm64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          26.1.4
  API version:      1.45 (minimum version 1.24)
  Go version:       go1.21.11
  Git commit:       de5c9cf
  Built:            Wed Jun  5 11:29:18 2024
  OS/Arch:          linux/arm64
  Experimental:     false
 containerd:
  Version:          v1.7.18
  GitCommit:        ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
 runc:
  Version:          1.1.12
  GitCommit:        v1.1.12-0-g51d5e94
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
Client:
 Version:    26.1.4
 Context:    default
 Debug Mode: true
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.25.0
    Path:     /usr/local/lib/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.27.0
    Path:     /usr/local/lib/docker/cli-plugins/docker-compose

Server:
 Containers: 1
  Running: 1
  Paused: 0
  Stopped: 0
 Images: 1
 Server Version: 26.1.4
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.1.0-21-arm64
 Operating System: Ubuntu 20.04.6 LTS
 OSType: linux
 Architecture: aarch64
 CPUs: 4
 Total Memory: 7.567GiB
 Name: 8747dd7514fa
 ID: 14918813-099e-43ca-bc02-9602a7aa0fbe
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  File Descriptors: 41
  Goroutines: 58
  System Time: 2025-07-07T10:29:51.984889697Z
  EventsListeners: 0
 Username: readonlyrestkbn
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine

coutoPL avatar Jul 07 '25 10:07 coutoPL

I couldn't create a reproducible environment outside my project, but I kept hitting the issue when running what I have, so I created a PR that fixed it for my project: #6082. I'm assuming it has something to do with multi-stage builds, because it happens more consistently with some of my images, but that's just a guess.

@razzmatazz, your profiling data helped narrow it down. @coutoPL and @korbin, since you seem to have this happening consistently, maybe you can give this a go?

israelglar avatar Jul 11 '25 16:07 israelglar

I understand that my PR doesn't completely solve the issue, but I have good news: I was able to create a reproducible environment, @tonistiigi. The first and second time I run the build command it works fine, but on the 3rd and 4th runs it gets stuck in the checkLoops function.

I reproduced this while using ECR as the cache registry.

The build command is stuck on

 => [test_sub1] exporting cache to registry
 => => preparing build cache for export

To reproduce

Run this build command 3-4 times: docker compose -f docker-compose.yml build test_service

Use this docker-compose.yml file (a convenience loop for the repeated runs is sketched after it):

services:
  test_service:
    build:
      cache_from:
        - XXXXXXXXXXXX.dkr.ecr.eu-west-1.amazonaws.com/test-repo:test_service
      cache_to:
        - type=registry,mode=max,ref=XXXXXXXXXXXX.dkr.ecr.eu-west-1.amazonaws.com/test-repo:test_service
      x-bake:
        output: type=registry
      additional_contexts:
        test_sub1: service:test_sub1
        test_sub2: service:test_sub2
      dockerfile_inline: |
        FROM alpine as deps
        COPY --from=test_sub1 /file1 /file1
        COPY --from=test_sub2 /deps1 /deps1

        FROM deps as prod-deps
        RUN echo "This is a production dependency stage" > /file1

        FROM alpine as dev
        COPY --from=deps /file1 /file1
        COPY --from=deps /deps1 /deps1

        FROM alpine as prod
        COPY --from=deps /deps1 /deps1
        COPY --from=prod-deps /file1 /file1

  test_sub1:
    build:
      cache_from:
        - XXXXXXXXXXXX.dkr.ecr.eu-west-1.amazonaws.com/test-repo:test_sub1
      cache_to:
        - type=registry,mode=max,ref=XXXXXXXXXXXX.dkr.ecr.eu-west-1.amazonaws.com/test-repo:test_sub1
      x-bake:
        output: type=registry
      additional_contexts:
        test_sub2: service:test_sub2
        test_innersub1: service:test_innersub1
      dockerfile_inline: |
        FROM alpine as deps
        COPY --from=test_sub2 /file1 /file1
        COPY --from=test_innersub1 /deps1 /deps1
        FROM alpine
        COPY --from=deps /file1 /file1
        RUN touch /file2

  test_sub2:
    build:
      cache_from:
        - XXXXXXXXXXXX.dkr.ecr.eu-west-1.amazonaws.com/test-repo:test_sub2
      cache_to:
        - type=registry,mode=max,ref=XXXXXXXXXXXX.dkr.ecr.eu-west-1.amazonaws.com/test-repo:test_sub2
      x-bake:
        output: type=registry
      dockerfile_inline: |
        FROM alpine as deps
        RUN touch /deps1
        RUN touch /deps2

        FROM alpine
        COPY --from=deps /deps1 /deps1
        RUN touch /file1

  test_innersub1:
    build:
      cache_from:
        - XXXXXXXXXXXX.dkr.ecr.eu-west-1.amazonaws.com/test-repo:test_innersub1
      cache_to:
        - type=registry,mode=max,ref=XXXXXXXXXXXX.dkr.ecr.eu-west-1.amazonaws.com/test-repo:test_innersub1
      x-bake:
        output: type=registry
      dockerfile_inline: |
        FROM alpine as deps
        RUN touch /deps1
        RUN touch /deps2

        FROM alpine
        COPY --from=deps /deps1 /deps1
        RUN touch /file1
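A trivial way to script the repeated runs, using the same compose file and service name as above:

# Per the note above, the build tends to get stuck on the 3rd or 4th run.
for i in 1 2 3 4; do
  docker compose -f docker-compose.yml build test_service
done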

israelglar avatar Jul 14 '25 14:07 israelglar

Ran the repro steps from @israelglar (thank you!), with some modifications to push to my own ECR and such of course.

Info: BuildKit: github.com/moby/buildkit v0.20.0 121ecd5b9083b8eef32183cd404dd13e15b4a3df

It ran fine 6 times, so I shrank my cache volume down to 2GiB and tried again. The builds still ran fine for another 6 rounds, until I decided to force one of the cross-target files to be large. In the example above, I just changed the RUN touch /file1 instruction for test_sub2 to RUN dd if=/dev/urandom of=/file1 bs=1024 count=1G, which of course generates a huge file and promptly filled up my buildkit volume.

From there, I reverted my change and re-ran the build, and now it fails every time. The large partial blob was removed from the cache, per the buildkitd logs and the volume stats, but every attempt to build now hangs on [test_service] exporting cache to registry.

Watching the debug logs, I see a pretty consistent set of entries about removing content, then a 'schedule content cleanup' message, and then nothing at all until I manually kill the build. Once it's killed, I get the expected messages about sessions finishing, and things look to be back to normal. I let it go for 15 minutes and got the same result: a void in the logs.

An unrelated simple build with ECR caching enabled worked fine repeatedly, so it certainly appears to be something about that specific blob, cached from that specific image, that has things hung up.

As a quick check, I added a couple of simple RUN echo "hi there" instructions around the touch /file1 bit that I had tweaked earlier, and it still hangs in the same place. I then dug into the actual blob contents to see if the allegedly removed blobs were still around; looking up their digests, I realized the same 4 blobs are 'removed' in the logs on each build, yet they do still seem to exist in the cache.

From there I opted to just run buildctl prune-histories, and while most of the builds were purged from the history, it did throw a handful of errors:

error: 9 errors occurred:
        * lease "ref_u8c4dw5qbl4gil9juyt7a3kzc": not found
        * lease "ref_vii9iacmuzi8yvqdtaug206rg": not found
        * lease "ref_2urka2vkq026jynr24f6u7soq": not found
        [ +6 more ]

Not sure I can find time to do a better deep dive, but it's certainly a reliable reproduction method.

[Edit] I forgot my config; the only interesting bit is the GC setup:

debug = true
root = "/var/lib/buildkit"

[log]
  format = "json"

[history]
  maxEntries = 500

[worker.oci]
  enabled = true
  gc = true
  maxUsedSpace = "90%"

[worker.containerd]
  enabled = false

cboggs avatar Aug 25 '25 18:08 cboggs

Huh, interestingly, I was able to restart buildkitd and that freed up the previously errored build histories, and then I was able to prune the full cache... but a 'fresh' attempt at the docker compose build still hung, and three of the 4 build histories it created can't be deleted (same 'lease not found' error). Will do some more digging later, but thought I'd add that bit at least.
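For anyone else poking at this, the restart-and-prune sequence above was roughly the following; the systemd unit name is an assumption about how buildkitd is run here, and the address matches the debug info output below:

# Restart the daemon (assuming it runs as a systemd service named "buildkitd")...
sudo systemctl restart buildkitd
# ...then clear the local build cache entirely.
buildctl --addr tcp://localhost:9800 prune --all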

cboggs avatar Aug 25 '25 18:08 cboggs

Great news! @tonistiigi either found a solution or is on a great path. His PR #6129 fixes the issue on my end. Maybe you can take a look at it, @cboggs?

These are the results I've got:

==== Runs with current cache storage ===
Run #1 Duration: 12 seconds (push cache)
Run #2 Duration: 4 seconds (only pull)
Run #3 Duration: 4 seconds (only pull)
Run #4 Duration: 3 seconds (only pull)
Run #5 Duration: 5 seconds (only pull)
Run #6 Duration: 60 seconds + canceled (stuck)


==== Runs with new cache storage ====
Run #1 Duration: 9 seconds (push cache)
Run #2 Duration: 4 seconds (only pull)
Run #3 Duration: 3 seconds (only pull)
Run #4 Duration: 4 seconds (only pull)
Run #5 Duration: 3 seconds (only pull)
Run #6 Duration: 3 seconds (only pull)
Run #7 Duration: 3 seconds (only pull)
Run #8 Duration: 3 seconds (only pull) 
Run #9 Duration: 3 seconds (only pull)
Run #10 Duration: 3 seconds (only pull)

israelglar avatar Sep 03 '25 16:09 israelglar

@israelglar Sweet! I'll give it a shot. It took me a bit to get it to repro on my existing setup for whatever reason, but I'll get the new build cranked out and running, then follow up!

cboggs avatar Sep 09 '25 22:09 cboggs

Woohoo! @israelglar @tonistiigi Confirmed that the branch for #6129 does the trick.

  1. Reproduced the issue on 0.20.0 just to be sure I wasn't fudging the results.
  2. Built binaries from Tõnis' branch via docker buildx bake binaries and copied them to my test server (a rough sketch of these commands follows after this list).
  3. Stopped buildkitd on the server, rm -rf buildkit root dir, moved new binaries into /usr/local/bin, restarted buildkitd.
> buildctl --addr tcp://localhost:9800 debug info

BuildKit: github.com/moby/buildkit v0.24.0-rc2-18-gaed2e4a19 aed2e4a1929f330b42a9557fd152510f1668f390
  4. Re-ran repro steps from before, several times, and observed:
     a. the build fails when the disk fills up, as it should
     b. buildctl commands are still usable while the disk is full
     c. I can manually buildctl prune successfully
     d. if I wait a minute or so, buildkit cleans up the 'breaking' snapshot on its own
     e. subsequent builds without the forced failure succeed just fine, both with / without changes to the layers and with / without existing cache manifests in ECR
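For completeness, step 2 above looked roughly like this; the local branch name and the output path under ./bin are assumptions on my side rather than exact commands from my run:

git clone https://github.com/moby/buildkit && cd buildkit
# Fetch the PR head and check it out under a local name.
git fetch origin pull/6129/head:cache-loops-fix && git checkout cache-loops-fix
# Build buildkitd/buildctl via the repo's bake target.
docker buildx bake binaries
# Binaries land under ./bin/ (exact subdirectory may vary by version); adjust the path if needed.
sudo install -m 0755 bin/build/buildkitd bin/build/buildctl /usr/local/bin/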

All in all... seems good to go! I'll add a comment to the linked PR pointing to this comment to indicate that it should likely close this issue. :-)

cboggs avatar Sep 10 '25 19:09 cboggs