
Image export to local store via buildx hangs indefinitely

rrjjvv opened this issue 3 months ago · 14 comments

buildx-trace.txt docker-trace.txt

Contributing guidelines and issue reporting guide

Well-formed report checklist

  • [x] I have confirmed that the documentation does not mention anything about my problem
  • [x] I have confirmed that there are no open or closed issues related to my problem
  • [x] I have provided version information about my environment and done my best to provide a reproducer

Description of bug

In our CI environments, there have been occurrences of a docker build -t <...> . that appears to build and save successfully but never actually finishes. The command looks "stuck"; build timeouts eventually kill the process (typically after 30 minutes). When this occurs, all other builds for the same project behave the same way ("build" just fine, but never finish) 100% of the time. Builds from other projects appear unaffected.

This is very similar to #6131, and quite possibly a true duplicate. But because the tooling is different, I thought it better to create a new issue (and also so it can be moved to moby, if that's where this belongs).

Reproduction

Unfortunately, I have not been able to reliably reproduce this. We've seen four occurrences of this in the wild across two versions. There were no obvious commonalities, and none of the affected projects are overly complex (not intensive to build, not using advanced buildkit features/flags/options, low line count, using the default builder, etc.). Within the last month or two we moved to the containerd image store and had never encountered the issue prior to that; I'm not sure if that's relevant or just coincidence.

I have only managed to reproduce this in a controlled fashion once, and it took over 16k iterations over two days. My initial guess was that a previous build timeout had caused docker/buildkit to get into a bad state. I created a Dockerfile that would take ten seconds to build, then created a CI job that would time out (thus killing the docker client) every third build. Sample output:

Timeout set to expire in 15 sec
+ docker buildx build --build-arg '_num=23833' -t junk --load .
#0 building with "default" instance using docker driver

#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 128B done
#1 DONE 0.1s

#2 [internal] load metadata for docker.io/library/alpine:latest
#2 DONE 0.1s

#3 [internal] load .dockerignore
#3 DONE 0.0s

#3 [internal] load .dockerignore
#3 transferring context: 2B done
#3 DONE 0.1s

#4 [1/3] FROM docker.io/library/alpine:latest@sha256:4bcff63911fcb4448bd4fdacec207030997caf25e9bea4045fa6c8c44de311d1
#4 resolve docker.io/library/alpine:latest@sha256:4bcff63911fcb4448bd4fdacec207030997caf25e9bea4045fa6c8c44de311d1 0.0s done
#4 CACHED

#5 [2/3] RUN echo "23833" > /tmp/num
#5 DONE 0.4s

#6 [3/3] RUN sleep 10
#6 DONE 10.4s

#7 exporting to image
#7 exporting layers
#7 exporting layers 0.3s done
#7 exporting manifest sha256:9e9676d99c89acd961d0d4ebef2372c0e5a8193301e406da83c665b869d8b366 0.0s done
#7 exporting config sha256:f30382e2a1fe78ec2ba8eb8d3d020612bfadbafd9ec27e83b50a9c879bc19742 0.0s done
#7 exporting attestation manifest sha256:3ef78c1a99d0c4a4224abe6a557c9bdc08bf4635c1099eab8acf499d2c775122 0.1s done
#7 exporting manifest list sha256:f6acd982e9dd4354fa32522bc19cbd954106d87287e35d808cebcb7f3e15fda4
#7 exporting manifest list sha256:f6acd982e9dd4354fa32522bc19cbd954106d87287e35d808cebcb7f3e15fda4 0.0s done
#7 naming to docker.io/library/junk:latest done
#7 unpacking to docker.io/library/junk:latest 0.1s done
#7 DONE 0.6s
<---------  This is where the build hangs indefinitely until killed  ----------->
Cancelling nested steps due to timeout
Sending interrupt signal to process
ERROR: failed to build: failed to solve: Canceled: context canceled
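
For reference, a minimal sketch of the setup described above, reconstructed from the log (the real job ran under Jenkins; the file contents, timeout handling, and loop below are illustrative, not the exact CI job):

cat > Dockerfile <<'EOF'
FROM alpine:latest
ARG _num
RUN echo "$_num" > /tmp/num
RUN sleep 10
EOF

n=0
while true; do
  n=$((n + 1))
  if [ $((n % 3)) -eq 0 ]; then
    # every third build, kill the client partway through, as the 15-second CI timeout did
    timeout --signal=INT 15 docker buildx build --build-arg "_num=$n" -t junk --load . || true
  else
    docker buildx build --build-arg "_num=$n" -t junk --load .
  fi
done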

My other guess is that our (docker) cleanup routines are involved (i.e., concurrent adds/deletes).

All the observations below strongly suggest that the builds that "hang" until killed are still running (conceptually and literally).

Build history

As mentioned, it took 16k builds before I was able to trigger the issue, which happened on Saturday night. I didn't notice until Monday morning, after another 7k builds had run, which is why some of these figures are so extreme.

On the host there is virtually no docker-related activity, yet there are over 5k "running" buildkit builds (which I assume are all related to this reproduction):

# docker buildx history ls --filter status=running | sed 1d | wc -l
5026
# docker buildx history ls --filter status=running | sed 1d | head -n5
qn92ozxmdey4l9er97jouw9vo             Running   34 hours ago   2055m 31s+   
yyld3tjznvz9fcodvnpj1ea8p             Running   34 hours ago   2055m 52s+   
zejsagmbp47kl4qzhgmpwta6c             Running   34 hours ago   2056m 25s+   
o6fwaq4rdhiuzro4eysdj0lbr             Running   34 hours ago   2056m 46s+   
y3a2tbfvbt38o8z524md6y40b             Running   34 hours ago   2057m 18s+
# docker buildx history ls --filter status=running | sed 1d | tail -n5
rne18nqs4v0xk3o2di0u5n8ue             Running   3 days ago     4327m 23s+   
y86jnwnej9ml6auoan9iq9rgw             Running   3 days ago     4327m 47s+   
km7ac8r8awaf5xr3oojpapnzc             Running   3 days ago     4328m 20s+   
frwu4n2n0k1fr1goxy1o8k33d             Running   3 days ago     4328m 45s+   
wtg2hh8l4mlrtrdtnny5vy08a             Running   3 days ago     4329m 44s+

<-- oldest entry -->
# docker buildx history inspect wtg2hh8l4mlrtrdtnny5vy08a
Dockerfile:	Dockerfile

Started:	2025-09-06 19:25:28
Duration:	4331m  4s (running)
Build Steps:	0/0 (NaN% cached)

Image Resolve Mode:	local

BUILD ARG	VALUE
_num		16141

<-- newest entry -->
# docker buildx history inspect y3a2tbfvbt38o8z524md6y40b
Dockerfile:	Dockerfile

Started:	2025-09-08 09:17:34
Duration:	2058m 26s (running)
Build Steps:	0/0 (NaN% cached)

Image Resolve Mode:	local

BUILD ARG	VALUE
_num		23828

Cache

The buildkit cache has an extremely high number of non-reclaimable entries that aren't in use, matching the behavior in the linked ticket:

# docker buildx du | awk '{if ($2 == "false" && NF == 3) { print $0 }}' | wc -l
20106

Performance

Despite the host being effectively idle (no actual build activity), the docker daemon is eating up a third (4GB) of the host's memory and 50% of the CPU (i.e., 400% on an eight-core machine).

I took a stacktrace dump of dockerd (which appears to be the equivalent of the pprof format you prefer). It's humongous; the docs say GitHub doesn't allow attachments that big, and I didn't know if you'd actually want it (but let me know):

# ls -hl /run/docker/goroutine-stacks-2025-09-08T165305-0600.log
-rw-r--r-- 1 root root 43M Sep  8 16:53 /run/docker/goroutine-stacks-2025-09-08T165305-0600.log
# wc -l /run/docker/goroutine-stacks-2025-09-08T165305-0600.log
526979 /run/docker/goroutine-stacks-2025-09-08T165305-0600.log
# grep '^goroutine ' /var/run/docker/goroutine-stacks-2025-09-08T165305-0600.log | wc -l
32646
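
(For anyone needing the same kind of dump: dockerd writes a goroutine stack dump when it receives SIGUSR1; on a default install it should land in /run/docker as above, though the path may differ elsewhere.)

kill -USR1 "$(pidof dockerd)"
ls -lt /run/docker/goroutine-stacks-*.log | head -n1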

This seemed pretty telling to me:

[screenshot of the goroutine dump summary]

That number (5026) matches exactly the number of so-called running buildkit builds; it seems as if they can't agree on what needs to be done. I haven't taken multiple snapshots to see if that distribution changes (i.e., whether they are constantly undoing each other's work, or whether it's closer to an actual deadlock), but I can if it'll help.
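
For anyone wanting to slice the dump the same way, something like this summarizes goroutines by blocked state and counts the Begin/Rollback split (the filename is the one from above; the patterns match the stacks shown below):

DUMP=/run/docker/goroutine-stacks-2025-09-08T165305-0600.log

# count goroutines per blocking state, ignoring the wait duration
grep '^goroutine ' "$DUMP" | sed -E 's/^goroutine [0-9]+ \[([^],]+).*/\1/' | sort | uniq -c | sort -rn | head

# how many are parked on the bbolt Begin vs. Rollback paths
grep -cF 'bbolt.(*DB).beginTx' "$DUMP"
grep -cF 'bbolt.(*DB).removeTx' "$DUMP"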

Here's an example from the 'Begin' side:

goroutine 2050617244 [sync.Mutex.Lock]:
internal/sync.runtime_SemacquireMutex(0x5576571a46c0?, 0xd0?, 0xc00087cf5c?)
        /usr/local/go/src/runtime/sema.go:95 +0x25
internal/sync.(*Mutex).lockSlow(0xc00087cf58)
        /usr/local/go/src/internal/sync/mutex.go:149 +0x15d
internal/sync.(*Mutex).Lock(...)
        /usr/local/go/src/internal/sync/mutex.go:70
sync.(*Mutex).Lock(...)
        /usr/local/go/src/sync/mutex.go:46
go.etcd.io/bbolt.(*DB).beginTx(0xc00087cd88)
        /root/rpmbuild/BUILD/src/engine/vendor/go.etcd.io/bbolt/db.go:772 +0x46
go.etcd.io/bbolt.(*DB).Begin(0xc00087cd88, 0x0)
        /root/rpmbuild/BUILD/src/engine/vendor/go.etcd.io/bbolt/db.go:758 +0x1cc
go.etcd.io/bbolt.(*DB).View(0x557654026f50?, 0xc08ed3aa20)
        /root/rpmbuild/BUILD/src/engine/vendor/go.etcd.io/bbolt/db.go:923 +0x30
github.com/moby/buildkit/solver/bboltcachestorage.(*Store).Load(0xc000b02710, {0xc011ce87d0, 0x47}, {0xc036e7a3c0, 0x34})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/bboltcachestorage/storage.go:124 +0xd3
github.com/moby/buildkit/solver.(*exporter).ExportTo(0xc01f319440, {0x557655b5c748, 0xc085117d10}, {0x557655b55be0, 0xc0202dee00}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:125 +0x58f
github.com/moby/buildkit/solver.(*exporter).ExportTo(0xc080653040, {0x557655b5c748, 0xc085117d10}, {0x557655b55be0, 0xc0202dee00}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:187 +0xef7
github.com/moby/buildkit/solver.(*mergedExporter).ExportTo(0xee0505e1c?, {0x557655b5c748, 0xc085117d10}, {0x557655b55be0, 0xc0202dee00}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:275 +0x131
github.com/moby/buildkit/solver.(*exporter).ExportTo(0xc0499bce40, {0x557655b5c748, 0xc085117c50}, {0x557655b55be0, 0xc0202dee00}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:187 +0xef7
github.com/moby/buildkit/solver.(*mergedExporter).ExportTo(0x557655b62a40?, {0x557655b5c748, 0xc085117c50}, {0x557655b55be0, 0xc0202dee00}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:275 +0x131
github.com/moby/buildkit/solver/llbsolver.NewProvenanceCreator.func1({0x557655b5c748, 0xc0714d4690})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/llbsolver/provenance.go:421 +0x1a4
github.com/moby/buildkit/solver/llbsolver.(*ProvenanceCreator).Predicate(0xc085117b00, {0x557655b5c748, 0xc0714d4690})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/llbsolver/provenance.go:478 +0xb5
github.com/moby/buildkit/solver/llbsolver.(*Solver).recordBuildHistory.func1.2({0x557654ec97c9?, 0xc02d2e80e0?}, {0x557655b62a40, 0xc06f685360}, 0xc0acc7cf00)
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/llbsolver/solver.go:248 +0x20b
github.com/moby/buildkit/solver/llbsolver.(*Solver).recordBuildHistory.func1.3()
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/llbsolver/solver.go:286 +0x5d
golang.org/x/sync/errgroup.(*Group).add.func1()
        /root/rpmbuild/BUILD/src/engine/vendor/golang.org/x/sync/errgroup/errgroup.go:130 +0x7e
created by golang.org/x/sync/errgroup.(*Group).add in goroutine 2050616659
        /root/rpmbuild/BUILD/src/engine/vendor/golang.org/x/sync/errgroup/errgroup.go:98 +0x76

and the 'Rollback' side:

goroutine 1964312991 [sync.Mutex.Lock]:
internal/sync.runtime_SemacquireMutex(0x5576558c7a40?, 0xaa?, 0x34?)
        /usr/local/go/src/runtime/sema.go:95 +0x25
internal/sync.(*Mutex).lockSlow(0xc00087cf58)
        /usr/local/go/src/internal/sync/mutex.go:149 +0x15d
internal/sync.(*Mutex).Lock(...)
        /usr/local/go/src/internal/sync/mutex.go:70
sync.(*Mutex).Lock(...)
        /usr/local/go/src/sync/mutex.go:46
go.etcd.io/bbolt.(*DB).removeTx(0xc00087cd88, 0xc01b6621c0)
        /root/rpmbuild/BUILD/src/engine/vendor/go.etcd.io/bbolt/db.go:857 +0x76
go.etcd.io/bbolt.(*Tx).close(0xc01b6621c0)
        /root/rpmbuild/BUILD/src/engine/vendor/go.etcd.io/bbolt/tx.go:368 +0x168
go.etcd.io/bbolt.(*Tx).nonPhysicalRollback(0xc01b6621c0?)
        /root/rpmbuild/BUILD/src/engine/vendor/go.etcd.io/bbolt/tx.go:319 +0x4d
go.etcd.io/bbolt.(*Tx).Rollback(0xc00087cd88?)
        /root/rpmbuild/BUILD/src/engine/vendor/go.etcd.io/bbolt/tx.go:307 +0x25
go.etcd.io/bbolt.(*DB).View(0x557654026f50?, 0xc0e2f9c5d0)
        /root/rpmbuild/BUILD/src/engine/vendor/go.etcd.io/bbolt/db.go:942 +0x8c
github.com/moby/buildkit/solver/bboltcachestorage.(*Store).Load(0xc000b02710, {0xc011ce87d0, 0x47}, {0xc036e7a3c0, 0x34})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/bboltcachestorage/storage.go:124 +0xd3
github.com/moby/buildkit/solver.(*exporter).ExportTo(0xc01f319440, {0x557655b5c748, 0xc03ba7d740}, {0x557655b55be0, 0xc028e116a0}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:125 +0x58f
github.com/moby/buildkit/solver.(*exporter).ExportTo(0xc027a73580, {0x557655b5c748, 0xc03ba7d740}, {0x557655b55be0, 0xc028e116a0}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:187 +0xef7
github.com/moby/buildkit/solver.(*mergedExporter).ExportTo(0xee04f1135?, {0x557655b5c748, 0xc03ba7d740}, {0x557655b55be0, 0xc028e116a0}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:275 +0x131
github.com/moby/buildkit/solver.(*exporter).ExportTo(0xc031e0d700, {0x557655b5c748, 0xc03ba7d680}, {0x557655b55be0, 0xc028e116a0}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:187 +0xef7
github.com/moby/buildkit/solver.(*mergedExporter).ExportTo(0x557655b62a40?, {0x557655b5c748, 0xc03ba7d680}, {0x557655b55be0, 0xc028e116a0}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:275 +0x131
github.com/moby/buildkit/solver/llbsolver.NewProvenanceCreator.func1({0x557655b5c748, 0xc03ba7cde0})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/llbsolver/provenance.go:421 +0x1a4
github.com/moby/buildkit/solver/llbsolver.(*ProvenanceCreator).Predicate(0xc03ba7d5f0, {0x557655b5c748, 0xc03ba7cde0})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/llbsolver/provenance.go:478 +0xb5
github.com/moby/buildkit/solver/llbsolver.(*Solver).recordBuildHistory.func1.2({0x557654ec97c9?, 0xc0001168a8?}, {0x557655b62a40, 0xc032181b80}, 0xc026b760c0)
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/llbsolver/solver.go:248 +0x20b
github.com/moby/buildkit/solver/llbsolver.(*Solver).recordBuildHistory.func1.3()
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/llbsolver/solver.go:286 +0x5d
golang.org/x/sync/errgroup.(*Group).add.func1()
        /root/rpmbuild/BUILD/src/engine/vendor/golang.org/x/sync/errgroup/errgroup.go:130 +0x7e
created by golang.org/x/sync/errgroup.(*Group).add in goroutine 1964312482
        /root/rpmbuild/BUILD/src/engine/vendor/golang.org/x/sync/errgroup/errgroup.go:98 +0x76

Full dumps for the docker client itself and the buildx plugin (across two executions by necessity) when they are "hung" are attached.

I can probably keep this host in its current state for another day if there are diagnostics you'd find useful. If you want the entire stack dump, I can put it in a repo or something.

Version information

Two of the five occurrences were on this version, the rest on an earlier version (probably 28.3.2).

# docker version
Client: Docker Engine - Community
 Version:           28.3.3
 API version:       1.51
 Go version:        go1.24.5
 Git commit:        980b856
 Built:             Fri Jul 25 11:37:02 2025
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          28.3.3
  API version:      1.51 (minimum version 1.24)
  Go version:       go1.24.5
  Git commit:       bea959c
  Built:            Fri Jul 25 11:33:59 2025
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.27
  GitCommit:        05044ec0a9a75232cad458027ca83437aae3f4da
 runc:
  Version:          1.2.5
  GitCommit:        v1.2.5-0-g59923ef
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

# docker buildx version 
github.com/docker/buildx v0.26.1 1a8287f

# docker buildx inspect 
Name:          default
Driver:        docker
Last Activity: 2025-09-10 03:00:06 +0000 UTC

Nodes:
Name:             default
Endpoint:         default
Status:           running
BuildKit version: v0.23.2
Platforms:        linux/amd64, linux/amd64/v2, linux/amd64/v3
Labels:
 org.mobyproject.buildkit.worker.containerd.namespace: moby
 org.mobyproject.buildkit.worker.containerd.uuid:      3343a640-79a9-499a-ab40-db2d41946a35
 org.mobyproject.buildkit.worker.executor:             containerd
 org.mobyproject.buildkit.worker.hostname:             ####################
 org.mobyproject.buildkit.worker.moby.host-gateway-ip: 172.17.0.1
 org.mobyproject.buildkit.worker.network:              host
 org.mobyproject.buildkit.worker.selinux.enabled:      false
 org.mobyproject.buildkit.worker.snapshotter:          overlayfs
GC Policy rule#0:
 All:            false
 Filters:        type==source.local,type==exec.cachemount,type==source.git.checkout
 Keep Duration:  48h0m0s
 Max Used Space: 1.029GiB
GC Policy rule#1:
 All:            false
 Keep Duration:  1440h0m0s
 Reserved Space: 7.451GiB
 Max Used Space: 953.7MiB
 Min Free Space: 3.725GiB
GC Policy rule#2:
 All:            false
 Reserved Space: 7.451GiB
 Max Used Space: 953.7MiB
 Min Free Space: 3.725GiB
GC Policy rule#3:
 All:            true
 Reserved Space: 7.451GiB
 Max Used Space: 953.7MiB
 Min Free Space: 3.725GiB

# docker info 
Client: Docker Engine - Community
 Version:    28.3.3
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.26.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.39.1
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 3
  Running: 3
  Paused: 0
  Stopped: 0
 Images: 7
 Server Version: 28.3.3
 Storage Driver: overlayfs
  driver-type: io.containerd.snapshotter.v1
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 CDI spec directories:
  /etc/cdi
  /var/run/cdi
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 05044ec0a9a75232cad458027ca83437aae3f4da
 runc version: v1.2.5-0-g59923ef
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.14.0-570.32.1.el9_6.x86_64
 Operating System: Rocky Linux 9.6 (Blue Onyx)
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 11.43GiB
 Name: #################
 ID: 641d878b-c0e9-486a-b05a-d275591b7c29
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  File Descriptors: 20162
  Goroutines: 32871
  System Time: 2025-09-09T20:45:43.785297833-06:00
  EventsListeners: 0
 Experimental: false
 Insecure Registries:
  ::1/128
  127.0.0.0/8
 Registry Mirrors:
  #################
 Live Restore Enabled: false

rrjjvv · Sep 10, 2025

+1 I've been following #6131 and running into both of these problems. They seem related.

zachspar · Sep 10, 2025

~I hadn't come across that one (thanks!), but definitely seems similar. I'll have to study it closer tomorrow to see if I can glean any clues for creating a consistent reproducer (that is eluding everybody).~ (I didn't realize that was the same issue I had linked to, clicked that link, then got it mixed up with a different issue I had open in another tab.)

As well, #4090 has the same symptoms, despite being an older ticket and a very specific scenario (shutdown) that's not relevant in my case.

rrjjvv · Sep 10, 2025

Can you test with https://github.com/moby/buildkit/pull/6129 ?

Edit: or maybe https://github.com/moby/buildkit/pull/6186

tonistiigi · Sep 11, 2025

@tonistiigi Do you have any hints/suggestions on how to force this issue to manifest? Or at least increase the likelihood?

All of our occurrences, including my one successful brute-force attempt, have been in CI (Jenkins), which of course shouldn't be directly relevant, but I've yet to see it outside of a Jenkins build. I have multiple experiments going non-stop with different variables (a rough sketch of one of them follows the list), namely:

  • multiple concurrent executions of 'the same' build (vs. sequential elsewhere)
  • concurrent prunes while building (vs. no deletions elsewhere)
  • for docker, 28.3.x (where all occurrences have happened), vs. 28.4
  • docker-container driver (with all debugging options) vs. default driver (where all occurrences have happened)
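
As an illustration, the "concurrent prunes while building" variant is essentially a loop like this (image name, counts, and intervals are arbitrary):

i=0
while true; do
  i=$((i + 1))
  docker buildx build --build-arg "_num=$i" -t junk --load . &
  # prune concurrently with the in-flight build
  docker buildx prune -f >/dev/null 2>&1
  wait
done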

These experiments haven't induced the behavior yet. If I can't trigger the bug in a controlled manner (and in a reasonable timeframe), the best I can do is validate that those patches don't make things worse.

Based on that diff, it sounds like the version (docker or buildkit) doesn't really matter, nor do concurrent builds (with the possible exception of the GC that happens after a build) or the driver. Should I focus on pruning and/or GC instead?

(Also, I still have that host in that 'bad' state to do any further diagnostics if it helps, but probably only one more day.)

rrjjvv · Sep 11, 2025

(Also, I still have that host in that 'bad' state to do any further diagnostics if it helps, but probably only one more day.)

If you add the patches above to the node that is already in "bad state", it should still be able to figure it out. Neither of these patches changes the storage on disk.

Most likely this is caused by loops or excessive branching in the cache graph that have accumulated over time, over many modifications of different areas of that graph. Although there was one repro that seemed to rely on letting the build error out by running out of disk space. That could be a coincidence, or it might be that some writes to internal databases were skipped in this error state as well, which led to something unexpected in the cache graph.

We have some internal tooling that verifies the consistency of the boltdb cache database (cc @jsternberg) that we could clean up and make available as open source. The benefit would be that, although a consistency error in the cache DB shouldn't immediately cause the cache export to hang or loop, it could help pinpoint the exact sequence of steps that triggered the weird state.

tonistiigi · Sep 11, 2025

Yeah, I was looking closer at the diff (esp. the test case), and the most straightforward way to reproduce it (by way of seeing buildkit history grow, goroutine count grow, etc.) would be to manipulate the underlying DB. But through previous experimentation unrelated to this issue (I've yet to decide whether to file something), I know buildkit doesn't open the DB in a way that allows others to write.
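
(For offline poking at the cache DB, the upstream bbolt CLI does work once the daemon is stopped and its exclusive lock released; this is only a sketch, and the DB path assumes the default docker root dir:)

go install go.etcd.io/bbolt/cmd/bbolt@latest
systemctl stop docker                               # release the exclusive file lock on the DB
bbolt buckets /var/lib/docker/buildkit/cache.db     # list top-level buckets
bbolt check   /var/lib/docker/buildkit/cache.db     # basic page/freelist integrity check only
systemctl start docker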

To your point about some repros requiring disk exhaustion, my one successful controlled reproduction was muddied by the fact the host was under extreme memory pressure minutes before the bug showed itself. I had no reason to think it was relevant, but the timing was uncanny.

If you add the patches above to the node that is already in "bad state", it should still be able to figure it out. Neither of these patches changes the storage on disk.

My concern wasn't being able to recover from the issue (a reboot has resolved it 100% of the time), but losing a live example of a host exhibiting the problem. That said, with the previous "in the wild" occurrences I've seen, a simple restart of dockerd/containerd wasn't sufficient to recover; an actual reboot was needed (which is why I've given so much attention to this, despite it happening 'only' four times across hundreds/thousands of builds). So tomorrow, with fresh eyes, I'll restart dockerd/containerd in the hope it's still in a bad state. If it is, I'll create a custom build with the patch; if it recovers by just restarting the daemons (and without a reboot), that will give a lot of confidence that the patch truly solves the problem.

Addendum: should I try each patch separately (and does order matter), or go all-in with both combined?

rrjjvv · Sep 11, 2025

Addendum: should I try each patch separately (and does order matter), or go all-in with both combined?

You can combine them

tonistiigi · Sep 11, 2025

I was hoping that my previously observed behavior (that recovery required a host reboot) would still hold true. But a daemon restart (prior to swapping out binaries) resulted in the daemon being in a good state. I have no idea whether it was the shutdown (releasing of locks) or the startup (DB housekeeping), or something else. But all of the 'running' buildkit jobs were gone, which hints that things were cleaned up at startup. As well, the CI job that triggered the condition worked just fine after the daemon restart.

The good news is that none of my other running experiments have triggered the actual hang, but all of them have exhibited a steadily increasing goroutine count (in the thousands), increasing memory usage, and increasing CPU usage even when no docker clients are active. So perhaps export hangs are just one outcome/manifestation of the underlying problem.
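
(The daemon reports its goroutine count in docker info when debug mode is on; sampling that drift over time looks something like the following, with an arbitrary interval:)

while true; do
  # NGoroutines is the field behind the "Goroutines:" line in docker info
  printf '%s %s\n' "$(date -Is)" "$(docker info --format '{{.NGoroutines}}')"
  sleep 60
done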

I have new binaries running on three hosts where the various experiments are running. But it was probably a mistake to try patching (on top of the docker/buildkit/buildx versions where I've seen this hang) rather than just building master for the various projects. Those binaries have been in place for a couple hours, and I can see those numbers growing already. I'll leave it alone overnight, but assume I'll need to start over tomorrow (assuming the two changes are good and that I'm the one that messed up).

rrjjvv · Sep 12, 2025

increasing goroutine count (in the thousands), increasing memory usage, and increasing CPU usage even when no docker clients are active. So perhaps export hangs are just one outcome/manifestation of the underlying problem.

Send over the stacktraces for these increased goroutines.

tonistiigi · Sep 12, 2025

When I posted last night, those three hosts were approaching the 1k mark, and are now at 6k, 8k, and 11k.

This capture is with experiments still running, so there's some legit activity: goroutine-stacks-2025-09-12T125905-0600.log

I'll create new binaries (off of master plus your changes), rerun these experiments, and see if they fare any better.

rrjjvv · Sep 12, 2025

It's still happening with the new binaries; all are steadily growing, with one of them crossing 1k already (and as I'm about to submit, reaching 2k). Proactive dump before it gets huge:

goroutine-stacks-2025-09-12T155044-0600.log

I didn't see any specific docs around "updating buildkit is special, do X". When I was trying to use 28.3.3 yesterday there was something special (specific to GH actions IIRC), but master seemed much more straightforward. In a nutshell, I:

  1. pulled buildkit at master, rebased your branch
  2. pulled moby at master, added a replace github.com/moby/buildkit => ../buildkit, vendored, etc.
  3. did a docker buildx bake bin-image and pulled the result from my hosts
  4. exported the binaries, overwriting the existing ones (including the ones that didn't seem relevant)
  5. I didn't build new client-side binaries; docker and buildx remain at their previous versions (28.3.3 and v0.26.1 respectively)
  6. on two of the hosts, which are disposable, completely wiped /var/lib/{docker,containerd} prior to starting the experiments

Let me know if I overlooked something or if that's exactly what was needed. And let me know how I can best help... whether capturing periodic dumps, trying to remove docker from the equation, sprinkling some more debug logging, or whatever.

Update: It's starting to look like these leaking goroutines are a separate issue. Those last two dumps ('fixes-on-top-of-28.3.3' from yesterday and 'fixes-on-top-of-master' today) look nearly identical to each other, but very different than the original dump when image export was hanging.

I flipped one of my experiments (focused on concurrent builds) to use the docker-container driver (but still using the patched moby). First I used the patched buildkit image (your fixes). It is well behaved: the docker daemon sits around 80 goroutines, and the buildkit daemon floats around 100. Then I flipped to moby/buildkit:v0.23.2 (matching the embedded version from my original report). Same favorable behavior.

So it appears that these leaking goroutines are independent of the original issue, but it's unknown whether they're due to:

  • human error (I didn't build moby correctly, version mismatches, etc.)
  • a result of bleeding-edge changes from master that may go away on their own
  • a legit issue, presumably around how buildkit is being used rather than what it does itself

rrjjvv · Sep 12, 2025

@tonistiigi We encountered this issue again under normal usage, using the same release versions as my original report (but a different host). Because it was normal usage and was noticed soon, the dump has a reasonable size and can be attached in its entirety: goroutine-stacks-2025-09-16T095505-0600.log

It also has the same goroutines with contention (?) around the DB during export, but they are a minority (five, the exact count of so-called running buildkit jobs). The majority are:

goroutine 1277117 [select, 29675 minutes]:
google.golang.org/grpc.newClientStreamWithParams.func4()
        /root/rpmbuild/BUILD/src/engine/vendor/google.golang.org/grpc/stream.go:395 +0x8c
created by google.golang.org/grpc.newClientStreamWithParams in goroutine 1276926
        /root/rpmbuild/BUILD/src/engine/vendor/google.golang.org/grpc/stream.go:394 +0xde9

which is identical to what the patched versions above showed.

That suggests that my custom build was just fine and that this is a real issue (whether related or independent). Hopefully the full dump helps.

rrjjvv · Sep 16, 2025

Hello @tonistiigi

I can confirm that this deadlock issue is still present, even after upgrading my workers to moby/buildkit:v0.24.0.

We are experiencing the exact symptoms described in this issue: a buildx client hangs indefinitely after the build reports "DONE", causing our CI pipeline to time out. The corresponding buildkitd worker pod remains stuck in a high-CPU state, and subsequent builds are also affected until the pod is manually restarted.

Environment:

  • BuildKit version: v0.24.0
  • Setup: a set of buildkitd workers deployed in a Kubernetes cluster
  • Client connection: we connect to the workers via a remote gRPC connection; client: docker 28.4.0

Diagnostics: Following the debugging steps, I was able to capture a full goroutine dump from a stuck buildkitd pod using the pprof endpoint (/debug/pprof/goroutine?debug=2).
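
(For reference, capturing such a dump looks roughly like the following, assuming buildkitd was started with a debug address such as --debugaddr 0.0.0.0:6060, or that the port is forwarded from the pod:)

curl -s 'http://127.0.0.1:6060/debug/pprof/goroutine?debug=2' > goroutine.log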

The dump confirms that the process is in a deadlock or livelock state within the bbolt cache storage layer, which is consistent with the findings in this issue. The stack trace is filled with thousands of goroutines stuck in a state similar to this:  

goroutine 59752 [runnable]:
go.etcd.io/bbolt.(*Cursor).searchPage(0xc003066740, {0xc00270d7c0?, 0x47?, 0x50?}, 0x7f0a7d65a000?)
	/src/vendor/go.etcd.io/bbolt/cursor.go:324 +0x20c
go.etcd.io/bbolt.(*Cursor).search(0xc003066740, {0xc00270d7c0, 0x47, 0x50}, 0x4605e7?)
	/src/vendor/go.etcd.io/bbolt/cursor.go:301 +0x1b9
go.etcd.io/bbolt.(*Cursor).seek(0xc003066740, {0xc00270d7c0?, 0xc00270d7c0?, 0xc00270d7c0?})
	/src/vendor/go.etcd.io/bbolt/cursor.go:162 +0x2e
go.etcd.io/bbolt.(*Bucket).Bucket(0xc001e2c4c0, {0xc00270d7c0, 0x47, 0x1?})
	/src/vendor/go.etcd.io/bbolt/bucket.go:97 +0xb6
github.com/moby/buildkit/solver/bboltcachestorage.(*Store).Load.func1(0xc000aa0248?)
	/src/solver/bboltcachestorage/storage.go:129 +0x165
go.etcd.io/bbolt.(*DB).View(0xd68d30?, 0xc0033450b0)
	/src/vendor/go.etcd.io/bbolt/db.go:939 +0x6c
github.com/moby/buildkit/solver/bboltcachestorage.(*Store).Load(0xc000a75890, {0xc002b1a2d0, 0x47}, {0xc000fb1f80, 0x34})
	/src/solver/bboltcachestorage/storage.go:124 +0xd3

I'm aware that v0.24.0 included fixes for potential deadlocks (specifically #6158 and fixes for BoltDB panics). However, this evidence suggests that a race condition or deadlock scenario still exists in the cache management layer.  

I am attaching the full goroutine dump for your analysis. Please let me know if there is any other information I can provide to help resolve this.

goroutine.log

Thank you.

maicondssiqueira · Sep 17, 2025

Our CI has been on docker/moby 28.5.0 (which includes all the fixes/patches referenced here) for a week now, with no further occurrences of the issue and no reason to think it will reappear. (The patch release of moby/buildkit from a couple days ago didn't appear relevant to our usage so we did not rush to upgrade to that version.)

If @zachspar and @maicondssiqueira have had a chance to upgrade and feel equally confident, I'll go ahead and close this issue on behalf of all of us.

rrjjvv · Oct 11, 2025