[BUG] building w/ compose bake always results in `NotFound desc = no such job xxxxxxxxx` errors
Description
I don't have a minimal reproduction scenario at this point, but we've been trying for a while to switch to compose bake as the build engine for our existing docker compose project, and we always get the same type of error coming from a random container.
The errors look as follows:
08:43:09 #141 exporting layers 0.4s done
08:43:12 target some-container: failed to receive status: rpc error: code = NotFound desc = no such job xxxxe8yk4gxypivxclixworqh
We've tried all docker compose versions since the feature became available (including 2.37.1 from a day ago), and all of them produce the same result.
Any idea what it could be or how we can help diagnose it?
Steps To Reproduce
- enable bake as the build engine w/ COMPOSE_BAKE=true
- build a previously working project w/ docker compose build --with-dependencies <A LIST OF CONTAINERS>
- see it error out on an arbitrary container with the following error:
target some-container: failed to receive status: rpc error: code = NotFound desc = no such job xxxxe8yk4gxypivxclixworqh
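Put together, the commands we run look roughly like this (service names are placeholders for our real services):
```bash
# enable bake as the build engine for compose
export COMPOSE_BAKE=true

# build the listed services plus their dependencies;
# an arbitrary one of them fails with the error above
docker compose build --with-dependencies service-a service-b
```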
Compose Version
2.36.1
2.36.2
2.37.0
2.37.1
Docker Environment
Client: Docker Engine - Community
Version: 28.2.1
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.24.0
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.37.1
Path: /home/xxx/.docker/cli-plugins/docker-compose
Server:
Containers: 1
Running: 1
Paused: 0
Stopped: 0
Images: 139
Server Version: 28.2.1
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
CDI spec directories:
/etc/cdi
/var/run/cdi
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 05044ec0a9a75232cad458027ca83437aae3f4da
runc version: v1.2.5-0-g59923ef
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
cgroupns
Kernel Version: 5.10.0-31-cloud-amd64
Operating System: Debian GNU/Linux 11 (bullseye)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 14.27GiB
Name: xxxx
ID: 20e43fc1-f756-441c-aa46-ab1a663b48ae
Docker Root Dir: /var/lib/docker
Debug Mode: false
Username: sisubot
Experimental: false
Insecure Registries:
::1/128
127.0.0.0/8
Live Restore Enabled: false
Note: Error also occurred with older 27.x versions.
Anything else?
No response
Please run docker compose build <...args> --print and capture the output into a bake.json file.
Then run docker buildx bake -f bake.json to double-check that the same error occurs when running bake directly.
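Something like this (assuming --print writes the bake definition to stdout):
```bash
# dump the bake definition compose would use, without building
docker compose build <...args> --print > bake.json

# run bake directly against that definition
docker buildx bake -f bake.json
```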
@ndeloof thanks for the guidance.
And yes, the same error occurs when directly running bake.
With debug enabled, I get the following additional (and potentially useful) output:
10:01:28 5340 v0.24.0 /usr/libexec/docker/cli-plugins/docker-buildx buildx -D bake -f bake.json
10:01:28 github.com/docker/buildx/build.BuildWithResultHandler.func2.5.2
10:01:28 github.com/docker/buildx/build/build.go:584
10:01:28 github.com/docker/buildx/build.BuildWithResultHandler.func2.5
10:01:28 github.com/docker/buildx/build/build.go:590
10:01:28 golang.org/x/sync/errgroup.(*Group).Go.func1
10:01:28 golang.org/x/[email protected]/errgroup/errgroup.go:79
10:01:28 runtime.goexit
10:01:28 runtime/asm_amd64.s:1700
10:01:28
10:01:28 5340 v0.24.0 /usr/libexec/docker/cli-plugins/docker-buildx buildx -D bake -f bake.json
10:01:28 github.com/moby/buildkit/client.(*Client).solve.func4
10:01:28 github.com/moby/[email protected]/client/solve.go:328
10:01:28 golang.org/x/sync/errgroup.(*Group).Go.func1
10:01:28 golang.org/x/[email protected]/errgroup/errgroup.go:79
cc @crazy-max
Can you give a minimal repro with Dockerfile(s) and Compose file?
Also, what's the output of docker info?
Can you give a minimal repro with Dockerfile(s) and Compose file?
Not really, not at this point. We're only able to reproduce this with our project (which I sadly can't share) and in our CI environments.
Also, what's the output of docker info?
Client: Docker Engine - Community
Version: 28.2.1
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.24.0
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.37.1
Path: /home/xxx/.docker/cli-plugins/docker-compose
Server:
Containers: 1
Running: 1
Paused: 0
Stopped: 0
Images: 139
Server Version: 28.2.1
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
CDI spec directories:
/etc/cdi
/var/run/cdi
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 05044ec0a9a75232cad458027ca83437aae3f4da
runc version: v1.2.5-0-g59923ef
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
cgroupns
Kernel Version: 5.10.0-31-cloud-amd64
Operating System: Debian GNU/Linux 11 (bullseye)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 14.27GiB
Name: xxxx
ID: 20e43fc1-f756-441c-aa46-ab1a663b48ae
Docker Root Dir: /var/lib/docker
Debug Mode: false
Username: sisubot
Experimental: false
Insecure Registries:
::1/128
127.0.0.0/8
Live Restore Enabled: false
@crazy-max can you please transfer this issue to the buildx repository?
The issue persists with the latest version at the time of writing (2.39.2).
The error persists with newer docker versions; here's an updated stack trace from docker 28.4.0:
15:22:03 ERROR: target some-app: NotFound: forwarding Ping: no such job uz44bf3z90k09tbt9clzlqpcc
15:22:03 714 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
15:22:03 github.com/moby/buildkit/control/gateway.(*GatewayForwarder).Ping
15:22:03 /root/build-deb/engine/vendor/github.com/moby/buildkit/control/gateway/gateway.go:136
15:22:03 github.com/moby/buildkit/frontend/gateway/pb._LLBBridge_Ping_Handler.func1
15:22:03 /root/build-deb/engine/vendor/github.com/moby/buildkit/frontend/gateway/pb/gateway_grpc.pb.go:455
15:22:03 google.golang.org/grpc.getChainUnaryHandler.func1
15:22:03 /root/build-deb/engine/vendor/google.golang.org/grpc/server.go:1217
15:22:03 github.com/docker/docker/api/server/router/grpc.unaryInterceptor
15:22:03 /root/build-deb/engine/api/server/router/grpc/grpc.go:71
15:22:03 google.golang.org/grpc.NewServer.chainUnaryServerInterceptors.chainUnaryInterceptors.func1
15:22:03 /root/build-deb/engine/vendor/google.golang.org/grpc/server.go:1208
15:22:03 github.com/moby/buildkit/frontend/gateway/pb._LLBBridge_Ping_Handler
15:22:03 /root/build-deb/engine/vendor/github.com/moby/buildkit/frontend/gateway/pb/gateway_grpc.pb.go:457
15:22:03 google.golang.org/grpc.(*Server).processUnaryRPC
15:22:03 /root/build-deb/engine/vendor/google.golang.org/grpc/server.go:1405
15:22:03 google.golang.org/grpc.(*Server).handleStream
15:22:03 /root/build-deb/engine/vendor/google.golang.org/grpc/server.go:1815
15:22:03 google.golang.org/grpc.(*Server).serveStreams.func2.1
15:22:03 /root/build-deb/engine/vendor/google.golang.org/grpc/server.go:1035
15:22:03 runtime.goexit
15:22:03 /usr/local/go/src/runtime/asm_amd64.s:1700
15:22:03
15:22:03 3330 v0.27.0 /usr/libexec/docker/cli-plugins/docker-buildx buildx -D bake -f bake.json
15:22:03 google.golang.org/grpc.(*ClientConn).Invoke
15:22:03 google.golang.org/[email protected]/call.go:35
15:22:03 github.com/moby/buildkit/frontend/gateway/pb.(*lLBBridgeClient).Ping
15:22:03 github.com/moby/[email protected]/frontend/gateway/pb/gateway_grpc.pb.go:148
15:22:03 github.com/moby/buildkit/client.(*gatewayClientForBuild).Ping
15:22:03 github.com/moby/[email protected]/client/build.go:143
15:22:03 github.com/moby/buildkit/frontend/gateway/grpcclient.New
15:22:03 github.com/moby/[email protected]/frontend/gateway/grpcclient/client.go:49
15:22:03 github.com/moby/buildkit/client.(*Client).Build.func2
15:22:03 github.com/moby/[email protected]/client/build.go:51
15:22:03 github.com/moby/buildkit/client.(*Client).solve.func3
15:22:03 github.com/moby/[email protected]/client/solve.go:305
15:22:03 golang.org/x/sync/errgroup.(*Group).add.func1
15:22:03 golang.org/x/[email protected]/errgroup/errgroup.go:130
15:22:03 runtime.goexit
15:22:03 runtime/asm_amd64.s:1700
15:22:03
15:22:03 3330 v0.27.0 /usr/libexec/docker/cli-plugins/docker-buildx buildx -D bake -f bake.json
15:22:03 github.com/docker/buildx/build.BuildWithResultHandler.func1.5.2
15:22:03 github.com/docker/buildx/build/build.go:635
15:22:03 github.com/docker/buildx/build.BuildWithResultHandler.func1.5
15:22:03 github.com/docker/buildx/build/build.go:641
15:22:03 golang.org/x/sync/errgroup.(*Group).add.func1
15:22:03 golang.org/x/[email protected]/errgroup/errgroup.go:130
@ndeloof @thaJeztah is there something else that can be done to raise attention to this? Is there an issue on the docker side where this is being tracked?
I'm concerned about the deprecation (and subsequent removal) of the legacy build engine while the bake build engine is still completely unusable for us.
This is a buildx issue. @crazy-max can you please transfer this to the buildx repo?
@ndeloof Don't have perms to transfer
@nocive Do you have a minimal repro? Can you also show the output of docker buildx inspect?
@crazy-max I gave you RW permission
@crazy-max
Sadly, I don't have a minimal repro; our project is a complex one with over 120 containers, which makes it incredibly difficult and time-consuming to anonymize. Additionally, the issue seems to be a race condition that only occurs under certain circumstances: I was never able to reproduce it in my local environment, but it does occur quite consistently in our CI runs, where the project is built before the tests are executed.
Output of docker buildx inspect as requested:
Name: default
Driver: docker
Last Activity: 2025-10-24 14:52:09 +0000 UTC
Nodes:
Name: default
Endpoint: default
Status: running
BuildKit version: v0.25.1
Platforms: linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/amd64/v4, linux/386
Labels:
org.mobyproject.buildkit.worker.moby.host-gateway-ip: 172.17.0.1
GC Policy rule#0:
All: false
Filters: type==source.local,type==exec.cachemount,type==source.git.checkout
Keep Duration: 48h0m0s
Max Used Space: 2.059GiB
GC Policy rule#1:
All: false
Keep Duration: 1440h0m0s
Reserved Space: 14.9GiB
Max Used Space: 117.3GiB
Min Free Space: 29.8GiB
GC Policy rule#2:
All: false
Reserved Space: 14.9GiB
Max Used Space: 117.3GiB
Min Free Space: 29.8GiB
GC Policy rule#3:
All: true
Reserved Space: 14.9GiB
Max Used Space: 117.3GiB
Min Free Space: 29.8GiB
FWIW I've asked AI 🙈 to theorize about a possible cause; here's what I got as a reply:
Based on the provided stack trace and the complexity of the Docker Compose project, here is a hypothesis for the root cause of the error.
Hypothesis: Premature Build Job Termination due to a Race Condition Under High Load
The error message NotFound: forwarding Ping: no such job indicates that the docker buildx client sent a keep-alive "Ping" for a specific build job, but the BuildKit daemon (part of dockerd) no longer had any record of that job. This points to a race condition where the build job was terminated on the daemon side while the client still believed it was running.
Here is the likely sequence of events:
1. High Parallelism: The docker buildx bake command attempts to build the numerous services defined across your two docker-compose.yml files. By default, it runs many of these builds in parallel to maximize speed.
2. Resource Contention: Your project is exceptionally large, with dozens of services, many of which have complex dependencies (depends_on) and are built from source. This high degree of parallelism likely causes extreme resource contention on the host machine, leading to CPU saturation, memory exhaustion (OOM), or I/O bottlenecks.
3. Job Termination: Under this heavy load, a specific build job (uz44bf3z90k09tbt9clzlqpcc) is prematurely and unexpectedly terminated. This could be due to:
   * The system's Out-Of-Memory (OOM) killer terminating the process.
   * An internal timeout within the BuildKit scheduler, which gives up on an unresponsive build process.
   * A crash within the specific build container itself due to the stressful conditions.
4. State Desynchronization (The Race Condition): The BuildKit daemon registers the termination and cleans up the job's resources, removing it from its list of active builds. However, the buildx client is not immediately notified of this termination.
5. Failed Ping: The client, operating with a stale state, sends a routine gRPC Ping request to the daemon to check on the status of the job it thinks is still active.
6. Error Response: The daemon, having already removed the job from its records, cannot find it and correctly responds with the NotFound: no such job error, causing the entire bake process to fail.
In essence, the massive scale and complexity of your Docker Compose project create a high-stress environment that exposes a gap between a build job failing on the daemon and the client being notified, leading to this specific error.
I think points 2 and 3 are good and worth investigating, especially considering this only seems to be consistently reproducible for us in our CI environment.
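To check the OOM angle on our CI hosts, the plan is to look for kernel OOM-killer activity around the failing builds, roughly along these lines:
```bash
# scan the kernel log for OOM-killer activity around the time of the failure
journalctl -k --since "2 hours ago" | grep -iE 'out of memory|oom-kill'

# snapshot container resource usage while the build is running
docker stats --no-stream
```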
I'm wondering if there's a way to limit parallelism in docker bake to potentially mitigate the situation. I know it's possible to do it in docker compose, but I couldn't find anything specific to bake.
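The closest knobs I could find are below, though I'm honestly not sure either of them caps bake's parallelism, so treat this as an assumption on my side:
```bash
# compose-side: COMPOSE_PARALLEL_LIMIT caps compose's own concurrency
# (unclear whether it also applies when bake does the building)
COMPOSE_PARALLEL_LIMIT=4 COMPOSE_BAKE=true \
  docker compose build --with-dependencies service-a service-b

# buildkit-side: a docker-container builder can be given a buildkitd.toml
# with max-parallelism (not configurable with the default docker driver)
cat > buildkitd.toml <<'EOF'
[worker.oci]
  max-parallelism = 4
EOF
docker buildx create --name capped --driver docker-container \
  --buildkitd-config ./buildkitd.toml --use
```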
Let me know if there's anything else I can do to help pin this down. 🙏
Maybe related 🤷‍♂️ https://github.com/docker/buildx/issues/359