
[BUG] building w/ compose bake always results in `NotFound desc = no such job xxxxxxxxx` errors

Open nocive opened this issue 6 months ago • 16 comments

Description

I don't have a minimal reproduction scenario at this point, but we've been trying to switch to Compose Bake as the build engine for our existing Docker Compose project for a while, and we always get the same type of error from a random container.

The errors look as follows:

08:43:09  docker/compose#141 exporting layers 0.4s done
08:43:12  target some-container: failed to receive status: rpc error: code = NotFound desc = no such job xxxxe8yk4gxypivxclixworqh

We've tried all Docker Compose versions since the feature became available (including 2.37.1 from a day ago) and all of them produce the same results.

Any idea what it could be or how we can help diagnose it?

Steps To Reproduce

  1. Enable bake as the build engine with COMPOSE_BAKE=true
  2. Build a previously working project with docker compose build --with-dependencies <A LIST OF CONTAINERS>
  3. See it error out on an arbitrary container with the following error: target some-container: failed to receive status: rpc error: code = NotFound desc = no such job xxxxe8yk4gxypivxclixworqh
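As a CI snippet, the steps above amount to roughly the following (service names are placeholders for our own services):

```shell
# Opt in to the bake build engine, then build as usual.
export COMPOSE_BAKE=true
docker compose build --with-dependencies service-a service-b service-c
```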

Compose Version

2.36.1
2.36.2
2.37.0
2.37.1

Docker Environment

Client: Docker Engine - Community
 Version:    28.2.1
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.24.0
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.37.1
    Path:     /home/xxx/.docker/cli-plugins/docker-compose

Server:
 Containers: 1
  Running: 1
  Paused: 0
  Stopped: 0
 Images: 139
 Server Version: 28.2.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 CDI spec directories:
  /etc/cdi
  /var/run/cdi
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 05044ec0a9a75232cad458027ca83437aae3f4da
 runc version: v1.2.5-0-g59923ef
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.10.0-31-cloud-amd64
 Operating System: Debian GNU/Linux 11 (bullseye)
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 14.27GiB
 Name: xxxx
 ID: 20e43fc1-f756-441c-aa46-ab1a663b48ae
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: sisubot
 Experimental: false
 Insecure Registries:
  ::1/128
  127.0.0.0/8
 Live Restore Enabled: false

Note: The error also occurred with older 27.x versions.

Anything else?

No response

nocive avatar Jun 13 '25 07:06 nocive

Please run docker compose build <...args> --print and capture the output into a bake.json file. Then run docker buildx bake -f bake.json to double-check that the same error occurs when running bake directly.
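Concretely, something like this (service arguments are placeholders for your own build targets):

```shell
# Capture the bake definition Compose would generate, then run it directly
# through buildx to see whether the error reproduces without Compose.
docker compose build --with-dependencies service-a service-b --print > bake.json
docker buildx bake -f bake.json
```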

ndeloof avatar Jun 13 '25 09:06 ndeloof

@ndeloof thanks for the guidance.

And yes, the same error occurs when directly running bake.

nocive avatar Jun 17 '25 07:06 nocive

With debug enabled, I get the following additional (and potentially useful) output:

10:01:28  5340 v0.24.0 /usr/libexec/docker/cli-plugins/docker-buildx buildx -D bake -f bake.json
10:01:28  github.com/docker/buildx/build.BuildWithResultHandler.func2.5.2
10:01:28  	github.com/docker/buildx/build/build.go:584
10:01:28  github.com/docker/buildx/build.BuildWithResultHandler.func2.5
10:01:28  	github.com/docker/buildx/build/build.go:590
10:01:28  golang.org/x/sync/errgroup.(*Group).Go.func1
10:01:28  	golang.org/x/[email protected]/errgroup/errgroup.go:79
10:01:28  runtime.goexit
10:01:28  	runtime/asm_amd64.s:1700
10:01:28  
10:01:28  5340 v0.24.0 /usr/libexec/docker/cli-plugins/docker-buildx buildx -D bake -f bake.json
10:01:28  github.com/moby/buildkit/client.(*Client).solve.func4
10:01:28  	github.com/moby/[email protected]/client/solve.go:328
10:01:28  golang.org/x/sync/errgroup.(*Group).Go.func1
10:01:28  	golang.org/x/[email protected]/errgroup/errgroup.go:79

nocive avatar Jun 17 '25 08:06 nocive

cc @crazy-max

ndeloof avatar Jun 17 '25 08:06 ndeloof

Can you give a minimal repro with Dockerfile(s) and Compose file?

Also what's the output of docker info

crazy-max avatar Jun 17 '25 08:06 crazy-max

Can you give a minimal repro with Dockerfile(s) and Compose file?

Not really, not at this point. We're only able to reproduce this with our project (which I sadly can't share) and in our CI environments.

Also what's the output of docker info

Client: Docker Engine - Community
 Version:    28.2.1
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.24.0
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.37.1
    Path:     /home/xxx/.docker/cli-plugins/docker-compose

Server:
 Containers: 1
  Running: 1
  Paused: 0
  Stopped: 0
 Images: 139
 Server Version: 28.2.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 CDI spec directories:
  /etc/cdi
  /var/run/cdi
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 05044ec0a9a75232cad458027ca83437aae3f4da
 runc version: v1.2.5-0-g59923ef
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.10.0-31-cloud-amd64
 Operating System: Debian GNU/Linux 11 (bullseye)
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 14.27GiB
 Name: xxxx
 ID: 20e43fc1-f756-441c-aa46-ab1a663b48ae
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: sisubot
 Experimental: false
 Insecure Registries:
  ::1/128
  127.0.0.0/8
 Live Restore Enabled: false

nocive avatar Jun 17 '25 08:06 nocive

@crazy-max can you please transfer this issue to the buildx repository?

ndeloof avatar Jul 08 '25 15:07 ndeloof

The issue persists with the latest version at the time of writing (2.39.2).

nocive avatar Aug 15 '25 12:08 nocive

The error persists in newer Docker versions; here's an updated stack trace from Docker 28.4.0:

15:22:03  ERROR: target some-app: NotFound: forwarding Ping: no such job uz44bf3z90k09tbt9clzlqpcc
15:22:03  714  /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
15:22:03  github.com/moby/buildkit/control/gateway.(*GatewayForwarder).Ping
15:22:03  	/root/build-deb/engine/vendor/github.com/moby/buildkit/control/gateway/gateway.go:136
15:22:03  github.com/moby/buildkit/frontend/gateway/pb._LLBBridge_Ping_Handler.func1
15:22:03  	/root/build-deb/engine/vendor/github.com/moby/buildkit/frontend/gateway/pb/gateway_grpc.pb.go:455
15:22:03  google.golang.org/grpc.getChainUnaryHandler.func1
15:22:03  	/root/build-deb/engine/vendor/google.golang.org/grpc/server.go:1217
15:22:03  github.com/docker/docker/api/server/router/grpc.unaryInterceptor
15:22:03  	/root/build-deb/engine/api/server/router/grpc/grpc.go:71
15:22:03  google.golang.org/grpc.NewServer.chainUnaryServerInterceptors.chainUnaryInterceptors.func1
15:22:03  	/root/build-deb/engine/vendor/google.golang.org/grpc/server.go:1208
15:22:03  github.com/moby/buildkit/frontend/gateway/pb._LLBBridge_Ping_Handler
15:22:03  	/root/build-deb/engine/vendor/github.com/moby/buildkit/frontend/gateway/pb/gateway_grpc.pb.go:457
15:22:03  google.golang.org/grpc.(*Server).processUnaryRPC
15:22:03  	/root/build-deb/engine/vendor/google.golang.org/grpc/server.go:1405
15:22:03  google.golang.org/grpc.(*Server).handleStream
15:22:03  	/root/build-deb/engine/vendor/google.golang.org/grpc/server.go:1815
15:22:03  google.golang.org/grpc.(*Server).serveStreams.func2.1
15:22:03  	/root/build-deb/engine/vendor/google.golang.org/grpc/server.go:1035
15:22:03  runtime.goexit
15:22:03  	/usr/local/go/src/runtime/asm_amd64.s:1700
15:22:03  
15:22:03  3330 v0.27.0 /usr/libexec/docker/cli-plugins/docker-buildx buildx -D bake -f bake.json
15:22:03  google.golang.org/grpc.(*ClientConn).Invoke
15:22:03  	google.golang.org/[email protected]/call.go:35
15:22:03  github.com/moby/buildkit/frontend/gateway/pb.(*lLBBridgeClient).Ping
15:22:03  	github.com/moby/[email protected]/frontend/gateway/pb/gateway_grpc.pb.go:148
15:22:03  github.com/moby/buildkit/client.(*gatewayClientForBuild).Ping
15:22:03  	github.com/moby/[email protected]/client/build.go:143
15:22:03  github.com/moby/buildkit/frontend/gateway/grpcclient.New
15:22:03  	github.com/moby/[email protected]/frontend/gateway/grpcclient/client.go:49
15:22:03  github.com/moby/buildkit/client.(*Client).Build.func2
15:22:03  	github.com/moby/[email protected]/client/build.go:51
15:22:03  github.com/moby/buildkit/client.(*Client).solve.func3
15:22:03  	github.com/moby/[email protected]/client/solve.go:305
15:22:03  golang.org/x/sync/errgroup.(*Group).add.func1
15:22:03  	golang.org/x/[email protected]/errgroup/errgroup.go:130
15:22:03  runtime.goexit
15:22:03  	runtime/asm_amd64.s:1700
15:22:03  
15:22:03  3330 v0.27.0 /usr/libexec/docker/cli-plugins/docker-buildx buildx -D bake -f bake.json
15:22:03  github.com/docker/buildx/build.BuildWithResultHandler.func1.5.2
15:22:03  	github.com/docker/buildx/build/build.go:635
15:22:03  github.com/docker/buildx/build.BuildWithResultHandler.func1.5
15:22:03  	github.com/docker/buildx/build/build.go:641
15:22:03  golang.org/x/sync/errgroup.(*Group).add.func1
15:22:03  	golang.org/x/[email protected]/errgroup/errgroup.go:130

nocive avatar Oct 24 '25 13:10 nocive

@ndeloof @thaJeztah is there something else that can be done to raise attention to this? Is there an issue on docker side where this is being tracked?

I'm concerned about the deprecation (and subsequent removal) of the legacy build engine when the bake build engine is still completely unusable for us.

nocive avatar Oct 24 '25 13:10 nocive

This is a buildx issue. @crazy-max can you please transfer it to the buildx repo?

ndeloof avatar Oct 24 '25 14:10 ndeloof

@ndeloof Don't have perms to transfer

@nocive Do you have a minimal repro? Can you also show the output of docker buildx inspect?

crazy-max avatar Oct 24 '25 14:10 crazy-max

@crazy-max I gave you RW permission

ndeloof avatar Oct 24 '25 14:10 ndeloof

@crazy-max

Sadly I don't have a minimal repro; our project is complex, with over 120 containers, which makes it incredibly difficult and time-consuming to anonymize. Additionally, the issue seems to be a race condition that only occurs under certain circumstances: I was never able to reproduce it in my local environment, but it does occur quite consistently in our CI runs, where the project is built before the actual execution of tests.

Output of docker buildx inspect as requested:

Name:          default
Driver:        docker
Last Activity: 2025-10-24 14:52:09 +0000 UTC

Nodes:
Name:             default
Endpoint:         default
Status:           running
BuildKit version: v0.25.1
Platforms:        linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/amd64/v4, linux/386
Labels:
 org.mobyproject.buildkit.worker.moby.host-gateway-ip: 172.17.0.1
GC Policy rule#0:
 All:            false
 Filters:        type==source.local,type==exec.cachemount,type==source.git.checkout
 Keep Duration:  48h0m0s
 Max Used Space: 2.059GiB
GC Policy rule#1:
 All:            false
 Keep Duration:  1440h0m0s
 Reserved Space: 14.9GiB
 Max Used Space: 117.3GiB
 Min Free Space: 29.8GiB
GC Policy rule#2:
 All:            false
 Reserved Space: 14.9GiB
 Max Used Space: 117.3GiB
 Min Free Space: 29.8GiB
GC Policy rule#3:
 All:            true
 Reserved Space: 14.9GiB
 Max Used Space: 117.3GiB
 Min Free Space: 29.8GiB

nocive avatar Oct 24 '25 15:10 nocive

FWIW I've asked AI 🙈 to theorize about a possible cause, here's what I got as a reply:

Based on the provided stack trace and the complexity of the Docker Compose project, here is a hypothesis for the root cause of the error.

Hypothesis: Premature Build Job Termination due to a Race Condition Under High Load

The error message NotFound: forwarding Ping: no such job indicates that the docker buildx client sent a keep-alive "Ping" for a specific build job, but the BuildKit daemon (part of dockerd) no longer had any record of that job. This points to a race condition where the build job was terminated on the daemon side while the client still believed it was running.

Here is the likely sequence of events:

 1. High Parallelism: The docker buildx bake command attempts to build the numerous services defined across your two docker-compose.yml files. By default, it runs many of these builds in parallel to maximize speed.
 2. Resource Contention: Your project is exceptionally large, with dozens of services, many of which have complex dependencies (depends_on) and are built from source. This high degree of parallelism likely causes extreme resource contention on the host machine, leading to CPU saturation, memory exhaustion (OOM), or I/O bottlenecks.
 3. Job Termination: Under this heavy load, a specific build job (uz44bf3z90k09tbt9clzlqpcc) is prematurely and unexpectedly terminated. This could be due to:
     * The system's Out-Of-Memory (OOM) killer terminating the process.
     * An internal timeout within the BuildKit scheduler, which gives up on an unresponsive build process.
     * A crash within the specific build container itself due to the stressful conditions.
 4. State Desynchronization (The Race Condition): The BuildKit daemon registers the termination and cleans up the job's resources, removing it from its list of active builds. However, the buildx client is not immediately notified of this termination.
 5. Failed Ping: The client, operating with a stale state, sends a routine gRPC Ping request to the daemon to check on the status of the job it thinks is still active.
 6. Error Response: The daemon, having already removed the job from its records, cannot find it and correctly responds with the NotFound: no such job error, causing the entire bake process to fail.

In essence, the massive scale and complexity of your Docker Compose project create a high-stress environment that exposes a gap between a build job failing on the daemon and the client being notified, leading to this specific error.

I think that 2. and 3. are good points and worth investigating, especially considering this only seems to be consistently reproducible for us in our CI environment.

I'm wondering if there's a way to limit parallelism in docker buildx bake to potentially mitigate the situation. I know it's possible in docker compose, but I couldn't find anything specific to bake.
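For what it's worth, BuildKit's buildkitd.toml does have a max-parallelism setting, but as far as I can tell it only applies when you create a dedicated builder with the docker-container driver, not to the default docker driver we're using. An untested sketch:

```shell
# Untested sketch: run bake against a dedicated builder with capped parallelism.
cat > buildkitd.toml <<'EOF'
[worker.oci]
  max-parallelism = 4
EOF
docker buildx create --name capped --driver docker-container \
  --config buildkitd.toml --use
docker buildx bake -f bake.json
```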

Let me know if there's anything else I can do to help pin this down. 🙏

nocive avatar Oct 29 '25 16:10 nocive

Maybe related 🤷‍♂️ https://github.com/docker/buildx/issues/359

nocive avatar Nov 12 '25 17:11 nocive