buildkit sporadic "forwarding Ping: no such job" errors in CI

We run our CI runners on k8s using docker-in-docker + buildx. Specifically, these are GitLab CI runners. Docker builds sporadically fail with:

stderr:

ERROR: NotFound: forwarding Ping: no such job 7zy6tgcxqlcljaw8fsd6xs4dm

I'm opening the issue against this repository since it is where the error string appears to reside: https://github.com/moby/buildkit/blob/8a287ce400aa0b41d9aef898e71e808b1f187357/control/gateway/gateway.go#L136

Unfortunately I have not found a clear pattern or repro case, and the only other English-language result my searches turn up is https://github.com/earthly/earthly/issues/3454

Version Information (from build pod):

$ uname -a
Linux runner-qn7qyr8ex-project-40783171-concurrent-14-ksipf5ce 6.1.94-99.176.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Jun 18 14:57:56 UTC 2024 x86_64 GNU/Linux
$ docker info
Client: Docker Engine - Community
 Version:    27.0.3
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.15.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.28.1
    Path:     /usr/libexec/docker/cli-plugins/docker-compose
Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 0
 Server Version: 27.0.3
 Storage Driver: overlayfs
  driver-type: io.containerd.snapshotter.v1
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
 runc version: v1.1.13-0-g58aa920
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.1.94-99.176.amzn2023.x86_64
 Operating System: Alpine Linux v3.20 (containerized)
 OSType: linux
 Architecture: x86_64
 CPUs: 32
 Total Memory: 240.2GiB
 Name: runner-qn7qyr8ex-project-40783171-concurrent-14-ksipf5ce
 ID: 51762a1f-2261-4fdb-84b8-c90c1445cbc1
 Docker Root Dir: /builds/docker-data-root
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine
[DEPRECATION NOTICE]: API is accessible on http://0.0.0.0:2375/ without encryption.
         Access to the remote API is equivalent to root access on the host. Refer
         to the 'Docker daemon attack surface' section in the documentation for
         more information: https://docs.docker.com/go/attack-surface/
In future versions this will be a hard failure preventing the daemon from starting! Learn more at: https://docs.docker.com/go/api-security/
$ neofetch
       _,met$$$$$gg.
    ,g$$$$$$$$$$$$$$$P.
  ,g$$P"     """Y$$.".
 ,$$P'              `$$$.
',$$P       ,ggs.     `$$b:
`d$$'     ,$P"'   .    $$$
 $$P      d$'     ,    $$P
 $$:      $$.   -    ,d$$'
 $$;      Y$b._   _,d$P'
 Y$$.    `.`"Y$$$$P"'
 `$$b      "-.__
  `Y$$
   `Y$$.
     `$$b.
       `Y$$b.
          `"Y$b._
              `"""
root@runner-qn7qyr8ex-project-40783171-concurrent-14-ksipf5ce 
------------------------------------------------------------- 
OS: Debian GNU/Linux 12 (bookworm) x86_64 
Host: HVM domU 4.11.amazon 
Kernel: 6.1.94-99.176.amzn2023.x86_64 
Uptime: 3 mins 
Packages: 456 (dpkg) 
Shell: bash 5.2.15 
CPU: Intel Xeon E5-2670 v2 (32) @ 2.493GHz 
Memory: 4793MiB / 245942MiB

Jul 18 '24 22:07 cburroughs

We have seen this error many times in our CI docker builds as well. It appears to occur when the hosting VM is low on system resources.
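
If resource pressure really is the trigger, one possible stopgap is to retry the build only when stderr contains the error string from this issue. The following is a minimal sketch, not a fix: the helper names (is_transient, run_with_retry), the attempt count, and the backoff are all assumptions made for illustration, and nothing here changes BuildKit's behavior.

```python
import subprocess
import time

# Substring of the error quoted at the top of this issue.
TRANSIENT_MARKER = "no such job"


def is_transient(stderr: str) -> bool:
    """Return True if stderr looks like the sporadic failure from this issue."""
    return TRANSIENT_MARKER in stderr


def run_with_retry(cmd, attempts=3, delay=5.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run cmd, retrying only when it fails with the transient marker.

    `runner` and `sleep` are injectable so the logic can be exercised
    without invoking docker (hypothetical convenience, not a docker API).
    """
    for attempt in range(1, attempts + 1):
        result = runner(cmd, capture_output=True, text=True)
        # Stop on success, or on any failure that is not the known flake.
        if result.returncode == 0 or not is_transient(result.stderr):
            return result
        if attempt < attempts:
            sleep(delay * attempt)  # linear backoff so the host can recover
    return result
```

In a CI job this could be called as run_with_retry(["docker", "buildx", "build", "."]). Any other build failure is returned immediately, so real errors are not masked by the retries.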

Sep 03 '24 18:09 mispencer

I also encountered this error when I ran docker compose up with more than 600 containers on a single machine. Can we extend the timeout here?

https://github.com/moby/buildkit/blob/8a287ce400aa0b41d9aef898e71e808b1f187357/control/gateway/gateway.go#L63
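
To make the question concrete, here is a toy model of a lookup that waits a bounded time for a job to be registered and fails with "no such job" otherwise. This is not BuildKit's code and has not been checked against the linked source; the class name JobRegistry, its methods, and the 3.0 second default are all assumptions made for illustration. It only shows why a slow, overloaded host could hit such a timeout and what extending it would change.

```python
import threading


class JobRegistry:
    """Toy registry: lookups wait up to `timeout` seconds for a job to appear."""

    def __init__(self, timeout=3.0):  # 3.0 is an arbitrary illustrative value
        self.timeout = timeout
        self._jobs = {}
        self._cond = threading.Condition()

    def register(self, job_id, handler):
        """Record a job and wake any lookups waiting for it."""
        with self._cond:
            self._jobs[job_id] = handler
            self._cond.notify_all()

    def lookup(self, job_id):
        """Return the handler, waiting up to self.timeout for registration."""
        with self._cond:
            found = self._cond.wait_for(
                lambda: job_id in self._jobs, timeout=self.timeout
            )
            if not found:
                # The job was not registered in time: same shape as the error above.
                raise LookupError(f"no such job {job_id}")
            return self._jobs[job_id]
```

In this model, if register runs later than timeout seconds after lookup starts (for example because the machine is busy starting hundreds of containers), lookup raises; with a larger timeout the same sequence succeeds.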

Sep 15 '24 06:09 fourdim