sporadic "forwarding Ping: no such job" errors in CI
We run our GitLab CI runners on k8s using docker-in-docker + buildx. "Sometimes" docker builds fail with:
stderr:
ERROR: NotFound: forwarding Ping: no such job 7zy6tgcxqlcljaw8fsd6xs4dm
I'm opening the issue against this repository since it is where the error string appears to reside: https://github.com/moby/buildkit/blob/8a287ce400aa0b41d9aef898e71e808b1f187357/control/gateway/gateway.go#L136
Unfortunately I have not found a clear pattern or repro case, and every time I search the only other English result is https://github.com/earthly/earthly/issues/3454
Version Information (from build pod):
$ uname -a
Linux runner-qn7qyr8ex-project-40783171-concurrent-14-ksipf5ce 6.1.94-99.176.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Jun 18 14:57:56 UTC 2024 x86_64 GNU/Linux
$ docker info
Client: Docker Engine - Community
Version: 27.0.3
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.15.1
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.28.1
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 0
Server Version: 27.0.3
Storage Driver: overlayfs
driver-type: io.containerd.snapshotter.v1
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
runc version: v1.1.13-0-g58aa920
init version: de40ad0
Security Options:
seccomp
Profile: builtin
cgroupns
Kernel Version: 6.1.94-99.176.amzn2023.x86_64
Operating System: Alpine Linux v3.20 (containerized)
OSType: linux
Architecture: x86_64
CPUs: 32
Total Memory: 240.2GiB
Name: runner-qn7qyr8ex-project-40783171-concurrent-14-ksipf5ce
ID: 51762a1f-2261-4fdb-84b8-c90c1445cbc1
Docker Root Dir: /builds/docker-data-root
Debug Mode: false
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine
[DEPRECATION NOTICE]: API is accessible on http://0.0.0.0:2375/ without encryption.
Access to the remote API is equivalent to root access on the host. Refer
to the 'Docker daemon attack surface' section in the documentation for
more information: https://docs.docker.com/go/attack-surface/
In future versions this will be a hard failure preventing the daemon from starting! Learn more at: https://docs.docker.com/go/api-security/
$ neofetch
root@runner-qn7qyr8ex-project-40783171-concurrent-14-ksipf5ce
-------------------------------------------------------------
OS: Debian GNU/Linux 12 (bookworm) x86_64
Host: HVM domU 4.11.amazon
Kernel: 6.1.94-99.176.amzn2023.x86_64
Uptime: 3 mins
Packages: 456 (dpkg)
Shell: bash 5.2.15
CPU: Intel Xeon E5-2670 v2 (32) @ 2.493GHz
Memory: 4793MiB / 245942MiB
We have seen this error many times in our CI docker builds as well. It appears to occur when the hosting VM is low on system resources.
I also encountered this error when running docker compose up with more than 600 containers on a single machine.
Can we extend the timeout here?
https://github.com/moby/buildkit/blob/8a287ce400aa0b41d9aef898e71e808b1f187357/control/gateway/gateway.go#L63