buildkit dies when multiple builds are happening with cache mounts
Our builds frequently fail when several builds that use cache mounts are running at the same time. Here's the error output:
#2 [internal] load .dockerignore
#2 DONE 0.0s
#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile:
#1 transferring dockerfile: 479B done
#1 DONE 0.4s
#2 [internal] load .dockerignore
#2 transferring context: 2B done
#2 DONE 0.5s
#3 resolve image config for docker.io/docker/dockerfile:experimental
#3 DONE 0.4s
#4 docker-image://docker.io/docker/dockerfile:experimental@sha256:787107d7f...
#4 CACHED
failed to solve with frontend dockerfile.v0: failed to solve with frontend gateway.v0: frontend grpc server closed unexpectedly
docker --version: Docker version 19.03.6-ce, build 369ce74
uname -a: Linux ip-___.us-west-2.compute.internal 4.14.173-137.228.amzn2.x86_64 #1 SMP Thu Mar 19 16:50:21 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Please provide a reproducer. There might be something more useful in the daemon logs.
Have multiple Dockerfiles that are built at the same time. Even something simple like this will make it happen (a rough sketch of how we run the builds in parallel is below the Dockerfile):
# syntax = docker/dockerfile:experimental
FROM node:12.16.2-alpine3.11
RUN mkdir -p /usr/src/app
WORKDIR /usr/src/app
COPY package.json yarn.lock /usr/src/app/
RUN --mount=type=cache,target=/usr/local/share/.cache/yarn \
yarn install --frozen-lockfile --link-duplicates
COPY . /usr/src/app
CMD ["./my_cmd.js"]
Here are the daemon logs from the minute when this build was running:
Apr 22 23:50:01 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:01.194475673Z" level=error msg="failed to kill runc nwa02qf8fljmk7ynadcolmpkd: runc did not terminate sucessfully: container \"nwa02qf8fljmk7ynadcolmpkd\" does not exist\n"
Apr 22 23:50:01 ip-172-31-19-202 systemd: Started Session 22472 of user root.
Apr 22 23:50:01 ip-172-31-19-202 systemd: Starting Session 22472 of user root.
Apr 22 23:50:01 ip-172-31-19-202 systemd: Started Session 22475 of user root.
Apr 22 23:50:01 ip-172-31-19-202 systemd: Starting Session 22475 of user root.
Apr 22 23:50:01 ip-172-31-19-202 systemd: Started Session 22470 of user root.
Apr 22 23:50:01 ip-172-31-19-202 systemd: Starting Session 22470 of user root.
Apr 22 23:50:01 ip-172-31-19-202 systemd: Started Session 22473 of user root.
Apr 22 23:50:01 ip-172-31-19-202 systemd: Starting Session 22473 of user root.
Apr 22 23:50:01 ip-172-31-19-202 systemd: Started Session 22474 of user root.
Apr 22 23:50:01 ip-172-31-19-202 systemd: Starting Session 22474 of user root.
Apr 22 23:50:01 ip-172-31-19-202 systemd: Started Session 22476 of user root.
Apr 22 23:50:01 ip-172-31-19-202 systemd: Starting Session 22476 of user root.
Apr 22 23:50:01 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:01.239600278Z" level=error msg="failed to kill runc t7yvihenzfcwuz050i3xk5ic6: runc did not terminate sucessfully: container \"t7yvihenzfcwuz050i3xk5ic6\" does not exist\n"
Apr 22 23:50:01 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:01.271664253Z" level=error msg="failed to kill runc nwa02qf8fljmk7ynadcolmpkd: runc did not terminate sucessfully: container \"nwa02qf8fljmk7ynadcolmpkd\" does not exist\n"
Apr 22 23:50:01 ip-172-31-19-202 systemd: Removed slice User Slice of root.
Apr 22 23:50:01 ip-172-31-19-202 systemd: Stopping User Slice of root.
Apr 22 23:50:01 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:01.326422207Z" level=error msg="failed to kill runc t7yvihenzfcwuz050i3xk5ic6: runc did not terminate sucessfully: container \"t7yvihenzfcwuz050i3xk5ic6\" does not exist\n"
Apr 22 23:50:01 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:01.355288037Z" level=error msg="failed to kill runc nwa02qf8fljmk7ynadcolmpkd: runc did not terminate sucessfully: container \"nwa02qf8fljmk7ynadcolmpkd\" does not exist\n"
Apr 22 23:50:01 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:01.411010917Z" level=error msg="failed to kill runc t7yvihenzfcwuz050i3xk5ic6: runc did not terminate sucessfully: container \"t7yvihenzfcwuz050i3xk5ic6\" does not exist\n"
Apr 22 23:50:01 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:01.438537467Z" level=error msg="failed to kill runc nwa02qf8fljmk7ynadcolmpkd: runc did not terminate sucessfully: container \"nwa02qf8fljmk7ynadcolmpkd\" does not exist\n"
Apr 22 23:50:01 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:01.501407456Z" level=error msg="failed to kill runc t7yvihenzfcwuz050i3xk5ic6: runc did not terminate sucessfully: container \"t7yvihenzfcwuz050i3xk5ic6\" does not exist\n"
Apr 22 23:50:01 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:01.512906894Z" level=error msg="failed to kill runc nwa02qf8fljmk7ynadcolmpkd: runc did not terminate sucessfully: container \"nwa02qf8fljmk7ynadcolmpkd\" does not exist\n"
Apr 22 23:50:01 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:01.585573840Z" level=error msg="failed to kill runc t7yvihenzfcwuz050i3xk5ic6: runc did not terminate sucessfully: container \"t7yvihenzfcwuz050i3xk5ic6\" does not exist\n"
Apr 22 23:50:01 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:01.601737116Z" level=error msg="failed to kill runc nwa02qf8fljmk7ynadcolmpkd: runc did not terminate sucessfully: container \"nwa02qf8fljmk7ynadcolmpkd\" does not exist\n"
Apr 22 23:50:01 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:01.674449393Z" level=error msg="failed to kill runc t7yvihenzfcwuz050i3xk5ic6: runc did not terminate sucessfully: container \"t7yvihenzfcwuz050i3xk5ic6\" does not exist\n"
Apr 22 23:50:01 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:01.698042067Z" level=error msg="failed to kill runc nwa02qf8fljmk7ynadcolmpkd: runc did not terminate sucessfully: container \"nwa02qf8fljmk7ynadcolmpkd\" does not exist\n"
Apr 22 23:50:01 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:01.754967464Z" level=error msg="failed to kill runc t7yvihenzfcwuz050i3xk5ic6: runc did not terminate sucessfully: container \"t7yvihenzfcwuz050i3xk5ic6\" does not exist\n"
Apr 22 23:50:01 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:01.765533013Z" level=error msg="failed to kill runc nwa02qf8fljmk7ynadcolmpkd: runc did not terminate sucessfully: container \"nwa02qf8fljmk7ynadcolmpkd\" does not exist\n"
Apr 22 23:50:01 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:01.841606813Z" level=error msg="failed to kill runc t7yvihenzfcwuz050i3xk5ic6: runc did not terminate sucessfully: container \"t7yvihenzfcwuz050i3xk5ic6\" does not exist\n"
Apr 22 23:50:01 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:01.861328732Z" level=error msg="failed to kill runc nwa02qf8fljmk7ynadcolmpkd: runc did not terminate sucessfully: container \"nwa02qf8fljmk7ynadcolmpkd\" does not exist\n"
Apr 22 23:50:01 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:01.923336874Z" level=error msg="failed to kill runc t7yvihenzfcwuz050i3xk5ic6: runc did not terminate sucessfully: container \"t7yvihenzfcwuz050i3xk5ic6\" does not exist\n"
Apr 22 23:50:01 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:01.937540697Z" level=error msg="failed to kill runc nwa02qf8fljmk7ynadcolmpkd: runc did not terminate sucessfully: container \"nwa02qf8fljmk7ynadcolmpkd\" does not exist\n"
Apr 22 23:50:02 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:02.011004176Z" level=error msg="failed to kill runc t7yvihenzfcwuz050i3xk5ic6: runc did not terminate sucessfully: container \"t7yvihenzfcwuz050i3xk5ic6\" does not exist\n"
Apr 22 23:50:02 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:02.030430398Z" level=error msg="failed to kill runc nwa02qf8fljmk7ynadcolmpkd: runc did not terminate sucessfully: container \"nwa02qf8fljmk7ynadcolmpkd\" does not exist\n"
Apr 22 23:50:02 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:02.101620492Z" level=error msg="failed to kill runc t7yvihenzfcwuz050i3xk5ic6: runc did not terminate sucessfully: container \"t7yvihenzfcwuz050i3xk5ic6\" does not exist\n"
Apr 22 23:50:02 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:02.117109639Z" level=error msg="failed to kill runc nwa02qf8fljmk7ynadcolmpkd: runc did not terminate sucessfully: container \"nwa02qf8fljmk7ynadcolmpkd\" does not exist\n"
Apr 22 23:50:02 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:02.521561749Z" level=warning msg="grpc: addrConn.createTransport failed to connect to { 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing only one connection allowed\". Reconnecting..." module=grpc
Apr 22 23:50:02 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:02.579781878Z" level=info msg="parsed scheme: \"\"" module=grpc
Apr 22 23:50:02 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:02.579841212Z" level=info msg="scheme \"\" not registered, fallback to default scheme" module=grpc
Apr 22 23:50:02 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:02.579876699Z" level=info msg="ccResolverWrapper: sending update to cc: {[{ 0 <nil>}] <nil>}" module=grpc
Apr 22 23:50:02 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:02.579888754Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Apr 22 23:50:13 ip-172-31-19-202 dhclient[3431]: XMT: Solicit on eth0, interval 123080ms.
Apr 22 23:50:42 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:42.046201787Z" level=warning msg="grpc: addrConn.createTransport failed to connect to { 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing only one connection allowed\". Reconnecting..." module=grpc
Apr 22 23:50:42 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:42.055162882Z" level=warning msg="grpc: addrConn.createTransport failed to connect to { 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing only one connection allowed\". Reconnecting..." module=grpc
Apr 22 23:50:44 ip-172-31-19-202 dockerd: time="2020-04-22T23:50:44.597299072Z" level=warning msg="grpc: addrConn.createTransport failed to connect to { 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing only one connection allowed\". Reconnecting..." module=grpc
I haven't tested this thoroughly, but I think I've found a workaround. If you pin the dockerfile frontend image you use to a specific sha256 digest, it won't reach out to the registry to check for a new version. For instance, when I updated ours to:
#syntax=docker/dockerfile:1.2@sha256:e2a8561e419ab1ba6b2fe6cbdf49fd92b95912df1cf7d313c3e2230a333fdbcc
we stopped getting the error.
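For anyone else who wants to pin it, this is roughly how the digest for a tag can be looked up (the tag here is just an example, and it assumes a reasonably recent buildx):
# Print the manifest digest of the frontend image you want to pin (example tag)
docker buildx imagetools inspect docker/dockerfile:1.2
# Then reference that digest in the Dockerfile, e.g.:
#   # syntax=docker/dockerfile:1.2@sha256:<digest from the output above>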
We have been hitting this too, quite frequently.
One thing I notice in our case is the 10-second gap between these log messages:
time="2022-07-07T18:51:23Z" level=debug msg="> creating lo7ga74jp5tibb3gpinkpjavf [/bin/dockerfile-frontend]" spanID=01dffb47aec600f9 traceID=e978a44f1e7576e83d9c6181cc26a77c
...
time="2022-07-07T18:51:33Z" level=error msg="failed to kill runc lo7ga74jp5tibb3gpinkpjavf: runc did not terminate successfully: exit status 1: container does not exist\n" spanID=01dffb47aec600f9 traceID=e978a44f1e7576e83d9c6181cc26a77c
That makes me suspect we are hitting this 10-second timeout waiting for container start: https://github.com/moby/buildkit/blob/d21254e7f74b49a84c3fbe1b13260cb23954cf92/executor/runcexecutor/executor.go#L491
Then I suspect this is where the error gets converted to a context-cancelled error: https://github.com/moby/buildkit/blob/d21254e7f74b49a84c3fbe1b13260cb23954cf92/executor/runcexecutor/executor.go#L353-L355
...which later gets replaced by a "frontend grpc server closed unexpectedly" error: https://github.com/moby/buildkit/blob/d21254e7f74b49a84c3fbe1b13260cb23954cf92/frontend/gateway/gateway.go#L280-L282
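(For what it's worth, the two excerpts above came from grepping a saved copy of the daemon debug log; the file name below is just where I happened to dump it:)
# Pull out only the frontend-creation and runc-kill lines to make the ~10s gap visible
grep -E 'creating .*dockerfile-frontend|failed to kill runc' buildkitd-debug.log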
Here is the complete log:
time="2022-07-07T18:51:23Z" level=debug msg="session started" spanID=0579e4b20ad1322c traceID=e978a44f1e7576e83d9c6181cc26a77c
time="2022-07-07T18:51:23Z" level=debug msg=resolving host=registry-1.docker.io spanID=01dffb47aec600f9 traceID=e978a44f1e7576e83d9c6181cc26a77c
time="2022-07-07T18:51:23Z" level=debug msg="do request" host=registry-1.docker.io request.header.accept="application/vnd.docker.distribution.manifest.v2+json, application/vnd.docker.distribution.manifest.list.v2+json, application/vnd.oci.image.manifest.v1+json, application/vnd.oci.image.index.v1+json, */*" request.header.user-agent=buildkit/0.0.0+unknown request.method=HEAD spanID=01dffb47aec600f9 traceID=e978a44f1e7576e83d9c6181cc26a77c url="https://registry-1.docker.io/v2/docker/dockerfile/manifests/latest"
time="2022-07-07T18:51:23Z" level=debug msg="fetch response received" host=registry-1.docker.io response.header.content-length=2004 response.header.content-type=application/vnd.docker.distribution.manifest.list.v2+json response.header.date="Thu, 07 Jul 2022 18:51:23 GMT" response.header.docker-content-digest="sha256:443aab4ca21183e069e7d8b2dc68006594f40bddf1b15bbd83f5137bd93e80e2" response.header.docker-distribution-api-version=registry/2.0 response.header.docker-ratelimit-source=34.216.67.234 response.header.etag="\"sha256:443aab4ca21183e069e7d8b2dc68006594f40bddf1b15bbd83f5137bd93e80e2\"" response.header.strict-transport-security="max-age=31536000" response.status="200 OK" spanID=01dffb47aec600f9 traceID=e978a44f1e7576e83d9c6181cc26a77c url="https://registry-1.docker.io/v2/docker/dockerfile/manifests/latest"
time="2022-07-07T18:51:23Z" level=debug msg=resolved desc.digest="sha256:443aab4ca21183e069e7d8b2dc68006594f40bddf1b15bbd83f5137bd93e80e2" host=registry-1.docker.io spanID=01dffb47aec600f9 traceID=e978a44f1e7576e83d9c6181cc26a77c
time="2022-07-07T18:51:23Z" level=debug msg=fetch digest="sha256:443aab4ca21183e069e7d8b2dc68006594f40bddf1b15bbd83f5137bd93e80e2" mediatype=application/vnd.docker.distribution.manifest.list.v2+json size=2004 spanID=01dffb47aec600f9 traceID=e978a44f1e7576e83d9c6181cc26a77c
time="2022-07-07T18:51:23Z" level=debug msg=fetch digest="sha256:24d064a369eda7bc7839b6c1c227eac7212d06ca09a8235a4bed467f8acf180d" mediatype=application/vnd.docker.distribution.manifest.v2+json size=528 spanID=01dffb47aec600f9 traceID=e978a44f1e7576e83d9c6181cc26a77c
time="2022-07-07T18:51:23Z" level=debug msg=fetch digest="sha256:84495a15555de1a8f4738f58268fa8949547068198f8d0fa2a3e3a693d7f923f" mediatype=application/vnd.docker.container.image.v1+json size=2367 spanID=01dffb47aec600f9 traceID=e978a44f1e7576e83d9c6181cc26a77c
time="2022-07-07T18:51:23Z" level=debug msg=fetch digest="sha256:443aab4ca21183e069e7d8b2dc68006594f40bddf1b15bbd83f5137bd93e80e2" mediatype=application/vnd.docker.distribution.manifest.list.v2+json size=2004 spanID=01dffb47aec600f9 traceID=e978a44f1e7576e83d9c6181cc26a77c
time="2022-07-07T18:51:23Z" level=debug msg=fetch digest="sha256:24d064a369eda7bc7839b6c1c227eac7212d06ca09a8235a4bed467f8acf180d" mediatype=application/vnd.docker.distribution.manifest.v2+json size=528 spanID=01dffb47aec600f9 traceID=e978a44f1e7576e83d9c6181cc26a77c
time="2022-07-07T18:51:23Z" level=debug msg=fetch digest="sha256:84495a15555de1a8f4738f58268fa8949547068198f8d0fa2a3e3a693d7f923f" mediatype=application/vnd.docker.container.image.v1+json size=2367 spanID=01dffb47aec600f9 traceID=e978a44f1e7576e83d9c6181cc26a77c
time="2022-07-07T18:51:23Z" level=debug msg="> creating lo7ga74jp5tibb3gpinkpjavf [/bin/dockerfile-frontend]" spanID=01dffb47aec600f9 traceID=e978a44f1e7576e83d9c6181cc26a77c
time="2022-07-07T18:51:23Z" level=debug msg="load cache for docker-image://docker.io/docker/dockerfile:latest@sha256:443aab4ca21183e069e7d8b2dc68006594f40bddf1b15bbd83f5137bd93e80e2 with 0x0qarovme3euq7zb0eqrliix::pkplr610nds1qlv63rrbv7qvh"
time="2022-07-07T18:51:23Z" level=debug msg="serving grpc connection" spanID=01dffb47aec600f9 traceID=e978a44f1e7576e83d9c6181cc26a77c
time="2022-07-07T18:51:24Z" level=debug msg="session finished: <nil>" spanID=4bacc2896edd10f0 traceID=b9221c6cf0781e75c29b1a6565e5f9d6
time="2022-07-07T18:51:27Z" level=debug msg="session started" spanID=9cd035dd0c96cffa traceID=eb08f4b216042ef9ffe65be563f1b1f8
time="2022-07-07T18:51:33Z" level=error msg="failed to kill runc lo7ga74jp5tibb3gpinkpjavf: runc did not terminate successfully: exit status 1: container does not exist\n" spanID=01dffb47aec600f9 traceID=e978a44f1e7576e83d9c6181cc26a77c
... above message repeats many times ...
I'm not sure whether this timeout is too short for some reason, or whether there is a registry-related issue (as some comments above suggest) that somehow keeps the container from starting within 10 seconds.
In my case this seems to have ultimately been caused by what looks like a kernel bug involving hanging mount syscalls. I was able to reproduce it with standalone runc while the system was under load. Sometimes, bind-mount syscalls that runc made to set up the container filesystem would hang for around two minutes:
[pid 20935] 00:19:02.046500 mount("/mnt/buildkit/runc-overlayfs/executor/resolv.conf", "/proc/self/fd/6", 0xc0000e5485, MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_REMOUNT|MS_BIND|MS_REC, NULL <unfinished ...>
...
[pid 20935] 00:21:11.272553 <... mount resumed> ) = 0
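(I captured traces like the one above with something along these lines; the bundle path and container name are only illustrative:)
# Trace just the mount syscalls of runc and its children, with microsecond timestamps
strace -f -tt -e trace=mount -o /tmp/runc-mounts.log \
  runc run --bundle /path/to/test-bundle test-container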
dmesg would show a "blocked task" entry corresponding to the hung mount:
[69359.844537] INFO: task runc:[2:INIT]:28744 blocked for more than 120 seconds.
[69359.848149] Tainted: P OE 5.4.0-1080-aws #87~18.04.1-Ubuntu
[69359.851565] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[69359.855136] runc:[2:INIT] D 0 28744 28735 0x80004003
[69359.855138] Call Trace:
[69359.855146] __schedule+0x292/0x720
[69359.855148] schedule+0x33/0xa0
[69359.855151] rwsem_down_write_slowpath+0x23d/0x4c0
[69359.855153] down_write+0x3d/0x40
[69359.855154] ? down_write+0x3d/0x40
[69359.855158] do_mount+0x475/0x9f0
[69359.855159] ksys_mount+0x98/0xe0
[69359.855161] __x64_sys_mount+0x25/0x30
[69359.855164] do_syscall_64+0x57/0x190
[69359.855165] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[69359.855172] RIP: 0033:0x4b234a
[69359.855174] Code: 89 c8 48 89 f1 e8 d6 b8 fb ff 48 8b 8c 24 e8 00 00 00 48 8b 89 f8 00 00 00 48 8b 84 24 48 02 00 00 ff d1 48 89 84 24 c0 02 00 <00> 48 89 9c 24 68 01 00 00 48 8b 8c 24 e8 02 00 00 48 8b 89 f8 00
[69359.855175] RSP: 002b:000000c000203240 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
[69359.855176] RAX: ffffffffffffffda RBX: 000000c00002e000 RCX: 00000000004b234a
[69359.855176] RDX: 000000c0000e5485 RSI: 000000c0000e5510 RDI: 000000c000150280
[69359.855177] RBP: 000000c0002032e0 R08: 0000000000000000 R09: 0000000000000000
[69359.855178] R10: 000000000000502f R11: 0000000000000202 R12: 000000c0000e5480
[69359.855178] R13: 0000000000000000 R14: 000000c0000021a0 R15: ffffffffffffffff
runc being blocked like this trips the 10-second timeout on frontend startup (and also a hardcoded 10-second timeout in the HTTP/2 server, waiting for the preface from the frontend). It triggers a loop that keeps trying to send SIGKILL to runc and creates a lot of noise in the log.
I've since upgraded the kernel from 5.4.0 to 5.13.0 and have not seen this reoccur yet, though it's only been a few days.
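If anyone else wants to check for the same hang, blocked-task stack traces can also be dumped on demand (this assumes you can enable the magic SysRq key on the host):
# Enable SysRq, then dump tasks stuck in uninterruptible (D) sleep to the kernel log
sysctl -w kernel.sysrq=1
echo w > /proc/sysrq-trigger
dmesg | tail -n 100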
@aaronlehmann Hi! I also constantly hit this problem when running many parallel docker build commands. Could you please confirm whether updating to the 5.13.0 kernel really helps?
@Dzenly: Upgrading to 5.13.0 solved the problem for me. I also switched from XFS to ext4 at the same time, so that could be related, but I suspect it was the kernel upgrade that solved the problem.
Confirmed. Since we upgraded our Ubuntu to 22.04 (kernel 5.15) a month ago, we have not faced this problem.
I started getting this error on Ubuntu 22.04 (kernel 5.15) after adding the --link flag to the COPY command in the Dockerfile (COPY --link ...). It happens only on some of the builds.
Client:
 Context: default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version: v0.10.2
    Path: /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version: v2.16.0
    Path: /usr/libexec/docker/cli-plugins/docker-compose
  scan: Docker Scan (Docker Inc.)
    Version: v0.23.0
    Path: /usr/libexec/docker/cli-plugins/docker-scan
@sergeykad I don't think this is related. Open a new issue if you have a reproducer.