buildkit fails with "mount already exists with name"
With a similar Dockerfile to https://github.com/moby/buildkit/issues/1459#issuecomment-618690727:
# syntax = docker/dockerfile:1.1.5-experimental
FROM ocurrent/opam@sha256:5bb6268b1fb7a8dfceffd92383cb4c72d0c3f995231ff916fa6aaf0b29bff4df
COPY --chown=opam . /src/
RUN git -C /src checkout -b opam-ci__cibranch a1f2c8222b2a755968b2212471fd1732bdaf6f3d && \
git -C /src merge 49560cde56610f9fd03c0af5ef852e9b873ecf6c && \
opam repository set-url --strict default file:///src
RUN opam pin add -k version -yn conf-nauty.1.1 1.1
RUN --mount=type=cache,target=/home/opam/.opam/download-cache,uid=1000 opam depext -uivy conf-nauty.1.1
our docker build sometimes fails with:
#1 [internal] load .dockerignore
#1 sha256:95b7a97096dc2afd2d43c2337088551dcb79f6b7cd7780992776af1afaca6f81
#1 transferring context:
#1 transferring context: 2B done
#1 DONE 0.6s
#2 [internal] load build definition from Dockerfile
#2 sha256:86eb8a668e7aabc92f8c971f945dc6e33bb4dafb625349e0e4b3e99e34f50a19
#2 transferring dockerfile: 572B done
#2 DONE 0.9s
#3 resolve image config for docker.io/docker/dockerfile:1.1.5-experimental
#3 sha256:4d904ca4dadcfe285b7e07c14c496e2212a754f0c4fd12946baa040350cae4f0
#3 DONE 0.9s
#4 docker-image://docker.io/docker/dockerfile:1.1.5-experimental@sha256:787107d7f7953cb2d95ee81cc7332d79aca9328129318e08fc7ffbd252a20656
#4 sha256:d2736ec635330516879555dba7aa00a3afb686946df7f09d957dc33a05f95b62
#4 CACHED
#5 [internal] load metadata for docker.io/ocurrent/opam@sha256:5bb6268b1fb7a8dfceffd92383cb4c72d0c3f995231ff916fa6aaf0b29bff4df
#5 sha256:c3d18c31ea0787222f2fb8d1ea33b95119a8b415e570a2d026f6aea36d6ff8b9
#5 DONE 0.0s
#11 [internal] settings cache mount permissions
#11 sha256:98e142695adfe60b7bd7795f3db2122cdc761dcae59a13eab856140b08987ce3
#11 DONE 0.0s
#6 [stage-0 1/5] FROM docker.io/ocurrent/opam@sha256:5bb6268b1fb7a8dfceffd92383cb4c72d0c3f995231ff916fa6aaf0b29bff4df
#6 sha256:16da8708e67b40ea0ed8ecd0041c80825c6c6966930c7a99e397d1a2a128b466
#6 DONE 0.0s
#7 [internal] load build context
#7 sha256:cf9ebfd4c665038fcc01c0f0427a7e3c4451e7b6dec9591b287b850ddb7a0761
#7 transferring context: 104.60MB 2.6s
#7 transferring context: 111.47MB 3.5s done
#7 DONE 4.6s
#6 [stage-0 1/5] FROM docker.io/ocurrent/opam@sha256:5bb6268b1fb7a8dfceffd92383cb4c72d0c3f995231ff916fa6aaf0b29bff4df
#6 sha256:16da8708e67b40ea0ed8ecd0041c80825c6c6966930c7a99e397d1a2a128b466
#6 CACHED
#11 [internal] settings cache mount permissions
#11 sha256:98e142695adfe60b7bd7795f3db2122cdc761dcae59a13eab856140b08987ce3
#11 CACHED
#8 [stage-0 2/5] COPY --chown=opam . /src/
#8 sha256:84fcec75053eb6d4d45b6fb8bb78931ad7518971d40c96bfcc6fe7c2b7a09cd0
#8 ERROR: mount already exists with name
failed to solve with frontend dockerfile.v0: failed to solve with frontend gateway.v0: rpc error: code = Unknown desc = failed to build LLB: mount already exists with name
I've spent several hours trying to get a minimal test case, without success; however, it still happens pretty often on our test server.
Here is our daemon log:
Apr 24 00:26:22 m1-d dockerd[37339]: time="2020-04-24T00:26:22.963833995Z" level=info msg="parsed scheme: \"\"" module=grpc
Apr 24 00:26:22 m1-d dockerd[37339]: time="2020-04-24T00:26:22.963920949Z" level=info msg="scheme \"\" not registered, fallback to default scheme" module=grpc
Apr 24 00:26:22 m1-d dockerd[37339]: time="2020-04-24T00:26:22.963962093Z" level=info msg="ccResolverWrapper: sending update to cc: {[{ 0 <nil>}] <nil>}" module=grpc
Apr 24 00:26:22 m1-d dockerd[37339]: time="2020-04-24T00:26:22.963972759Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Apr 24 00:26:30 m1-d dockerd[37339]: time="2020-04-24T00:26:30Z" level=error msg="fatal error: rpc error: code = Unknown desc = failed to build LLB: mount already exists with name"
Apr 24 00:26:30 m1-d dockerd[37339]: panic: rpc error: code = Unknown desc = failed to build LLB: mount already exists with name
Apr 24 00:26:30 m1-d dockerd[37339]: goroutine 1 [running]:
Apr 24 00:26:30 m1-d dockerd[37339]: main.main()
Apr 24 00:26:30 m1-d dockerd[37339]: /src/frontend/dockerfile/cmd/dockerfile-frontend/main.go:26 +0x34f
Apr 24 00:26:31 m1-d dockerd[37339]: time="2020-04-24T00:26:31.282392042Z" level=warning msg="grpc: addrConn.createTransport failed to connect to { 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing only one connection allowed\". Reconnecting..." module=grpc
What version/software is this?
Client: Docker Engine - Community
Version: 19.03.8
API version: 1.40
Go version: go1.12.17
Git commit: afacb8b7f0
Built: Wed Mar 11 01:25:56 2020
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 19.03.8
API version: 1.40 (minimum version 1.12)
Go version: go1.12.17
Git commit: afacb8b7f0
Built: Wed Mar 11 01:24:28 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.2.13
GitCommit: 7ad184331fa3e55e52b890ea95e65ba581ae3429
runc:
Version: 1.0.0-rc10
GitCommit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
docker-init:
Version: 0.18.0
GitCommit: fec3683
We see this sporadically in our CI pipeline -- we have make kick off several parallel builds, and every once in a while a job will fail with the "mount already exists with name" error:
#8 [build 2/4] WORKDIR /src/api
#8 ERROR: mount already exists with name
#9 [internal] load build context
#9 transferring context: 7.09MB 0.1s done
#9 CANCELED
------
> [build 2/4] WORKDIR /src/api:
------
failed to solve with frontend dockerfile.v0: failed to solve with frontend gateway.v0: rpc error: code = Unknown desc = failed to build LLB: mount already exists with name
Makefile:309: recipe for target 'docker-build-swagger' failed
make: *** [docker-build-swagger] Error 1
make: *** Waiting for unfinished jobs....
Could it be because the stages are aliased with the same name? The Dockerfiles that are built at that time both have statements like:
FROM <our-common-build-image>:latest AS build
FYI we hit this on docker:20.10.2. We kick off two builds against the same docker-dind service, using K8s on GitLab CI. Those builds have a lot in common, including overlapping build stage names.
We see this in CI as well when using make to parallelize multiple multi-stage builds that all have the same FROM imagename as builder. I'm wondering if @dotcarls might be on to something.
Yeah we just hit this too:
#9 [stage-1 2/4] COPY --chown=atcloud:atcloud .npmrc package.json /usr/local/autotrader/app/
#9 sha256:be4cfa891678f62a7c246fdab682b5253f37aeb2b9b4b0c0640aa38d3fff47c2
#9 ERROR: mount already exists with name
------
> [build 2/6] COPY --chown=atcloud:atcloud package.json .npmrc /usr/local/autotrader/app/:
------
------
> [stage-1 2/4] COPY --chown=atcloud:atcloud .npmrc package.json /usr/local/autotrader/app/:
------
mount already exists with name
Service 'app' failed to build : Build failed
your build failed! check the logs above to understand the reason, you can recreate locally by running 'docker-compose build'
Two layers from separate stages copying files from the host, and, similar to the above, both of our stages are FROM the same image.
bash-5.1# docker version
Client:
Version: 20.10.8
API version: 1.41
Go version: go1.16.6
Git commit: 3967b7d
Built: Fri Jul 30 19:50:40 2021
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.8
API version: 1.41 (minimum version 1.12)
Go version: go1.16.6
Git commit: 75249d8
Built: Fri Jul 30 19:55:09 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: v1.4.9
GitCommit: e25210fe30a0a703442421b0f60afac609f950a3
runc:
Version: 1.0.1
GitCommit: v1.0.1-0-g4144b638
docker-init:
Version: 0.19.0
GitCommit: de40ad0
We see this in CI as well when using make to parallelize multiple multi-stage builds that all have the same FROM imagename as builder. I'm wondering if @dotcarls might be on to something.
This seems to be a dead end. We renamed our stages to be unique for each parallelized build (i.e. FROM imagename as builder_for_app1, FROM imagename as builder_for_app2, etc.) and still saw this issue.
Seeing this with Docker version 20.10.11, on a gitlab-runner host for a CI pipeline.
How can we investigate to understand what's going on?
We also see this problem with Docker 23.0.1. It appears when we run two simultaneous docker build commands with the same FROM image and the same layers up to the first of the COPY commands, as in the example from user Stono at https://github.com/moby/buildkit/issues/1460#issuecomment-938536403
We were able to work around this by building with "--no-cache", accepting the downsides of that approach.
This leads to the following hypothesis on the root cause:
- Assume a simple Dockerfile like:
FROM myimage:mytag
COPY a/ /a/
COPY b/ /b/
- We run two build processes in parallel, both using the same Dockerfile, but different build contexts.
- In our contexts both folders "a" have the same content, while folders "b" differ in content.
- If caching is switched on, the docker engine will remember each layer for re-use. For a COPY command, the cache key should be determined by the destination path and the content of the source file(s) to copy (in some form, e.g. file names, timestamps, content hashes?).
- As we start two build processes in parallel, both from the same FROM image, and that image has not been used yet, the first build process that hits the first COPY command (on source "a") cannot reuse a cached layer. It therefore executes the COPY and then creates a new cached layer from it.
- If the second build process ran after the first one, it would, at the first COPY command on source "a", see that the source content and the destination are the same and re-use the cached layer.
- However, as we run the build processes simultaneously, the suspicion is that there is a race condition between them: both see that no cached layer exists yet, both build the layer, and both try to add it to the cache (see the sketch after this list).
- If both simultaneously try to add the same layer to the cache, one wins and the other fails with the "mount already exists with name" error.
- Under this hypothesis, the remedy of building with "--no-cache" works because the build processes no longer try to record a layer in the cache and thus never hit the race condition.
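To make the suspected race concrete, here is a minimal, self-contained Go sketch. It is not buildkit or moby code; the names (layerCache, register, cacheKey) are made up for illustration. Two parallel "builds" compute the same content-based key for a COPY step, both miss the cache, and both try to register the resulting layer; whoever loses the race gets an "already exists" style error, which is what the hypothesis predicts.

// race_sketch.go: illustration of the suspected race between two parallel
// builds that compute the identical cache key for a COPY step and both try
// to register the resulting layer.
package main

import (
	"crypto/sha256"
	"errors"
	"fmt"
	"sync"
)

var errAlreadyExists = errors.New("mount already exists with name")

// layerCache stands in for the daemon-side store mapping a key to a layer.
type layerCache struct {
	mu     sync.Mutex
	layers map[string]struct{}
}

// register records a layer under the given key, failing if it is already taken.
func (c *layerCache) register(key string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.layers[key]; ok {
		return errAlreadyExists
	}
	c.layers[key] = struct{}{}
	return nil
}

// cacheKey mimics a content-addressed key for "COPY a/ /a/": identical source
// content and destination produce the identical key in both builds.
func cacheKey(dest string, content []byte) string {
	sum := sha256.Sum256(append([]byte(dest), content...))
	return fmt.Sprintf("%x", sum)
}

func main() {
	cache := &layerCache{layers: map[string]struct{}{}}
	key := cacheKey("/a/", []byte("same content in both build contexts"))

	var wg sync.WaitGroup
	for i := 1; i <= 2; i++ {
		wg.Add(1)
		go func(build int) {
			defer wg.Done()
			// Both builds see a cache miss, perform the COPY, then try to
			// record the new layer under the shared key.
			if err := cache.register(key); err != nil {
				fmt.Printf("build %d: %v\n", build, err)
				return
			}
			fmt.Printf("build %d: layer cached\n", build)
		}(i)
	}
	wg.Wait()
}

Running this prints "layer cached" for exactly one of the two goroutines and the "mount already exists with name" error for the other, regardless of scheduling order.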
Looking through the code for where this error could occur, the only place it is produced is here:
https://github.com/moby/moby/blob/f00569b89154a5ce8c290d5420019bb20a8ab632/layer/layer_store.go#L480-L501
This has 2 callers (for build): https://github.com/moby/moby/blob/f00569b89154a5ce8c290d5420019bb20a8ab632/builder/builder-next/adapters/snapshot/snapshot.go#L280-L285
https://github.com/moby/moby/blob/f00569b89154a5ce8c290d5420019bb20a8ab632/daemon/images/image_builder.go#L62-L63
The "name" here is always a random ID generated by crypto/rand.
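For orientation, here is a simplified Go sketch of the pattern those links describe. The names used (mountStore, createRWLayer, randomID) are illustrative assumptions, not the actual moby symbols; the point is the shape of the logic: a mount name is drawn from crypto/rand, and registering a name that is already present in the store returns exactly this error string.

// layer_name_sketch.go: simplified sketch (not the real moby layer store) of
// a store that refuses to create an RW "mount" under a name that already
// exists, which is the error surfaced in the build logs above.
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
	"sync"
)

var errMountNameConflict = errors.New("mount already exists with name")

type mountStore struct {
	mu     sync.Mutex
	mounts map[string]struct{}
}

// createRWLayer mirrors the shape of the linked code: check whether the name
// is taken, bail out with the conflict error if so, otherwise record it.
func (s *mountStore) createRWLayer(name string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.mounts[name]; ok {
		return errMountNameConflict
	}
	s.mounts[name] = struct{}{}
	return nil
}

// randomID draws 32 random bytes from crypto/rand and hex-encodes them, the
// way the callers are described as generating the mount "name".
func randomID() string {
	b := make([]byte, 32)
	if _, err := rand.Read(b); err != nil {
		panic(err) // crypto/rand failing is not recoverable here
	}
	return hex.EncodeToString(b)
}

func main() {
	store := &mountStore{mounts: map[string]struct{}{}}

	name := randomID()
	fmt.Println(store.createRWLayer(name)) // <nil>
	fmt.Println(store.createRWLayer(name)) // mount already exists with name
}

With a fresh 32-byte random ID on every call an accidental name collision is essentially impossible, so the second createRWLayer call in this sketch reuses the same name deliberately, purely to trigger the error.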