Docker buildkit stuck with high CPU and unresponsive
Symptoms: Every so often Docker builds break (fail to complete), and upon further inspection most of the CPU on the system is consumed by the docker process itself. If we wait long enough things recover, but that can take a while.
When running pprof (curl -o pprof --unix-socket /var/run/docker.sock http://./debug/pprof/profile?seconds=60) we get the following, showing docker is spending its time in buildkit/solver.
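The capture-and-inspect workflow can be sketched as below. The socket path and duration are taken from the command above; the analysis step is an assumption about how one would typically read such a profile, not something stated in this thread.

```shell
# Capture a 60-second CPU profile from the Docker daemon over its Unix socket.
SOCK=/var/run/docker.sock
OUT=pprof.out

curl -s -o "$OUT" --unix-socket "$SOCK" \
  'http://./debug/pprof/profile?seconds=60' || true   # daemon may be unreachable

# Summarize the hottest functions; with the symptom described above,
# buildkit/solver frames would be expected near the top of this list.
go tool pprof -top "$OUT" || true   # requires a Go toolchain
```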
Environment:
ii docker-buildx-plugin 0.14.0-1~ubuntu.22.04~jammy amd64
ii docker-ce 5:26.0.2-1~ubuntu.22.04~jammy amd64
ii docker-ce-cli 5:26.0.2-1~ubuntu.22.04~jammy amd64
ii docker-compose-plugin 2.27.0-1~ubuntu.22.04~jammy amd64
Large build systems (1+ TB disk, 50+ cores) accessed as a remote Docker build host, as per docker context inspect -f '{{json .Endpoints.docker.Host}}', so we have lots of concurrent builds.
Looks similar to https://github.com/moby/buildkit/pull/4917#issuecomment-2109644009 . Do you have an example case or parameters for such builds? If you can provide us a reproducible case, that would help a lot. I assume it is using remote cache export, as that is what is visible from the trace.
You can also try https://github.com/moby/buildkit/blob/master/.github/issue_reporting_guide.md#reporting-deadlock when it looks to be hanging.
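The stack-dump step from that guide can be sketched as below: take two full goroutine dumps some time apart so goroutines that are genuinely stuck (rather than briefly busy) can be identified by comparison. The socket path is an example; the interval is shortened here and would normally be on the order of a minute.

```shell
# Capture two full goroutine stack dumps (debug=2) from the Docker daemon
# so a hang can be diagnosed by diffing which goroutines did not move.
SOCK=/var/run/docker.sock
WAIT=1   # use ~60 seconds on a real capture

curl -s --unix-socket "$SOCK" 'http://localhost/debug/pprof/goroutine?debug=2' \
  > goroutines-1.txt || true   # daemon may be unreachable outside the build host
sleep "$WAIT"
curl -s --unix-socket "$SOCK" 'http://localhost/debug/pprof/goroutine?debug=2' \
  > goroutines-2.txt || true
```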
I don't have an example of how to reproduce it, but we do have some very large Dockerfiles (several hundred RUN commands, though in a multi-stage build, so the manifest has fewer than 100 layers), so it could be related. Next time it happens I will follow the link you shared and update this ticket with what I gather.
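The Dockerfile shape described above can be sketched as follows. All names and commands here are hypothetical illustrations: the point is that hundreds of RUN steps live in a builder stage, while only the final stage's layers reach the image manifest.

```shell
# Write an illustrative multi-stage Dockerfile: many RUN layers in the
# builder stage, few layers in the final image.
cat > Dockerfile.example <<'EOF'
FROM ubuntu:22.04 AS builder
RUN apt-get update
RUN make step1
RUN make step2
# ...several hundred more RUN commands like these...

FROM ubuntu:22.04
# Only the layers of this final stage appear in the image manifest.
COPY --from=builder /out /out
EOF
```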
@jogo-openai And are you using --export-cache?
Just checked; it doesn't look like we are. I checked based on https://docs.docker.com/build/cache/backends/
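For reference, cache export on the buildx CLI is requested with --cache-to (the buildctl equivalent of --export-cache), so its absence from build invocations is what was checked. The registry reference below is a hypothetical placeholder.

```shell
# Hypothetical examples of what cache-export usage would look like;
# grepping build invocations for flags like these found none.
CACHE_TO='type=registry,ref=registry.example.com/app:buildcache'

docker buildx build --cache-to "$CACHE_TO" . || true        # registry backend
docker buildx build --cache-to type=local,dest=/tmp/cache . || true  # local backend
```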
@tonistiigi hope this helps:
Attached are two dumps from running debug/pprof/goroutine?debug=2, as per https://github.com/moby/buildkit/blob/master/.github/issue_reporting_guide.md#reporting-deadlock
There seem to be multiple ongoing builds in the trace that are in the middle of creating provenance. This code reuses the cache-export codepath (which confused me before) to find all the cache sources that have layer chains associated with them.
I improved the performance of this part in #4947, which makes quite a big difference in my measurements, but as your trace shows the currently active function is addBacklinks, I'm not sure it does for you. For provenance creation we don't actually need to create new cache relationships (these would only be needed for an actual cache export), so I think we can fix your issue by skipping these calls. But I would like to get to the bottom of what case is causing so many such requests. It seems to be some combination of which commands you run and how they are shared between parallel builds.
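Since the hot path above is provenance creation, one user-side mitigation worth noting is disabling the provenance attestation per build, which skips that codepath entirely. This is an assumption drawn from the analysis above, not a workaround suggested in this thread, and it does mean losing the provenance attestation; the image tag is a placeholder.

```shell
# Possible mitigation sketch: build without generating a provenance
# attestation, avoiding the cache-source traversal described above.
PROVENANCE=false

docker buildx build --provenance="$PROVENANCE" -t app:dev . || true  # needs a daemon
```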
Thank you @AkihiroSuda!
Thank you for the fix; unfortunately, we are still seeing the same issue with the latest release.
https://github.com/docker/buildx/releases/tag/v0.15.1 should include BuildKit 0.14.1, and BuildKit 0.14 has this fix: https://github.com/moby/buildkit/releases/tag/v0.14.0
ii docker-ce 5:26.0.2-1~ubuntu.22.04~jammy amd64 Docker: the open-source application container engine
ii docker-ce-cli 5:26.0.2-1~ubuntu.22.04~jammy amd64 Docker CLI: the open-source application container engine
ii docker-compose-plugin 2.28.1-1~ubuntu.22.04~jammy amd64 Docker Compose (V2) plugin for the Docker CLI.
Attached is the debug output from running:
curl --unix-socket /var/run/docker.sock http://localhost/debug/pprof/goroutine?debug=2