Docker buildkit stuck with high CPU and unresponsive

Open jogo-openai opened this issue 1 year ago • 6 comments

Symptoms: Every so often docker builds fail to complete, and on further inspection most of the CPU on the system is consumed by the docker process itself. If we wait long enough things recover, but that can take a while.

When running pprof (curl -o pprof --unix-socket /var/run/docker.sock http://./debug/pprof/profile?seconds=60) we get the following, showing that docker is spending its time in buildkit/solver:

image
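
For anyone who would rather script the capture than call curl by hand, here is a minimal Python sketch that fetches a debug endpoint over the Docker Unix socket. The names UnixHTTPConnection and fetch_debug are made up for illustration; the socket path and endpoint are the ones used in this issue. It assumes the daemon exposes the /debug/pprof endpoints on that socket, as it did for the profile above.

```python
import http.client
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP connection that talks to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout=120):
        # The host name is only used for the Host header.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def fetch_debug(socket_path, path, timeout=120):
    """GET `path` from the daemon listening on `socket_path`; return the body as bytes."""
    conn = UnixHTTPConnection(socket_path, timeout=timeout)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        if resp.status != 200:
            raise RuntimeError(f"unexpected HTTP status {resp.status} for {path}")
        return resp.read()
    finally:
        conn.close()


# Example call (not executed here because it needs a running daemon):
# profile = fetch_debug("/var/run/docker.sock", "/debug/pprof/profile?seconds=60")
```

The timeout has to be longer than the profile duration, since a 60 second CPU profile only returns once the sampling window has finished.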

environment:

ii  docker-buildx-plugin          0.14.0-1~ubuntu.22.04~jammy             amd64
ii  docker-ce                     5:26.0.2-1~ubuntu.22.04~jammy           amd64  
ii  docker-ce-cli                 5:26.0.2-1~ubuntu.22.04~jammy           amd64
ii  docker-compose-plugin         2.27.0-1~ubuntu.22.04~jammy             amd64

These are large build systems (1+ TB disk, 50+ cores) that are accessed as a remote docker build host (as per docker context inspect -f '{{json .Endpoints.docker.Host}}'), so we have lots of concurrent builds.

jogo-openai avatar May 21 '24 19:05 jogo-openai

This looks similar to https://github.com/moby/buildkit/pull/4917#issuecomment-2109644009 . Do you have an example case or parameters for such builds? If you can provide us with a reproducible case, that would help a lot. I assume it is using remote cache export, as that is what is visible in the trace.

You can also try https://github.com/moby/buildkit/blob/master/.github/issue_reporting_guide.md#reporting-deadlock when it looks to be hanging.

tonistiigi avatar May 22 '24 00:05 tonistiigi

I don't have an example of how to reproduce, but we do have some very large Dockerfiles (several hundred RUN commands, but in a multi-stage docker build, so the manifest has fewer than 100 layers), so it could be related. Next time it happens I will follow the link you shared and update this ticket with what I gather.

jogo-openai avatar May 22 '24 16:05 jogo-openai

@jogo-openai And you are using --export-cache ?

tonistiigi avatar May 22 '24 16:05 tonistiigi

Just checked; it doesn't look like we are. I checked based on https://docs.docker.com/build/cache/backends/

jogo-openai avatar May 22 '24 16:05 jogo-openai

@tonistiigi hope this helps:

Attached are two dumps from running debug/pprof/goroutine?debug=2 as per https://github.com/moby/buildkit/blob/master/.github/issue_reporting_guide.md#reporting-deadlock

dump-2.txt dump.txt
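
In case it helps others read dumps like these, below is a rough Python sketch that groups goroutines by state and top stack frame, which makes it easier to see where most of them are sitting. It assumes the usual debug=2 text layout (a "goroutine N [state]:" header followed by unindented function lines and indented file:line lines). The function names parse_goroutines and summarize are made up, and SAMPLE is synthetic data with placeholder package paths, not an excerpt from the attached files.

```python
import re
from collections import Counter

# Header line of one goroutine, e.g. "goroutine 3 [select, 5 minutes]:"
HEADER = re.compile(r"^goroutine (\d+)\b.*\[([^\]]*)\]:\s*$")


def parse_goroutines(dump_text):
    """Split a debug=2 goroutine dump into records with id, state and frames."""
    records = []
    current = None
    for line in dump_text.splitlines():
        match = HEADER.match(line)
        if match:
            # Keep only the state, dropping suffixes such as ", 5 minutes".
            state = match.group(2).split(",")[0].strip()
            current = {"id": int(match.group(1)), "state": state, "frames": []}
            records.append(current)
        elif current is not None and line and not line[0].isspace():
            # Unindented lines are function calls; indented ones are file:line.
            if line.startswith("created by "):
                continue
            # Strip the argument list so identical functions group together.
            current["frames"].append(line[: line.rfind("(")] if "(" in line else line)
    return records


def summarize(dump_text, top=10):
    """Return the most common (state, top frame) pairs with their counts."""
    counts = Counter(
        (g["state"], g["frames"][0] if g["frames"] else "<no frames>")
        for g in parse_goroutines(dump_text)
    )
    return counts.most_common(top)


# Synthetic sample; package paths and addresses are placeholders.
SAMPLE = """\
goroutine 1 [running]:
example.com/solver.addBacklinks(0xc000010000, 0x1)
\t/src/solver/exporter.go:10 +0x1d
example.com/solver.(*exporter).ExportTo(0xc000020000)
\t/src/solver/exporter.go:20 +0x2a

goroutine 2 [running]:
example.com/solver.addBacklinks(0xc000030000, 0x2)
\t/src/solver/exporter.go:10 +0x1d

goroutine 3 [select, 5 minutes]:
example.com/session.(*Manager).handle(0xc000040000)
\t/src/session/manager.go:30 +0x3b
created by example.com/session.(*Manager).Run in goroutine 1
\t/src/session/manager.go:12 +0x4c
"""

if __name__ == "__main__":
    for (state, func), count in summarize(SAMPLE):
        print(f"{count:5d}  [{state}]  {func}")
```

To run it against a real dump, replace SAMPLE with the contents of the file, for example open("dump.txt").read().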

jogo-openai avatar May 22 '24 20:05 jogo-openai

There seem to be multiple ongoing builds in the trace that are in the middle of creating provenance. This code reuses the cache export codepath (that confused me before) to find all the cache sources that have layer chains associated with them.

I improved the performance of this part in #4947, which makes quite a big difference in my measurements, but since your trace shows that the currently active function is addBacklinks, I'm not sure it does for you. For provenance creation we don't actually need to create new cache relationships (these would only be needed in an actual cache export), so I think we can fix your issue by skipping these calls. But I would like to get to the bottom of what case is causing a lot of such requests. It seems to be some combination of what commands you run and how they are shared between parallel builds.
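
To check how many goroutines in a dump are inside a given function such as addBacklinks, something like the following Python sketch could be used. It is only an illustration: goroutines_calling is a made-up helper, it matches on the last dot-separated segment of each frame (so closures like name.func1 are not counted), and SAMPLE is synthetic data with placeholder package paths rather than the real trace.

```python
import re

# Header line of one goroutine, e.g. "goroutine 3 [select, 5 minutes]:"
HEADER = re.compile(r"^goroutine (\d+)\b.*\[([^\]]*)\]:\s*$")


def goroutines_calling(dump_text, func_name):
    """Return (goroutine id, state, matching frame count) for stacks containing func_name."""
    hits = []
    # Goroutines in a debug=2 dump are separated by blank lines.
    for block in re.split(r"\n\s*\n", dump_text.strip()):
        lines = block.splitlines()
        header = HEADER.match(lines[0]) if lines else None
        if header is None:
            continue
        # Function frames are the unindented lines; drop the argument list.
        frames = [
            line[: line.rfind("(")]
            for line in lines[1:]
            if line and not line[0].isspace() and "(" in line
        ]
        matches = sum(1 for frame in frames if frame.split(".")[-1] == func_name)
        if matches:
            hits.append((int(header.group(1)), header.group(2), matches))
    return hits


# Synthetic sample; package paths and addresses are placeholders.
SAMPLE = """\
goroutine 1 [running]:
example.com/solver.addBacklinks(0xc000010000, 0x1)
\t/src/solver/exporter.go:10 +0x1d
example.com/solver.addBacklinks(0xc000010000, 0x2)
\t/src/solver/exporter.go:15 +0x2a
example.com/solver.(*exporter).ExportTo(0xc000020000)
\t/src/solver/exporter.go:20 +0x3b

goroutine 2 [running]:
example.com/solver.addBacklinks(0xc000030000, 0x2)
\t/src/solver/exporter.go:10 +0x1d

goroutine 3 [select, 5 minutes]:
example.com/session.(*Manager).handle(0xc000040000)
\t/src/session/manager.go:30 +0x3b
"""

if __name__ == "__main__":
    for gid, state, count in goroutines_calling(SAMPLE, "addBacklinks"):
        print(f"goroutine {gid} [{state}]: {count} matching frame(s)")
```

Comparing the result across two dumps taken a few minutes apart gives a rough idea of whether the same goroutines stay in that function or whether new ones keep entering it.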

tonistiigi avatar May 23 '24 18:05 tonistiigi

Thank you @AkihiroSuda!

jogo-openai avatar May 31 '24 15:05 jogo-openai

Thank you for the fix. Unfortunately, we are still seeing the same issue with the latest release.

https://github.com/docker/buildx/releases/tag/v0.15.1 should have buildkit 0.14.1, and buildkit 0.14 has this fix: https://github.com/moby/buildkit/releases/tag/v0.14.0

ii  docker-ce                     5:26.0.2-1~ubuntu.22.04~jammy           amd64        Docker: the open-source application container engine
ii  docker-ce-cli                 5:26.0.2-1~ubuntu.22.04~jammy           amd64        Docker CLI: the open-source application container engine
ii  docker-compose-plugin         2.28.1-1~ubuntu.22.04~jammy             amd64        Docker Compose (V2) plugin for the Docker CLI.

Attached is the debug output of curl --unix-socket /var/run/docker.sock http://localhost/debug/pprof/goroutine?debug=2

log.txt

jogo-openai avatar Jun 27 '24 20:06 jogo-openai