Bug: Build Skipping prevents Docker cleanup
My Buildkite jobs have recently been failing because they cannot provision enough Docker network interfaces, similar to https://github.com/buildkite/agent/issues/277. However, most of our jobs successfully clean up their network interfaces.
I discovered that every job that failed to clean up its network interfaces did so because the corresponding container was still running, and each of those containers belonged to a job that was canceled after a few minutes due to build skipping settings.
I haven't verified whether or not there's a zombie container every time build skipping happens.
Let me know if you need any additional information!
My workaround (partly borrowed from https://github.com/buildkite/agent/issues/277):
sudo docker container ls | grep -E '(days|weeks|months) ago' | awk '{print $1}' | xargs sudo docker container stop
sudo docker network ls | grep buildkite | awk '{print $1}' | xargs sudo docker network rm
Eeep, thanks so much for reporting @samking. How are you finding that workaround is suiting you? Another option is using a pre-exit agent hook; these run even after a job is canceled.
It would be helpful if you could verify the presence of a zombie container on build skip. This sounds a bit like a build cancellation issue rather than a build skip; can you confirm for me the settings you have configured in your pipeline?
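For anyone wanting to try the pre-exit hook route, here is a minimal sketch of what such a hook might look like. It is not an official fix. It assumes the job's containers and networks are named with the prefix `buildkite` followed by the job ID with dashes removed, which may not match your setup, and it assumes an `xargs` that supports `-r`. The function names `job_prefix`, `filter_by_prefix`, and `cleanup_job` are made up for illustration. The original workaround used `sudo`; add it back if your agent user needs it.

```shell
# Hypothetical pre-exit hook sketch (for example, saved as hooks/pre-exit).
# ASSUMPTION: Docker resources for a job are named with the prefix
# "buildkite<job id without dashes>". Verify this against your own setup.

# Build the assumed name prefix from a job ID.
job_prefix() {
  printf 'buildkite%s' "$(printf '%s' "$1" | tr -d '-')"
}

# Read names on stdin and keep only those starting with the given prefix.
# "|| true" keeps the pipeline from reporting failure when nothing matches.
filter_by_prefix() {
  grep -- "^$1" || true
}

# Stop leftover containers for the job, then remove its networks.
cleanup_job() {
  prefix="$(job_prefix "$1")"
  docker container ls --format '{{.Names}}' \
    | filter_by_prefix "$prefix" \
    | xargs -r docker container stop
  docker network ls --format '{{.Name}}' \
    | filter_by_prefix "$prefix" \
    | xargs -r docker network rm
}

# Only act when running inside a job on a host that has Docker installed.
if [ -n "${BUILDKITE_JOB_ID:-}" ] && command -v docker >/dev/null 2>&1; then
  cleanup_job "$BUILDKITE_JOB_ID" || true
fi
```

Unlike the age-based workaround above, this only touches resources whose names match the current job, so it should not stop containers belonging to other jobs running on the same host.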
Thanks for following up!
How are you finding that workaround is suiting you?
We ended up turning off "Cancel Intermediate Builds" because the network interfaces would fill up too quickly when people were pushing a lot of changes to their branch (which was triggering new builds to be queued and old builds to be canceled). Having that setting was nice, so it's a little annoying to lose out on it.
Adding a pre-exit agent hook would probably do what we want, but realistically I don't expect us to put in the time to test everything out and get it working.
It would be helpful if you could verify the presence of a zombie container on build skip. This sounds a bit like a build cancellation issue rather than a build skip; can you confirm for me the settings you have configured in your pipeline?
Sorry for being unclear! To clarify, are you asking if the setting causing problems is "Skip Intermediate Builds" vs "Cancel Intermediate Builds"? You're correct that we were having trouble with the "Cancel Intermediate Builds" setting. "Skip Intermediate Builds" is still enabled and hasn't been causing us problems. (I called it build skipping because they're both grouped together under the "build skipping" heading, and the message about jobs being canceled due to build skipping settings said build skipping as well).
Does that answer your question?
Thanks @samking, this is really helpful! I've added this to our backlog to dive into.