
Bug: Build Skipping prevents Docker cleanup

Open samking opened this issue 3 years ago • 4 comments

My Buildkite jobs have recently been failing because they can't provision enough Docker network interfaces, similar to https://github.com/buildkite/agent/issues/277. Most of our jobs do clean up their network interfaces successfully, though.

I discovered that every job that failed to clean up its network interface did so because the corresponding container was still running, and every one of those containers belonged to a job that was canceled after a few minutes due to build skipping settings.

I haven't verified whether or not there's a zombie container every time build skipping happens.

Let me know if you need any additional information!

samking avatar Dec 21 '21 00:12 samking

My workaround (partly borrowed from https://github.com/buildkite/agent/issues/277):

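# Stop containers that were created days, weeks, or months ago (-r: do nothing if there are none)
sudo docker container ls | grep -E '(days|weeks|months) ago' | awk '{print $1}' | xargs -r sudo docker container stop
# Remove the networks whose listing mentions buildkite
sudo docker network ls | grep buildkite | awk '{print $1}' | xargs -r sudo docker network rm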
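
For anyone who wants to run this on a schedule, here is a rough sketch of the same idea that asks Docker for structured output instead of grepping the human-readable table. The function names are made up for illustration, and it assumes, like the workaround, that anything created days, weeks, or months ago is stale and that the networks to remove have "buildkite" in their name. Those assumptions may not hold on every host.

```shell
#!/usr/bin/env bash
# Sketch only: mirrors the workaround above using docker's --format and
# --filter options. Function names are illustrative, not part of any tool.

# Reads "ID|age" lines on stdin and prints the IDs whose age is reported
# in days, weeks, or months (the same rule the workaround's grep applies).
stale_container_ids() {
  awk -F '|' '$2 ~ /(days|weeks|months) ago$/ { print $1 }'
}

# Stops stale running containers, then tries to remove networks whose
# name contains "buildkite". Networks still in use are left in place.
cleanup_stale_docker_resources() {
  local ids nets

  ids="$(docker container ls --format '{{.ID}}|{{.RunningFor}}' | stale_container_ids)"
  if [ -n "$ids" ]; then
    # Unquoted on purpose: one argument per container ID.
    docker container stop $ids
  fi

  nets="$(docker network ls --filter 'name=buildkite' --format '{{.ID}}')"
  if [ -n "$nets" ]; then
    # Removal fails for networks that still have running containers attached.
    docker network rm $nets || true
  fi
}
```

To use it, call `cleanup_stale_docker_resources` at the end of the script and run the script as a user that is allowed to talk to the Docker daemon (the workaround above uses sudo for that).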

samking avatar Dec 21 '21 00:12 samking

Eeep, thanks so much for reporting @samking. How are you finding that workaround is suiting you? Another option is using a pre-exit agent hook; these still run after a job is canceled.

It would be helpful if you could verify the presence of a zombie container on build skip. This sounds a bit like a build cancellation issue rather than a build skip; can you confirm for me the settings you have configured in your pipeline?

eleanorakh avatar Jan 11 '22 04:01 eleanorakh

Thanks for following up!

> How are you finding that workaround is suiting you?

We ended up turning off "Cancel Intermediate Builds" because the network interfaces would fill up too quickly when people were pushing a lot of changes to their branch (which was triggering new builds to be queued and old builds to be canceled). Having that setting was nice, so it's a little annoying to lose out on it.

Adding a pre-exit agent hook would probably do what we want, but realistically I don't expect us to put in the time to test everything out and get it working.
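
As a starting point, a pre-exit hook along these lines might work. This is an untested sketch, not something confirmed in this thread. It assumes the job's containers and networks are created with a label carrying the job ID; the label key `com.example.buildkite-job-id` is hypothetical, and neither Buildkite nor Docker adds it for you. `BUILDKITE_JOB_ID` is the environment variable the agent sets for each job.

```shell
#!/usr/bin/env bash
# Hypothetical pre-exit hook sketch (e.g. saved as "pre-exit" in the agent's
# hooks directory). ASSUMPTION: the job's containers and networks were
# created with the label "<JOB_LABEL_KEY>=<job id>". That label is invented
# for this example; your pipeline would have to add it itself.
JOB_LABEL_KEY="com.example.buildkite-job-id"

cleanup_job_docker_resources() {
  local job_id="$1"
  local filter="label=${JOB_LABEL_KEY}=${job_id}"
  local containers networks

  # Stop any containers from this job that are still running.
  containers="$(docker container ls --quiet --filter "$filter")"
  if [ -n "$containers" ]; then
    # Unquoted on purpose: one argument per container ID.
    docker container stop $containers || true
  fi

  # Remove the networks this job created, now that nothing is running on them.
  networks="$(docker network ls --quiet --filter "$filter")"
  if [ -n "$networks" ]; then
    docker network rm $networks || true
  fi
}

# Only act when the agent has provided a job ID.
if [ -n "${BUILDKITE_JOB_ID:-}" ]; then
  cleanup_job_docker_resources "$BUILDKITE_JOB_ID"
fi
```

Filtering by a per-job label keeps the hook from touching containers that belong to other jobs on the same host, which the age-based workaround cannot guarantee.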

> It would be helpful if you could verify the presence of a zombie container on build skip. This sounds a bit like a build cancellation issue rather than a build skip; can you confirm for me the settings you have configured in your pipeline?

Sorry for being unclear! To clarify, are you asking if the setting causing problems is "Skip Intermediate Builds" vs "Cancel Intermediate Builds"? You're correct that we were having trouble with the "Cancel Intermediate Builds" setting. "Skip Intermediate Builds" is still enabled and hasn't been causing us problems. (I called it build skipping because they're both grouped together under the "build skipping" heading, and the message about jobs being canceled due to build skipping settings said build skipping as well).

Does that answer your question?

samking avatar Jan 12 '22 00:01 samking

Thanks @samking, this is really helpful! I've added this to our backlog to dive into.

eleanorakh avatar Jan 18 '22 00:01 eleanorakh