agent icon indicating copy to clipboard operation
agent copied to clipboard

Buildkite checkout fails, and deletes most of the repo, if any user-owned files in the checkout path are non-writable

Open ccarpita opened this issue 3 years ago • 5 comments

Agent User: buildkite-agent Example Checkout path: /var/lib/buildkite-agent/software Read-only File: /var/lib/buildkite-agent/software/tmp/file

ls -l /var/lib/buildkite-agent/software/tmp/file
-r-xr-xr-x 1 buildkite-agent buildkite-agent 186824 Jan 11 11:49 /var/lib/buildkite-agent/software/tmp/file

In our case, a previous build rsyncs to the checkout directory, and writes tmp/file. While the owner is buildkite-agent, the permissions are set as u+rx and u-w, aka the file is "read-only".

On the next build will fail to git clean, and then it will try to delete the whole repo (and fail), and then try to clone (and fail). However, the user actually owns those files, and buildkite-agent could simply chmod o+w any files that it tries to delete, or rm -f <file>. Assuming buildkite-agent is using a filesystem interface in golang, it's likely some argument would have to change for it to delete write-only files.

ccarpita avatar Jan 28 '21 15:01 ccarpita

Huh, that does sound awkward. We'll take a look at reproducing it and see how we go.

In the interim, a hook (maybe an in-repo one?) that fixes the permissions prior to the job starting could be a good workaround.

yob avatar Jan 29 '21 02:01 yob

We are constantly encountering this issue on Windows machines where the initial git checkout process is overzealous and ends up nuking the entire repository after a permissions failure. Permissions on Windows are notoriously annoying and a dangling process (or even an open Explorer window) ends up holding a lock on the folder causing the agent to try deleting the entire repository. We are working with 500+ GB repository and the effect of this is devastating.

  • Can we make the deletion optional (by flag) and simply abort the process instead of attempting to delete?
  • Would a pre-checkout hook help here? There really isn't a case for us to ever delete the repo and we instead just perform a manual intervention (say reboot the machine to free the lock)
⚠️ Warning: Checkout failed! exit status 3221225786 (Attempt 1/3 Retrying in 2s)
# Removing C:\builds\builder-1\org\repo

marwanhilmi avatar Apr 05 '21 20:04 marwanhilmi

@marwanhilmi oof, sorry about that. Handling huge GitHub repos is particularly challenging and something we've tried to make a pass at making better, and it's on our radar for improving soon.

I'd recommend trying to reduce the size of the repo by performing a shallow clone, there's some more context @sj26 wrote over in https://github.com/buildkite/agent/issues/437#issuecomment-708068425

It also would be worth checking out the git mirrors experiment flags to see if that will help. It can be used to set up a local cache of the git repo so that checkouts can be faster. https://buildkite.com/docs/agent/v3/configuration#git-mirrors-path

chloeruka avatar Apr 06 '21 03:04 chloeruka

Would a flag be accepted? BUILDKITE_EXIT_ON_CHECKOUT_FAILURE=true https://github.com/stevelacy/agent/commit/4db6f24376ec8f63191ac707747e93f9235e796c

We are using LFS - even a shallow clone will be huge.

stephenlacy avatar Apr 06 '21 15:04 stephenlacy

🤔 we could discuss it. I think the preferred approach in that situation would be to just override the checkout hook as @yob suggested so that it doesn't purge the directory, is there a reason why that wouldn't work here?

chloeruka avatar Apr 07 '21 06:04 chloeruka