Buildkite checkout fails, and deletes most of the repo, if any user-owned files in the checkout path are non-writable
Agent User: buildkite-agent
Example Checkout path: /var/lib/buildkite-agent/software
Read-only File: /var/lib/buildkite-agent/software/tmp/file
ls -l /var/lib/buildkite-agent/software/tmp/file
-r-xr-xr-x 1 buildkite-agent buildkite-agent 186824 Jan 11 11:49 /var/lib/buildkite-agent/software/tmp/file
In our case, a previous build rsyncs to the checkout directory, and writes tmp/file. While the owner is buildkite-agent, the permissions are set as u+rx and u-w, aka the file is "read-only".
On the next build will fail to git clean, and then it will try to delete the whole repo (and fail), and then try to clone (and fail). However, the user actually owns those files, and buildkite-agent could simply chmod o+w any files that it tries to delete, or rm -f <file>. Assuming buildkite-agent is using a filesystem interface in golang, it's likely some argument would have to change for it to delete write-only files.
Huh, that does sound awkward. We'll take a look at reproducing it and see how we go.
In the interim, a hook (maybe an in-repo one?) that fixes the permissions prior to the job starting could be a good workaround.
We are constantly encountering this issue on Windows machines where the initial git checkout process is overzealous and ends up nuking the entire repository after a permissions failure. Permissions on Windows are notoriously annoying and a dangling process (or even an open Explorer window) ends up holding a lock on the folder causing the agent to try deleting the entire repository. We are working with 500+ GB repository and the effect of this is devastating.
- Can we make the deletion optional (by flag) and simply abort the process instead of attempting to delete?
- Would a
pre-checkouthook help here? There really isn't a case for us to ever delete the repo and we instead just perform a manual intervention (say reboot the machine to free the lock)
⚠️ Warning: Checkout failed! exit status 3221225786 (Attempt 1/3 Retrying in 2s)
# Removing C:\builds\builder-1\org\repo
@marwanhilmi oof, sorry about that. Handling huge GitHub repos is particularly challenging and something we've tried to make a pass at making better, and it's on our radar for improving soon.
I'd recommend trying to reduce the size of the repo by performing a shallow clone, there's some more context @sj26 wrote over in https://github.com/buildkite/agent/issues/437#issuecomment-708068425
It also would be worth checking out the git mirrors experiment flags to see if that will help. It can be used to set up a local cache of the git repo so that checkouts can be faster. https://buildkite.com/docs/agent/v3/configuration#git-mirrors-path
Would a flag be accepted? BUILDKITE_EXIT_ON_CHECKOUT_FAILURE=true
https://github.com/stevelacy/agent/commit/4db6f24376ec8f63191ac707747e93f9235e796c
We are using LFS - even a shallow clone will be huge.
🤔 we could discuss it. I think the preferred approach in that situation would be to just override the checkout hook as @yob suggested so that it doesn't purge the directory, is there a reason why that wouldn't work here?