Checkout bricks a self-hosted runner and fails to recover
Something went wrong, and all of our self-hosted runners checked out bad .git folders or somehow corrupted them. It happened on around 13 of our runners at the same time. I think it was a random occurrence. To recover, I had to manually log in and delete the repository folder on each runner, and then everything was fine.
Here are our logs:
2023-01-30T02:56:34.9249114Z Waiting for a runner to pick up this job...
2023-01-30T04:54:24.3969588Z Job is about to start running on the runner: XXXXXXXXXXXXXXXXXXXXXXXX (organization)
2023-01-30T04:54:29.3070556Z Current runner version: '2.301.1'
2023-01-30T04:54:29.3077744Z Runner name: 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
2023-01-30T04:54:29.3078128Z Runner group name: 'Default'
2023-01-30T04:54:29.3078642Z Machine name: 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
2023-01-30T04:54:29.3080746Z ##[group]GITHUB_TOKEN Permissions
2023-01-30T04:54:29.3081343Z Actions: write
2023-01-30T04:54:29.3081520Z Checks: write
2023-01-30T04:54:29.3081693Z Contents: write
2023-01-30T04:54:29.3081906Z Deployments: write
2023-01-30T04:54:29.3082186Z Discussions: write
2023-01-30T04:54:29.3082429Z Issues: write
2023-01-30T04:54:29.3082608Z Metadata: read
2023-01-30T04:54:29.3082779Z Packages: write
2023-01-30T04:54:29.3082958Z Pages: write
2023-01-30T04:54:29.3083147Z PullRequests: write
2023-01-30T04:54:29.3083476Z RepositoryProjects: write
2023-01-30T04:54:29.3083696Z SecurityEvents: write
2023-01-30T04:54:29.3083888Z Statuses: write
2023-01-30T04:54:29.3084056Z ##[endgroup]
2023-01-30T04:54:29.3087171Z Secret source: Actions
2023-01-30T04:54:29.3087569Z Prepare workflow directory
2023-01-30T04:54:29.4388409Z Prepare all required actions
2023-01-30T04:54:29.4550014Z Getting action download info
2023-01-30T04:54:29.8524043Z Download action repository 'actions/checkout@v3' (SHA:ac593985615ec2ede58e132d2e21d2b1cbd6127c)
2023-01-30T04:54:30.9083915Z Complete job name: XXXXXXXXXXXXXXXXXXXXXXXX
2023-01-30T04:54:31.0985565Z ##[group]Run actions/checkout@v3
2023-01-30T04:54:31.0985877Z with:
2023-01-30T04:54:31.0986059Z repository: XXXXXXXX/XXXXXXXX
2023-01-30T04:54:31.0986462Z token: ***
2023-01-30T04:54:31.0986609Z ssh-strict: true
2023-01-30T04:54:31.0986786Z persist-credentials: true
2023-01-30T04:54:31.0986951Z clean: true
2023-01-30T04:54:31.0987092Z fetch-depth: 1
2023-01-30T04:54:31.0987234Z lfs: false
2023-01-30T04:54:31.0987377Z submodules: false
2023-01-30T04:54:31.0987547Z set-safe-directory: true
2023-01-30T04:54:31.0987702Z env:
2023-01-30T04:54:31.0987887Z TMP: C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX/.temp
2023-01-30T04:54:31.0988151Z TEMP: C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX/.temp
2023-01-30T04:54:31.0988398Z TMPDIR: C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX/.temp
2023-01-30T04:54:31.0988665Z MATLAB_PREFDIR: C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX/.preferences
2023-01-30T04:54:31.0988870Z ##[endgroup]
2023-01-30T04:54:34.6968863Z Syncing repository: XXXXXXXX/XXXXXXXX
2023-01-30T04:54:34.6970512Z ##[group]Getting Git version info
2023-01-30T04:54:34.6970936Z Working directory is 'C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX'
2023-01-30T04:54:34.6971402Z [command]"C:\Program Files\Git\cmd\git.exe" version
2023-01-30T04:54:34.7493487Z git version 2.36.1.windows.1
2023-01-30T04:54:34.7592122Z ##[endgroup]
2023-01-30T04:54:34.7607048Z Temporarily overriding HOME='C:\runner\e595c9b9\_work\_temp\bcafa367-f8cb-4d31-84b1-63d10aaaabed' before making global git config changes
2023-01-30T04:54:34.7607516Z Adding repository directory to the temporary git global config as a safe directory
2023-01-30T04:54:34.7608114Z [command]"C:\Program Files\Git\cmd\git.exe" config --global --add safe.directory C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX
2023-01-30T04:54:34.8483251Z [command]"C:\Program Files\Git\cmd\git.exe" config --local --get remote.origin.url
2023-01-30T04:54:34.8992096Z ##[error]fatal: --local can only be used inside a git repository
2023-01-30T04:54:34.9013542Z Deleting the contents of 'C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX'
2023-01-30T04:54:35.0573716Z ##[error]EPERM: operation not permitted, unlink 'C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX\.git'
2023-01-30T04:54:35.4710729Z Post job cleanup.
2023-01-30T04:54:38.8875206Z Cleaning up orphan processes
In this case, checkout seems to bail out fatally: after the error "fatal: --local can only be used inside a git repository", the run ends immediately and does not try to continue.
This effectively bricked the runner, because any job the bad runner picked up failed instantly. Worse, the bad runner would grab every remaining job in the queue and fail them almost instantly, which unfortunately made quite a mess of our job history.
Since the resolution was simply to log in and delete the offending folder, it would be nice if checkout automatically nuked the folder and retried once.
It seems like it tried this:
2023-01-30T04:54:34.9013542Z Deleting the contents of 'C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX'
2023-01-30T04:54:35.0573716Z ##[error]EPERM: operation not permitted, unlink 'C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX\.git'
I am not sure why that didn't work, since I was able to log in and rm the folder just fine as the same user. In any case, all 13 runners failed to delete the folder automatically.
To reproduce, I would suggest:
- Install a self-hosted runner on Windows Server 2022, running as a service under a non-admin service user (e.g. Bob)
- Set up an action that checks out a repository
- Manually corrupt the .git folder, e.g. by adding extra random files into it (see the sketch after this list)
- Ensure that git config --local --get remote.origin.url fails
- Observe that subsequent jobs acquired by this runner fail instantly and that it never recovers
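For step 3, a minimal way to simulate the corruption (my own assumption, not something from the original report) is a throwaway workflow step that removes the workspace's .git directory, so the next job's checkout hits exactly the "fatal: --local can only be used inside a git repository" error; the step name and the corruption method are made up for testing only:

- name: Corrupt workspace repo (test-only, hypothetical step)
  shell: pwsh
  run: |
    # Assumption: deleting .git is enough to make the next checkout's
    # "git config --local --get remote.origin.url" fail with
    # "fatal: --local can only be used inside a git repository".
    $gitDir = Join-Path $env:GITHUB_WORKSPACE '.git'
    if (Test-Path $gitDir) { Remove-Item $gitDir -Recurse -Force }

Note this only reproduces the fatal checkout error; reproducing the EPERM during the automatic cleanup would additionally require something (a service, antivirus scan, or stray process) to hold a handle on the folder.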
Depending on how this is addressed, it could also fix other issues, e.g. https://github.com/actions/checkout/issues/933, since that submodule-corruption issue is also resolved by simply deleting the repo and letting the runner do a fresh clone (https://github.com/actions/checkout/issues/988#issuecomment-1292232838).
For example, as a broad workaround, checkout could give up on reusing the existing git repository whenever any git command faults, and instead delete the folder and check out the repository from scratch.
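Until something like that is built into the action, a user-side approximation (my own sketch, not part of actions/checkout; the step name is invented) is a guard step before checkout that validates the existing workspace with git rev-parse and wipes it if the check fails:

- name: Validate or reset workspace (hypothetical guard step)
  shell: pwsh
  run: |
    # Assumption: if "git rev-parse --git-dir" fails, the existing workspace
    # repo is unusable, so wipe it and let actions/checkout clone fresh.
    git -C $env:GITHUB_WORKSPACE rev-parse --git-dir
    if ($LASTEXITCODE -ne 0) {
      Set-Location $env:RUNNER_TEMP   # step outside the workspace so it can be removed
      Remove-Item $env:GITHUB_WORKSPACE -Recurse -Force -ErrorAction SilentlyContinue
      New-Item -ItemType Directory -Force -Path $env:GITHUB_WORKSPACE | Out-Null
    }
    exit 0   # the guard itself should never fail the job
- uses: actions/checkout@v3

This only papers over the problem from the workflow side; a stuck process or permission issue (like the EPERM above) can still make the delete fail, which is why a fix inside the action or runner would be preferable.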
Some time ago a fix for this was introduced in https://github.com/actions/checkout/pull/964, but it seems it doesn't solve the issue. I might be wrong.
We are using checkout v3 and this still seems to be an issue.
Hi, also running into this issue.
Does anyone have a workaround for this?
Hi... how do you fix the runners? Please send me your txt.
I have been using the following workaround while waiting for the fix:
- name: checkout
  id: checkout
  uses: actions/checkout@v3
  with:
    ref: ${{ inputs.ref }}
    submodules: "recursive"
    token: ${{ secrets.token }}
- name: cleanup runner workspace
  run: |
    echo $GITHUB_WORKSPACE
    rm -rf $GITHUB_WORKSPACE
    mkdir $GITHUB_WORKSPACE
  shell: bash
  if: ${{ failure() && steps.checkout.conclusion == 'failure' }}
This at least prevents the runner from being bricked if checkout fails, whether due to a corrupted .git folder or bad submodules.
Good workaround, thanks!
I just wanted to add that I ran into this one today:
Warning: Unable to clean or reset the repository. The repository will be recreated instead.
Deleting the contents of 'C:\runner\31f270db\_work\aaaa\bbbb'
Error: File was unable to be removed Error: EBUSY: resource busy or locked, rmdir 'C:\runner\31f270db\_work\aaaa\bbbb\work'
It then went ahead and gobbled up all the remaining jobs in the entire queue and failed them all with the same error.
Edit: The above seems to be a separate issue from the one in the first post. This time there was a stray cc1plus process hanging around that held a lock on a directory inside the git folder, and it was preventing git clean from running. I don't expect the checkout action to hunt down and kill processes, but I think I will fix this with a PowerShell script.
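For what it's worth, a pre-checkout step along those lines (the process names here are just examples for our toolchain, not anything from the checkout action) might look like:

- name: Kill stray build processes (hypothetical pre-checkout step)
  shell: pwsh
  run: |
    # Assumption: leftover compiler/build processes from a cancelled job are
    # what hold locks inside the workspace; stop them by name before checkout.
    Get-Process -Name cc1plus, make, ninja -ErrorAction SilentlyContinue |
      Stop-Process -Force -ErrorAction SilentlyContinue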
Happened again in a big way today :(
@Ajaydip I tried your workaround and it didn't work for me; it always skips the action?
run-tests:
  name: xxxx
  runs-on: [self-hosted]
  timeout-minutes: 90
  strategy:
    fail-fast: false
    matrix:
      include: ${{fromJson(needs.scan-tests.outputs.matrix)}}
  steps:
    - uses: actions/checkout@v3
      id: checkout
      timeout-minutes: 10
      continue-on-error: true
    - name: Cleanup previously failed job
      run: |
        Remove-Item "${{env.GITHUB_WORKSPACE}}" -Force -Recurse -ErrorAction SilentlyContinue | Out-Null
        New-Item -ItemType Directory -Force -Path "${{env.GITHUB_WORKSPACE}}" | Out-Null
      if: ${{ steps.checkout.conclusion == 'failure' }}
    - uses: actions/checkout@v3
      if: ${{ steps.checkout.conclusion == 'failure' }}
I did modify it a little bit. I was hoping to be able to recover and run the rest of the pipeline unaffected, without having to put an if: ... on every step.
Edit: If you've done what I did above, you probably want to use outcome, not conclusion (https://docs.github.com/en/actions/learn-github-actions/contexts#steps-context). With continue-on-error: true on the checkout step, conclusion is reported as success even when the step actually failed, so the cleanup step gets skipped; outcome reflects the result before continue-on-error is applied.
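For reference, a minimal corrected version of the cleanup/retry steps (assuming the same step id "checkout" as above; I've also switched to the github.workspace expression) would be:

- name: Cleanup previously failed job
  if: ${{ steps.checkout.outcome == 'failure' }}   # outcome is evaluated before continue-on-error
  run: |
    Remove-Item "${{ github.workspace }}" -Force -Recurse -ErrorAction SilentlyContinue | Out-Null
    New-Item -ItemType Directory -Force -Path "${{ github.workspace }}" | Out-Null
- uses: actions/checkout@v3
  if: ${{ steps.checkout.outcome == 'failure' }}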
Any update on this? Does anyone have a working workaround?
@bryanjtc the workaround above works OK; just note my Edit about using 'outcome', not 'conclusion', for testing whether to retry.
2.5 years later and still not fixed. Also stumbled into this one.
Yes, and no longer interested in anyone fixing it.
Oh wow. At least that's honest, unlike the quarterly survey popups asking how satisfied I am with GitHub Actions.