`EphemeralRunner` stuck in failed state if the job it was allocated to is cancelled
Checks
- [x] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- [x] I am using charts that are officially provided
Controller Version
0.11.0
Deployment Method
ArgoCD
Checks
- [x] This isn't a question or user support case (For Q&A and community support, go to Discussions).
- [x] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
This is a subtle timing issue that is reproducible, I believe, when a GHA job is queued and quickly cancelled.
Describe the bug
An ephemeral runner that starts up and is assigned to a job that has already been cancelled sometimes ends up in a failed state.
The EphemeralRunner has this status:
```
$ kubectl describe ephemeralrunner/sculpt-ttqvx-runner-2s4qv
...
Status:
  Failures:
    2a9149f5-02da-475f-a15f-52f429182f60:  true
    4be0e5df-5c17-4770-9071-c68ae9723ac9:  true
    519cf001-8b1b-478c-b28a-3dc0d44b0109:  true
    7f0ee4e7-72bd-4324-81c9-b325dda1d029:  true
    a82f7aa6-975c-4437-af5c-8e7a2bfcf44a:  true
    c9c461d0-3815-4ec7-afd0-825e81ff0e23:  true
  Message:            Pod has failed to start more than 5 times:
  Phase:              Failed
  Ready:              false
  Reason:             TooManyPodFailures
  Runner Id:          41804
  Runner JIT Config:  <omitting>
  Runner Name:        sculpt-ttqvx-runner-2s4qv
Events:
```
Some logs from the pod that fails to start:
```
[RUNNER 2025-05-19 16:32:14Z ERR GitHubActionsService] GET request to https://broker.actions.githubusercontent.com/message?sessionId=<omitted>&status=Online&runnerVersion=2.324.0&os=Linux&architecture=X64&disableUpdate=true failed. HTTP Status: NotFound
[RUNNER 2025-05-19 16:32:14Z INFO Runner] Deleting Runner Session...
[RUNNER 2025-05-19 16:32:14Z ERR Terminal] WRITE ERROR: An error occurred: Runner not found
[RUNNER 2025-05-19 16:32:14Z ERR Listener] at GitHub.Actions.RunService.WebApi.BrokerHttpClient.GetRunnerMessageAsync(Nullable`1 sessionId, String runnerVersion, Nullable`1 status, String os, String architecture, Nullable`1 disableUpdate, CancellationToken cancellationToken)
Runner listener exit with terminated error, stop the service, no retry needed.
Exiting runner...
```
Describe the expected behavior
The EphemeralRunner should not enter a failed state in this case.
Additional Context
N/A
Controller Logs
https://gist.github.com/niodice/cc77fbf8ca7ec996c9b418c36f35d9d1
Runner Pod Logs
See above
Based on the number of 👍, this seems to be happening to other people as well. I also believe it has other negative impacts, such as causing subsequent workflows to queue and wait for a pending runner until the failed EphemeralRunner is deleted, which I've described in https://support.github.com/ticket/personal/0/3381328.
It would be great to get this triaged and looked at by a maintainer.
The issue of subsequent workflows getting queued because of the failed runner is related to #3953 and #3821. I'm just trying to link up related issues in the hope that maintainers see the scope of the problem.
Related: https://github.com/actions/runner/issues/3819
Hi, see my investigation results (a graph of failed EphemeralRunners and their correlation with the logs).
👋 coming over from https://github.com/actions/runner/issues/3857:
We've identified the issue in our API: we have a race condition where jobs assigned to ephemeral runners sometimes get cleaned up before the runner's last poll, and in that case we respond with a 404. That 404 is intended to reject polls from already-used ephemeral runners so they aren't considered for new jobs, but in these cases it happens prematurely.
We'll have a fix ready shortly; I'll update here when it's fully rolled out. Sorry about this, and thank you all for your patience ❤
Our fix is now fully rolled out, and new runners should stop failing with "Runner not found" errors (especially runners that ran a job that was canceled). Please let us know if you see any repeats, and thanks again for your patience ❤
Hi @rentziass, thanks for the fix.
We still see version 0.11.0 as the latest version in https://github.com/actions/actions-runner-controller/pkgs/container/actions-runner-controller-charts%2Fgha-runner-scale-set-controller . Is there any way to get the new version that includes the fix?
@karthiksedoc the fix for the 404s that were causing runners to fail (consistently for canceled jobs) happened on the API side of things. The next version of ARC will also include https://github.com/actions/actions-runner-controller/pull/4059, which should help whenever runners do fail.
@rentziass is this API-side fix a concern for people running ARC with GitHub Enterprise Server? For example, does the latest GHE 3.17.1 release contain anything that would help with this issue? I see a few Failed EphemeralRunners that I had to delete manually...
I have a similar issue at the moment, except that the runner is stuck in a Running state rather than a Failed one. If I cancel the job before the runner has been created, this results in a leak: the runner pod gets stuck seemingly indefinitely. Should this be a separate issue?
My team has been experiencing this for a few weeks as well. Currently we have some automation that automatically deletes runners that haven't acquired a job after ~15 minutes. It would be great to get this fixed properly upstream.
Can you share this automation, by any chance?
@ajschmidt8 please do share that script or automation you run. I'd be interested to learn how you do this.
Our solution is a non-trivial Golang application that we wrote and deployed to our cluster. It contains some code that's specific to our environment, so it's not something we can open source unfortunately. The good news is that this issue appears to have been fixed in a recent PR, so it should hopefully be resolved in the next release.
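For anyone who wants to build something similar, the general idea is just a small in-cluster loop that lists ARC's EphemeralRunner resources and deletes the ones that are Failed or have sat without a job for too long. Here is a rough, hypothetical sketch (not our actual code) using client-go's dynamic client; the `actions.github.com/v1alpha1` GVR matches the current ARC CRDs, but the `status.jobRequestId` field name and the 15-minute threshold are assumptions you should verify against your cluster.

```go
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

// GVR for ARC's EphemeralRunner custom resource (actions.github.com/v1alpha1).
var ephemeralRunnerGVR = schema.GroupVersionResource{
	Group:    "actions.github.com",
	Version:  "v1alpha1",
	Resource: "ephemeralrunners",
}

// maxIdleAge is an illustrative threshold, not a recommendation.
const maxIdleAge = 15 * time.Minute

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	for {
		cleanup(context.Background(), client)
		time.Sleep(time.Minute)
	}
}

func cleanup(ctx context.Context, client dynamic.Interface) {
	// Listing without a namespace returns EphemeralRunners across all namespaces.
	runners, err := client.Resource(ephemeralRunnerGVR).List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Printf("list failed: %v", err)
		return
	}
	for _, r := range runners.Items {
		phase, _, _ := unstructured.NestedString(r.Object, "status", "phase")
		// NOTE: status.jobRequestId is an assumption about the CRD schema;
		// check your installed CRD before relying on it.
		jobID, _, _ := unstructured.NestedInt64(r.Object, "status", "jobRequestId")
		age := time.Since(r.GetCreationTimestamp().Time)

		// Delete runners that ARC has marked Failed, or that never picked up
		// a job within the idle threshold.
		if phase != "Failed" && !(jobID == 0 && age > maxIdleAge) {
			continue
		}
		err := client.Resource(ephemeralRunnerGVR).
			Namespace(r.GetNamespace()).
			Delete(ctx, r.GetName(), metav1.DeleteOptions{})
		if err != nil {
			log.Printf("delete %s/%s failed: %v", r.GetNamespace(), r.GetName(), err)
			continue
		}
		log.Printf("deleted EphemeralRunner %s/%s (phase=%q, age=%s)",
			r.GetNamespace(), r.GetName(), phase, age)
	}
}
```

You'd run something like this as a small Deployment with a ServiceAccount allowed to list and delete `ephemeralrunners.actions.github.com`. With the API-side fix and the upstream ARC changes mentioned above, it should only ever be a stopgap.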