`EphemeralRunner` stuck in failed state if the job it was allocated to is cancelled
Checks
- [x] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- [x] I am using charts that are officially provided
Controller Version
0.11.0
Deployment Method
ArgoCD
Checks
- [x] This isn't a question or user support case (For Q&A and community support, go to Discussions).
- [x] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
This is a subtle timing issue that is reproducible, I believe, when a GHA job is queued and quickly cancelled.
Describe the bug
An ephemeral runner that starts up and is assigned to a job that has already been cancelled sometimes ends up in a failed state.
The EphemeralRunner has this status:
```
$ kubectl describe ephemeralrunner/sculpt-ttqvx-runner-2s4qv
...
Status:
  Failures:
    2a9149f5-02da-475f-a15f-52f429182f60:  true
    4be0e5df-5c17-4770-9071-c68ae9723ac9:  true
    519cf001-8b1b-478c-b28a-3dc0d44b0109:  true
    7f0ee4e7-72bd-4324-81c9-b325dda1d029:  true
    a82f7aa6-975c-4437-af5c-8e7a2bfcf44a:  true
    c9c461d0-3815-4ec7-afd0-825e81ff0e23:  true
  Message:            Pod has failed to start more than 5 times:
  Phase:              Failed
  Ready:              false
  Reason:             TooManyPodFailures
  Runner Id:          41804
  Runner JIT Config:  <omitting>
  Runner Name:        sculpt-ttqvx-runner-2s4qv
Events:
```
Some logs from the pod that fails to start:
```
[RUNNER 2025-05-19 16:32:14Z ERR GitHubActionsService] GET request to https://broker.actions.githubusercontent.com/message?sessionId=<omitted>&status=Online&runnerVersion=2.324.0&os=Linux&architecture=X64&disableUpdate=true failed. HTTP Status: NotFound
[RUNNER 2025-05-19 16:32:14Z INFO Runner] Deleting Runner Session...
[RUNNER 2025-05-19 16:32:14Z ERR Terminal] WRITE ERROR: An error occurred: Runner not found
[RUNNER 2025-05-19 16:32:14Z ERR Listener] at GitHub.Actions.RunService.WebApi.BrokerHttpClient.GetRunnerMessageAsync(Nullable`1 sessionId, String runnerVersion, Nullable`1 status, String os, String architecture, Nullable`1 disableUpdate, CancellationToken cancellationToken)
Runner listener exit with terminated error, stop the service, no retry needed.
Exiting runner...
```
Describe the expected behavior
The EphemeralRunner should not enter a failed state in this case.
Additional Context
N/A
Controller Logs
https://gist.github.com/niodice/cc77fbf8ca7ec996c9b418c36f35d9d1
Runner Pod Logs
See above
Based on the number of 👍, this seems to be happening to other people as well. I also believe it has other negative impacts, such as causing subsequent workflows to queue and wait for a pending runner until the failed EphemeralRunner is deleted, which I've described in https://support.github.com/ticket/personal/0/3381328.
It would be great to get this triaged and looked at by a maintainer.
The issue of subsequent workflows getting queued because of the failed runner is related to #3953 and #3821. I'm just trying to link up related issues in the hope that maintainers see the scope of the problem.
Related: https://github.com/actions/runner/issues/3819
Hi, see my investigation results (a graph of failed EphemeralRunners and their correlation with the logs).
👋 coming over from https://github.com/actions/runner/issues/3857:
We've identified the issue in our API: we have a race condition where jobs assigned to ephemeral runners sometimes get cleaned up before the runner's last poll, and in that case we respond with a 404. That 404 is intended to reject polls from already-used ephemeral runners so they aren't considered for new jobs, but in these cases it happens prematurely.
We'll have a fix ready shortly; I'll update here when it's fully rolled out. Sorry about this, and thank you all for your patience ❤
Our fix is now fully rolled out, and new runners should stop failing with "Runner not found" errors (especially runners that ran a job that was canceled). Please let us know if you see any repeats, and thanks again for your patience ❤
Hi @rentziass, thanks for the fix.
We still see version 0.11.0 as the latest version in https://github.com/actions/actions-runner-controller/pkgs/container/actions-runner-controller-charts%2Fgha-runner-scale-set-controller . Is there any way to get the new version that includes the fix?
@karthiksedoc the fix for the 404s that were causing runners to fail (consistently for canceled jobs) happened on the API side of things. The next version of ARC will also include https://github.com/actions/actions-runner-controller/pull/4059, which should help whenever runners do fail.
@rentziass is this API-side fix a concern for people running ARC with GitHub Enterprise Server? For example, does the latest GHE 3.17.1 release contain anything that would help with this issue? I see a few Failed EphemeralRunners that I had to delete manually...
I have a similar issue at the moment, except that the runner is stuck in a Running state rather than a Failed one. If I cancel the job before the runner has been created, this results in a leak: the runner pod gets stuck seemingly indefinitely. Should this be a separate issue?
My team has been experiencing this for a few weeks as well. Currently we have some automation that automatically deletes runners that haven't acquired a job after ~15 minutes. It would be great to get this fixed properly upstream.
Can you share this automation, by any chance?
@ajschmidt8 please do share that script or automation you run. I'd be interested to learn how you do this.
Our solution is a non-trivial Golang application that we wrote and deployed to our cluster. It contains some code that's specific to our environment, so it's not something we can open source unfortunately. The good news is that this issue appears to have been fixed in a recent PR, so it should hopefully be resolved in the next release.
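For anyone who wants to build something similar, the general idea is just a small in-cluster loop that lists ARC's EphemeralRunner resources and deletes the ones that are Failed or have sat without a job for too long. Here is a rough, hypothetical sketch (not our actual code) using client-go's dynamic client; the `actions.github.com/v1alpha1` GVR matches the current ARC CRDs, but the `status.jobRequestId` field name and the 15-minute threshold are assumptions you should verify against your cluster.

```go
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

// GVR for ARC's EphemeralRunner custom resource (actions.github.com/v1alpha1).
var ephemeralRunnerGVR = schema.GroupVersionResource{
	Group:    "actions.github.com",
	Version:  "v1alpha1",
	Resource: "ephemeralrunners",
}

// maxIdleAge is an illustrative threshold, not a recommendation.
const maxIdleAge = 15 * time.Minute

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	for {
		cleanup(context.Background(), client)
		time.Sleep(time.Minute)
	}
}

func cleanup(ctx context.Context, client dynamic.Interface) {
	// Listing without a namespace returns EphemeralRunners across all namespaces.
	runners, err := client.Resource(ephemeralRunnerGVR).List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Printf("list failed: %v", err)
		return
	}
	for _, r := range runners.Items {
		phase, _, _ := unstructured.NestedString(r.Object, "status", "phase")
		// NOTE: status.jobRequestId is an assumption about the CRD schema;
		// check your installed CRD before relying on it.
		jobID, _, _ := unstructured.NestedInt64(r.Object, "status", "jobRequestId")
		age := time.Since(r.GetCreationTimestamp().Time)

		// Delete runners that ARC has marked Failed, or that never picked up
		// a job within the idle threshold.
		if phase != "Failed" && !(jobID == 0 && age > maxIdleAge) {
			continue
		}
		err := client.Resource(ephemeralRunnerGVR).
			Namespace(r.GetNamespace()).
			Delete(ctx, r.GetName(), metav1.DeleteOptions{})
		if err != nil {
			log.Printf("delete %s/%s failed: %v", r.GetNamespace(), r.GetName(), err)
			continue
		}
		log.Printf("deleted EphemeralRunner %s/%s (phase=%q, age=%s)",
			r.GetNamespace(), r.GetName(), phase, age)
	}
}
```

You'd run something like this as a small Deployment with a ServiceAccount allowed to list and delete `ephemeralrunners.actions.github.com`. With the API-side fix and the upstream ARC changes mentioned above, it should only ever be a stopgap.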