Runner gets stuck in "Failed" state indefinitely (0.12.1)
Checks
- [x] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- [x] I am using charts that are officially provided
Controller Version
0.12.1
Deployment Method
Helm
Checks
- [x] This isn't a question or user support case (For Q&A and community support, go to Discussions).
- [x] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
1. We are seeing this issue when a cluster node evicts the pod due to resource pressure.
2. Runner pods fail with "Pod was rejected: Node didn't have enough resources: pods, requested: 1, used: 16, capacity: 16" because the node pool does not have enough resources.
3. The failed runner pods hang around as pending runners in the EphemeralRunnerSet/EphemeralRunner.
Describe the bug
The runner gets stuck in the "Failed" state indefinitely after failing during node pool scaling:
Describe the expected behavior
They should be cleared from the AutoscalingRunnerSet/EphemeralRunnerSet/EphemeralRunner so that offline runners are also removed from the GitHub UI.
See pending runners:
Additional Context
None
Controller Logs
None
Runner Pod Logs
None
We are using the script below to clean up hanging runners so that new runners can spin up:
```bash
kubectl get ephemeralrunners -n arc-runners -o json | jq -r '.items[] | select( .status.phase == null and .status.jobRepositoryName == null) | .metadata.name' | xargs -I {} kubectl delete ephemeralrunner -n arc-runners {}
```
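For reference, the `jq` filter above selects EphemeralRunner resources whose `.status.phase` and `.status.jobRepositoryName` are both unset (i.e. the stuck runners described above) and deletes them, which lets the scale set bring up replacements. This is only a workaround, not a fix.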
@rajesh-dhakad, can you share the logs of the controller and runner pods?
The runner pod failed because the node did not have sufficient resources:
Pod status: it remains this way indefinitely, and the controller does not scale up another runner because it treats the pod as still waiting.
Pod logs: there are none, as the pod was removed from the node.
Controller logs: Logs-2025-07-08 11_29_21.txt
@ajschmidt8 @nikola-jokic - Those failed runner pods are counted toward maxReplica, so no additional runners are scaled up even when jobs are waiting. If we delete those runners manually, runner pods are scaled up again.
What retry strategy or wait time does the runner have?
@ajschmidt8 - Would you happen to need more details?
In both scenarios below, the pod remains "Failed" and is counted against "currentRunnerCount," preventing new pods from being scheduled.
Scenario 1: the pod is evicted by the node due to a resource crunch.
Scenario 2: the pod was assigned to a node by the scheduler but was never admitted, because a sudden spike in resource utilization caused the kubelet to reject it.
Is there any way to remove those failed runners, or to retry after some time, so that they are cleared from the pending runners in the EphemeralRunnerSet/EphemeralRunner?
Controller logs: {"container":"manager","level":"info","pod":"arc-controller-5578d49d76-pv5fp","_entry":"{"severity":"info","ts":"2025-07-08T05:56:18Z","logger":"EphemeralRunner","message":"Waiting for runner container status to be available","version":"0.12.1","ephemeralrunner":{"name":"general-high-grwjh-runner-4xmmm","namespace":"arc-runners"}}"}
This seems related to the following issues as well:
- https://github.com/actions/actions-runner-controller/issues/4148
- https://github.com/actions/actions-runner-controller/issues/4155
@ajschmidt8 - Would you happen to need more details?
I'm not sure what the issue here is (I'm not a maintainer on this repository). I was just trying to ensure the maintainers have all the necessary information to look into this problem when they have time.
@nikola-jokic - could you please help with this?
Hi everyone! I was reviewing issues since 0.12.1 and found this one. Very detailed report! Much appreciated @rajesh-dhakad!
AFAIK, the log `Waiting for runner container status to be available` indicates that the pod had no container status for the runner container, even though that status is required for the ephemeral runner controller to move on.
A potential fix would be to somehow let the ephemeral runner controller move on even when the pod never reports the runner container status. For example, a "timeout" of something like 10m for the container status to appear, after which the ephemeral runner controller proactively deletes the runner pod, would let it recreate the pod in the hope that it gets scheduled and becomes up and running, unsticking it.
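To make the idea concrete, here is a minimal, hypothetical sketch of that timeout logic. It is not the actual EphemeralRunner controller code; the 10-minute value, the helper names, and the use of the pod creation timestamp as the starting point are all assumptions.

```go
// Hypothetical sketch, not ARC controller code.
package sketch

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// runnerContainerStatusTimeout is an assumed grace period to wait for the
// "runner" container status before giving up on the pod.
const runnerContainerStatusTimeout = 10 * time.Minute

// hasRunnerContainerStatus reports whether the pod exposes a container
// status for the container named "runner".
func hasRunnerContainerStatus(pod *corev1.Pod) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name == "runner" {
			return true
		}
	}
	return false
}

// reconcileMissingRunnerStatus keeps requeueing while waiting for the runner
// container status, and deletes the pod once the timeout has passed so the
// controller can recreate it and retry scheduling.
func reconcileMissingRunnerStatus(ctx context.Context, c client.Client, pod *corev1.Pod) (ctrl.Result, error) {
	if hasRunnerContainerStatus(pod) {
		// Status is available; the normal reconciliation path can continue.
		return ctrl.Result{}, nil
	}
	if time.Since(pod.CreationTimestamp.Time) > runnerContainerStatusTimeout {
		// Timed out waiting: delete the stuck pod so a fresh one can be created.
		if err := c.Delete(ctx, pod); client.IgnoreNotFound(err) != nil {
			return ctrl.Result{}, err
		}
		return ctrl.Result{}, nil
	}
	// Still within the grace period: check again shortly.
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}
```

A real implementation would probably track when the controller first observed the missing status (e.g. in the EphemeralRunner status) rather than relying on the pod's creation timestamp, but the shape of the fix would be similar.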
BTW: Sorry if I'm correlating irrelevant PRs, but if https://github.com/actions/actions-runner-controller/pull/4152 was supposed to fix this issue, it won't: controller-runtime should ignore `Requeue` when `RequeueAfter` is specified.
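For illustration only (this is not ARC code), this is the controller-runtime behaviour I am referring to, as I understand it: when a reconcile result sets `RequeueAfter`, the request is re-queued after that delay and the bare `Requeue` flag adds nothing on top.

```go
// Illustration of the Result semantics described above; not ARC code.
package sketch

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// delayedRetry is re-queued after one minute; also setting Requeue does not
// make the request retry immediately.
func delayedRetry() ctrl.Result {
	return ctrl.Result{Requeue: true, RequeueAfter: time.Minute}
}

// immediateRetry is re-queued right away (rate limited), because
// RequeueAfter is left at zero.
func immediateRetry() ctrl.Result {
	return ctrl.Result{Requeue: true}
}
```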
It is also worth noting that this stems from a design choice in Kubernetes: the scheduler and the kubelet form a distributed system and work independently, so the scheduler can decide to put a pod on an inappropriate node, and you need to either (1) recreate the pod to redo the scheduling or (2) use a k8s Job instead of a bare pod to automate that.
Another potential fix is modifying the autoscaling runner set controller to EXCLUDE pods that exceeded the timeout without a non-empty runner container status. The legacy actions-runner-controller did something very similar, and it worked well for me, as it allowed ARC to bring up replacement runners even though some runners were stuck pending/failing. The downside of that approach was that it could leave pending resources behind (as far as I remember).
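Roughly, and again only as a hypothetical sketch (the function name and the notion of a "usable" pod are assumptions, not ARC's actual counting logic), the idea would be something like:

```go
// Hypothetical sketch of excluding stuck pods from the runner count.
package sketch

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// countUsableRunnerPods counts pods that either already report a status for
// the "runner" container or are still within the grace period, so pods stuck
// past the timeout no longer occupy a replica slot.
func countUsableRunnerPods(pods []corev1.Pod, timeout time.Duration) int {
	usable := 0
	for i := range pods {
		pod := &pods[i]
		hasStatus := false
		for _, cs := range pod.Status.ContainerStatuses {
			if cs.Name == "runner" {
				hasStatus = true
				break
			}
		}
		if hasStatus || time.Since(pod.CreationTimestamp.Time) <= timeout {
			usable++
		}
	}
	return usable
}
```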
That said, fixing the ephemeral runner controller is probably the best place to start. Hoping I'm not missing anything and this makes sense 😄
I agree with @mumoshu (great summary!)
We basically need to be smarter in the controller to determine when the pod is stuck and recreate it. The root cause of the issue is the cluster itself not having enough resources, but the controller should know how to recover from that.
Hello, we are running into this issue as well. I have a PR - #4272 that takes a stab at resolving this.
I'm still running into this with the latest runner version - can anyone else confirm they are also still seeing this?