
Failed EphemeralRunners block launching new pods

Open · igaskin opened this issue 1 year ago · 3 comments

Checks

  • [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
  • [X] I am using charts that are officially provided

Controller Version

0.8.3

Deployment Method

Helm

Checks

  • [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Trigger a `FailedScheduling` event (one way to force this is sketched below this list).
2. Wait for 5 failures in pod scheduling.
3. Recover the cluster.
4. New ephemeral runner pods will not be scheduled to meet capacity.
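
For step 1, a minimal sketch of one way to force the `FailedScheduling` events (an assumption for illustration; the report does not say how the events were originally triggered, and any condition that keeps the runner pods unschedulable should behave the same):

# Cordon every node so that newly created runner pods repeatedly hit
# FailedScheduling events (assumption: anything that keeps the pods Pending
# long enough trips the same failure counter).
kubectl get nodes -o name | xargs -r -n1 kubectl cordon

# ...queue a few workflow runs so the scale set tries to create runner pods...

# Step 3: "recover" the cluster once the EphemeralRunners have been marked Failed.
kubectl get nodes -o name | xargs -r -n1 kubectl uncordon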

Describe the bug

When EphemeralRunners enter the Failed state they get stuck there, which prevents new runner pods from being launched. This issue has previously been noted in these discussions:

status:
  currentRunners: 17
  failedEphemeralRunners: 16
  pendingEphemeralRunners: 0
  runningEphemeralRunners: 1 

https://github.com/actions/actions-runner-controller/discussions/3300
https://github.com/actions/actions-runner-controller/discussions/3610
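
A minimal sketch of how the stuck state can be inspected (assuming the gha-runner-scale-set CRDs in the `actions.github.com` API group, that the counters shown above come from the AutoscalingRunnerSet status, and the namespace from the logs below; adjust names to your installation):

# AutoscalingRunnerSet status, where the counters above should be visible.
kubectl get autoscalingrunnersets -n my-scaleset-ns -o yaml

# Individual EphemeralRunners; the failed ones stay behind and are never replaced.
kubectl get ephemeralrunners -n my-scaleset-ns
kubectl describe ephemeralrunner <name> -n my-scaleset-ns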

Describe the expected behavior

Failed EphemeralRunners should be cleared so that scheduling can be retried.

Additional Context

https://github.com/actions/actions-runner-controller/discussions/3610
https://github.com/actions/actions-runner-controller/discussions/3300

Controller Logs

2024-06-20T19:18:03Z	INFO	listener-app.worker.kubernetesworker	Ephemeral runner set scaled.	{"namespace": "my-scaleset-ns", "name": "my-runner-6pzbd", "replicas": 3}
2024-06-20T19:18:03Z	INFO	listener-app.listener	Getting next message	{"lastMessageID": 11}
2024-06-20T19:18:11Z	INFO	listener-app.listener	Getting next message	{"lastMessageID": 14}
2024-06-20T19:18:53Z	INFO	listener-app.listener	Getting next message	{"lastMessageID": 11}
2024-06-20T19:19:01Z	INFO	listener-app.listener	Getting next message	{"lastMessageID": 14}

Runner Pod Logs

2024-06-21T16:22:44Z	INFO	listener-app.worker.kubernetesworker	Ephemeral runner set scaled.	{"namespace": "my-scaleset", "name": "my-runner-rpvp2", "replicas": 10}

igaskin · Jul 26 '24 21:07

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

github-actions[bot] · Jul 26 '24 21:07

This happened to me recently as well when I upgraded to 0.9.3 with a GitHub App. In my case, all the ephemeral runners were stuck in a Terminating state.

singlewind · Jul 30 '24 00:07

I've also noticed that failed EphemeralRunners consume min-runners slots. We have to schedule cleanup of these resources, otherwise we see queue-time degradation because too few min-runners are actually available.

shanesavoie · Sep 27 '24 16:09
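
A minimal sketch of the kind of scheduled cleanup described above, as a stop-gap workaround (assuming `jq` is available and that the EphemeralRunner CRD reports `status.phase: Failed`; note that deleting a failed runner discards whatever diagnostic value it had, per the maintainer comment further down):

# Delete every EphemeralRunner stuck in the Failed phase so the scale set can
# create replacements; run this periodically, e.g. from a CronJob.
NS=my-scaleset-ns
kubectl get ephemeralrunners -n "$NS" -o json \
  | jq -r '.items[] | select(.status.phase == "Failed") | .metadata.name' \
  | xargs -r -n1 kubectl delete ephemeralrunner -n "$NS"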

Pretty sure we are running into this issue even with 0.10.1.

avadhanij · Apr 10 '25 06:04

In short, runner pods in a failed state need to be investigated, as they could indicate underlying cluster-related problems that might otherwise go unnoticed. That is why they were not cleared automatically by the operator.

In an upcoming release, we'll revisit this behaviour to require the least amount of intervention, especially manual intervention.

Link- · Apr 21 '25 09:04

> In short, runner pods in a failed state need to be investigated, as they could indicate underlying cluster-related problems that might otherwise go unnoticed. That is why they were not cleared automatically by the operator.
>
> In an upcoming release, we'll revisit this behaviour to require the least amount of intervention, especially manual intervention.

Hi @Link-, is there a schedule for the release that fixes this issue?

pkking · May 12 '25 09:05

A couple of weeks, tops. The PR with the change is being merged as we speak.

Link- · May 14 '25 12:05

@nikola-jokic is a release being created soon? We are experiencing this issue.

clarker-anz · Jun 06 '25 01:06

Hey @clarker-anz, the next release should be next week.

nikola-jokic · Jun 06 '25 09:06