Runners are being removed for being idle before a job has had a chance to be assigned to them
Checks
- [x] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- [x] I am using charts that are officially provided
Controller Version
0.10.1 and later
Deployment Method
Helm
Checks
- [x] This isn't a question or user support case (For Q&A and community support, go to Discussions).
- [x] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
To reproduce, simply deploy the scale sets as normal (I was using the Quickstart Guide) and begin running jobs. No change was made to our Kubernetes cluster or to the Docker images we use for the runners before this bug began.
Describe the bug
Ephemeral runners are brought up correctly and begin advertising themselves to the repository/organisation as expected; however, if a job hasn't started running on them within about 10 seconds, ARC kills the runners because it thinks they're idle.
While the data below refers to Windows runners, I've also observed the issue on our Ubuntu runners, just much less frequently (around 5-10% of the time).
Describe the expected behavior
The controller should wait a bit longer before killing runners for being idle. The fact that jobs are assigned correctly roughly 50% of the time implies there's a small threshold being missed somewhere along the line. Unfortunately I can't control how long GitHub takes to recognise that a free runner has come online, but it would help if the controller didn't kill a runner for being apparently idle as little as 10 seconds after it was created.
Additional Context
githubConfigUrl: https://github.com/redacted
githubConfigSecret: redacted
runnerGroup: redacted
minRunners: 1
template:
  spec:
    containers:
      - name: runner
        image: redacted
        command: ["run.cmd"]
    serviceAccountName: redacted
    nodeSelector: # Ensures the pods can only run on nodes that have this label
      runner-os: windows
      iam.gke.io/gke-metadata-server-enabled: "true"
    tolerations: # Ensures that the pods can only run on nodes that have this taint
      - key: runners-fooding
        operator: Equal
        value: "true"
        effect: NoSchedule
      - key: node.kubernetes.io/os
        operator: Equal
        value: "windows"
        effect: NoSchedule
Controller Logs
https://gist.github.com/JohnLBergqvist/46553ba6043449e704af88f1a706228e
Runner Pod Logs
Logs:
√ Connected to GitHub
Current runner version: '2.323.0'
2025-03-27 20:27:35Z: Listening for Jobs
Describe output
Name: redacted-m2xmj-runner-2sb5k
Namespace: arc-runners
Priority: 0
Service Account: redacted
Node: gke-49a8bb-scng/10.128.0.10
Start Time: Thu, 27 Mar 2025 20:23:24 +0000
Labels: actions-ephemeral-runner=True
actions.github.com/organization=redacted
actions.github.com/scale-set-name=redacted
actions.github.com/scale-set-namespace=arc-runners
app.kubernetes.io/component=runner
app.kubernetes.io/instance=redacted
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=redacted
app.kubernetes.io/part-of=gha-runner-scale-set
app.kubernetes.io/version=0.11.0
helm.sh/chart=gha-rs-0.11.0
pod-template-hash=79798d59cd
Annotations: actions.github.com/patch-id: 0
actions.github.com/runner-group-name: Cover
actions.github.com/runner-scale-set-name: redacted
actions.github.com/runner-spec-hash: 78d4b6447
Status: Terminating (lasts <invalid>)
Termination Grace Period: 30s
IP: 10.36.2.11
IPs:
IP: 10.36.2.11
Controlled By: EphemeralRunner/redacted-m2xmj-runner-2sb5k
Containers:
runner:
Container ID: containerd://redacted
Image: redacted
Image ID: redacted@sha256:redacted
Port: <none>
Host Port: <none>
Command:
run.cmd
State: Running
Started: Thu, 27 Mar 2025 20:27:30 +0000
Ready: True
Restart Count: 0
Requests:
cpu: 2
memory: 10Gi
Environment:
ACTIONS_RUNNER_INPUT_JITCONFIG: <set to the key 'jitToken' in secret 'redacted-m2xmj-runner-2sb5k'> Optional: false
GITHUB_ACTIONS_RUNNER_EXTRA_USER_AGENT: actions-runner-controller/0.11.0
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-clv4p (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-clv4p:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: iam.gke.io/gke-metadata-server-enabled=true
runner-os=windows
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/os=windows:NoSchedule
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
runners-fooding=true:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m21s default-scheduler Successfully assigned arc-runners/redacted-m2xmj-runner-2sb5k to gke-49a8bb-scng
Normal Pulling 4m19s kubelet Pulling image "redacted"
Normal Pulled 18s kubelet Successfully pulled image "redacted" in 4m1.518s (4m1.518s including waiting). Image size: 3372778201 bytes.
Normal Created 18s kubelet Created container: runner
Normal Started 15s kubelet Started container runner
Normal Killing 5s kubelet Stopping container runner
The way I understand it is that when a job is available, that's when the listener updates the desired replicas and the runner is created.
It's not the other way around, where a runner pod sits idle and waits to pick up a job, unless you're setting the minRunners field to preemptively scale pods, and even in that case I don't see this behaviour.
The way I understand it is that when a job is available, that's when the listener updates the desired replicas and the runner is created.
This is correct; however, the runner is only active for about 5 seconds before being killed by the listener, before a job has started running on it.
What's happening is this:
Listener: 2025-03-27T20:19:43Z Creating new ephemeral runners (scale up)
GitHub job status: waiting for runner to become available
Listener: 2025-03-27T20:27:30Z Updating ephemeral runner status "ready": true
Runner: 2025-03-27 20:27:35Z: Listening for Jobs
Listener: 2025-03-27T20:27:40Z Removing the idle ephemeral runner
So yes, the runner pod does briefly sit idle for a few seconds before a job is assigned to it; the problem is that the listener process is killing the pod a little too soon.
- When a job is scheduled and none of the runners of that type are available, the job sits in a queued state with the job page saying "Waiting for a runner matching [runner-group] to become available".
- In the meantime, the actions-runner-controller scales up to create the number of ephemeral runners needed.
- Those runners then come online, and the job begins because a matching runner is now available. However, in my case the listener kills the runner before the job has had a chance to start on it, because it thinks the runner is sitting idle (see the sketch after this list).
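For reference, the race can be observed with something like the following (a sketch only; the namespaces are the quickstart defaults, the label value comes from the describe output below, and the listener pod name is a placeholder):
# Watch the ephemeral runner pods for this scale set appear and get removed
kubectl -n arc-runners get pods -l actions.github.com/scale-set-name=redacted -w
# Follow the listener logs to catch the "Creating new ephemeral runners" and
# "Removing the idle ephemeral runner" lines quoted above
kubectl -n arc-systems get pods
kubectl -n arc-systems logs -f <listener-pod-name>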
For example, I can see that the listener brings a pod online at 09:49:52:
{
"lastProbeTime": null,
"lastTransitionTime": "2025-03-31T09:49:52Z",
"status": "True",
"type": "Ready"
}
]
That is the point at which the listener sees the runner as ready.
However, there is a further delay of about 5 seconds before the runner itself connects to GitHub and begins listening for jobs, and a few more seconds pass before the job begins running on that runner.
√ Connected to GitHub
Current runner version: '2.323.0'
2025-03-31 09:49:57Z: Listening for Jobs
2025-03-31 09:50:08Z: Running job
Maybe the listener should check that GitHub has successfully registered the self-hosted runner before it classes it as running?
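For reference, one way to cross-check whether GitHub has registered the runner while the pod is still up is the self-hosted runners REST endpoint (a sketch; org/owner/repo names are placeholders and the token needs the relevant admin scope):
# Organisation-level runners
gh api orgs/<org>/actions/runners --jq '.runners[] | {name, status, busy}'
# Repository-level runners
gh api repos/<owner>/<repo>/actions/runners --jq '.runners[] | {name, status, busy}'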
What happens when you set minRunners in your Helm release to a non-zero value (let's say 10) and then run jobs? Are your jobs being picked up by those pods?
edit: Also, I'm not from GitHub; I'm building ephemeral runners too, and this is not an issue for me, so I'm just trying to help out.
What happens when you set minRunners in your Helm release to a non-zero value (let's say 10) and then run jobs? Are your jobs being picked up by those pods?
Yes they are.
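For reference, a minimal sketch of that workaround, assuming the gha-runner-scale-set chart from the quickstart (release name and value are placeholders): either raise minRunners in the values file, e.g. minRunners: 10, or set it on an existing release:
helm upgrade <runner-scale-set-release> \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --namespace arc-runners \
  --reuse-values \
  --set minRunners=10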
@JohnLBergqvist I believe we're experiencing this as well and it started randomly in the last couple weeks after not touching anything related to this for many months. Did you ever figure out any workaround aside from forcing a pod to always be available?
@andrewbeckman Unfortunately not. I've noticed it seems to happen more if only a single job is queued after a period of relative inactivity. If multiple jobs are queued in a short space of time then there's a higher chance that more of them will schedule correctly - perhaps because the controller's main loop takes longer to finish, giving the runners more breathing room to accept a job?
I believe we are also experiencing this issue. Have not tried the workaround of setting minRunners > 0.
In our case the behavior on the GitHub UI side is that jobs randomly show as "Canceled", despite nobody canceling them and no subsequent pushes to the PR that would cancel the job.
I'm curious about these lines in your log snippet (we are seeing the same):
EphemeralRunner Checking if runner exists in GitHub service ...
EphemeralRunner Runner does not exist in GitHub service
Rather than a race condition between the runner controller and the runner pod, that could point to an issue with the GitHub Actions service API not reporting status correctly; in that case both the runner controller and the runner pod itself are behaving "correctly", in the sense that there's nothing on the GitHub service side for them to act on.
That would also jibe with the fact that this has popped up in the last couple weeks despite no apparent local changes.
In other words, this could be a problem with the GH Actions service, not with the ARC project.
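For reference, one way to see what the controller recorded about a runner around the time it was removed is to inspect the EphemeralRunner resource named in the "Controlled By" line of the describe output above (a sketch; the resource name is a placeholder):
kubectl -n arc-runners get ephemeralrunners
kubectl -n arc-runners describe ephemeralrunner <name>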
In other words, this could be a problem with the GH Actions service, not with the ARC project.
Yes, but ARC should take this into account if this is the new default behaviour for GitHub Actions itself going forward.
This bug is still happening for us as of today. @nikola-jokic, can you provide any update on this? It's preventing us from using the controller effectively.