Runners are being removed for being idle before a job has had a chance to be assigned to them
Checks
- [x] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- [x] I am using charts that are officially provided
Controller Version
0.10.1 and later
Deployment Method
Helm
Checks
- [x] This isn't a question or user support case (For Q&A and community support, go to Discussions).
- [x] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
To reproduce, simply deploy the scale sets as normal (I was using the Quickstart Guide) and begin running jobs. No change was made to our Kubernetes cluster or to the Docker images we use for the runners before this bug began.
Describe the bug
Ephemeral runners are brought up correctly and begin advertising themselves to the repository/organisation as expected; however, if a job hasn't started running on them within about 10 seconds, ARC kills the runners because it thinks they're idle.
While the data below refers to Windows runners, I've also observed the issue on our Ubuntu runners, just much less frequently (around 5-10% of the time).
Describe the expected behavior
The controller should wait a bit longer before killing runners for being idle. The fact that jobs are assigned correctly roughly 50% of the time implies there's a small threshold being missed somewhere along the line. Unfortunately I can't control how long GitHub takes to recognise that a free runner has come online, but it would help if the controller didn't kill a runner for being apparently idle as little as 10 seconds after it was created.
Additional Context
githubConfigUrl: https://github.com/redacted
githubConfigSecret: redacted
runnerGroup: redacted
minRunners: 1
template:
  spec:
    containers:
      - name: runner
        image: redacted
        command: ["run.cmd"]
    serviceAccountName: redacted
    nodeSelector: # Ensures the pods can only run on nodes that have this label
      runner-os: windows
      iam.gke.io/gke-metadata-server-enabled: "true"
    tolerations: # Ensures that the pods can only run on nodes that have this taint
      - key: runners-fooding
        operator: Equal
        value: "true"
        effect: NoSchedule
      - key: node.kubernetes.io/os
        operator: Equal
        value: "windows"
        effect: NoSchedule
Controller Logs
https://gist.github.com/JohnLBergqvist/46553ba6043449e704af88f1a706228e
Runner Pod Logs
Logs:
√ Connected to GitHub
Current runner version: '2.323.0'
2025-03-27 20:27:35Z: Listening for Jobs
Describe output
Name: redacted-m2xmj-runner-2sb5k
Namespace: arc-runners
Priority: 0
Service Account: redacted
Node: gke-49a8bb-scng/10.128.0.10
Start Time: Thu, 27 Mar 2025 20:23:24 +0000
Labels: actions-ephemeral-runner=True
actions.github.com/organization=redacted
actions.github.com/scale-set-name=redacted
actions.github.com/scale-set-namespace=arc-runners
app.kubernetes.io/component=runner
app.kubernetes.io/instance=redacted
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=redacted
app.kubernetes.io/part-of=gha-runner-scale-set
app.kubernetes.io/version=0.11.0
helm.sh/chart=gha-rs-0.11.0
pod-template-hash=79798d59cd
Annotations: actions.github.com/patch-id: 0
actions.github.com/runner-group-name: Cover
actions.github.com/runner-scale-set-name: redacted
actions.github.com/runner-spec-hash: 78d4b6447
Status: Terminating (lasts <invalid>)
Termination Grace Period: 30s
IP: 10.36.2.11
IPs:
IP: 10.36.2.11
Controlled By: EphemeralRunner/redacted-m2xmj-runner-2sb5k
Containers:
runner:
Container ID: containerd://redacted
Image: redacted
Image ID: redacted@sha256:redacted
Port: <none>
Host Port: <none>
Command:
run.cmd
State: Running
Started: Thu, 27 Mar 2025 20:27:30 +0000
Ready: True
Restart Count: 0
Requests:
cpu: 2
memory: 10Gi
Environment:
ACTIONS_RUNNER_INPUT_JITCONFIG: <set to the key 'jitToken' in secret 'redacted-m2xmj-runner-2sb5k'> Optional: false
GITHUB_ACTIONS_RUNNER_EXTRA_USER_AGENT: actions-runner-controller/0.11.0
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-clv4p (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-clv4p:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: iam.gke.io/gke-metadata-server-enabled=true
runner-os=windows
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/os=windows:NoSchedule
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
runners-fooding=true:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m21s default-scheduler Successfully assigned arc-runners/redacted-m2xmj-runner-2sb5k to gke-49a8bb-scng
Normal Pulling 4m19s kubelet Pulling image "redacted"
Normal Pulled 18s kubelet Successfully pulled image "redacted" in 4m1.518s (4m1.518s including waiting). Image size: 3372778201 bytes.
Normal Created 18s kubelet Created container: runner
Normal Started 15s kubelet Started container runner
Normal Killing 5s kubelet Stopping container runner
The way I understand it is that when a job is available, that's when the listener updates the desired replicas and the runner is created.
It's not the other way around, where a runner pod sits idle and waits to pick up a job, unless you're setting the minRunners field to preemptively scale pods, and even in that case I don't see this behaviour.
The way I understand it is that when a job is available, that's when the listener updates the desired replicas and the runner is created.
This is correct; however, the runner is only active for about 5 seconds before being killed by the listener, before a job has started running on it.
What's happening is this:
Listener: 2025-03-27T20:19:43Z Creating new ephemeral runners (scale up)
GitHub job status: waiting for runner to become available
Listener: 2025-03-27T20:27:30Z Updating ephemeral runner status "ready": true
Runner: 2025-03-27 20:27:35Z: Listening for Jobs
Listener: 2025-03-27T20:27:40Z Removing the idle ephemeral runner
So yes, the runner pod does briefly sit idle for a few seconds before a job is assigned to it; the problem is that the listener process is killing the pod a little too soon.
- When a job is scheduled and none of the runners of that type are available, the job sits in a queued state with the job page saying "Waiting for a runner matching [runner-group] to become available".
- In the meantime, the actions-runner-controller scales up to create the number of ephemeral runners needed.
- Those runners then come online, and the job begins because a matching runner is now available. However, in my case the listener kills the runner before the job has had a chance to start on it, because it thinks the runner is sitting idle (see the sketch after this list).
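For reference, the race can be observed with something like the following (a sketch only; the namespaces are the quickstart defaults, the label value comes from the describe output below, and the listener pod name is a placeholder):
# Watch the ephemeral runner pods for this scale set appear and get removed
kubectl -n arc-runners get pods -l actions.github.com/scale-set-name=redacted -w
# Follow the listener logs to catch the "Creating new ephemeral runners" and
# "Removing the idle ephemeral runner" lines quoted above
kubectl -n arc-systems get pods
kubectl -n arc-systems logs -f <listener-pod-name>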
For example, I can see that the listener brings a pod online at 09:49:52:
{
"lastProbeTime": null,
"lastTransitionTime": "2025-03-31T09:49:52Z",
"status": "True",
"type": "Ready"
}
]
That is the point at which the listener sees the runner as ready.
However, there is a further delay of about 5 seconds before the runner itself connects to GitHub and begins listening for jobs, and a few more seconds pass before the job begins running on that runner.
√ Connected to GitHub
Current runner version: '2.323.0'
2025-03-31 09:49:57Z: Listening for Jobs
2025-03-31 09:50:08Z: Running job
Maybe the listener should check that GitHub has successfully registered the self-hosted runner before it classes it as running?
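For reference, one way to cross-check whether GitHub has registered the runner while the pod is still up is the self-hosted runners REST endpoint (a sketch; org/owner/repo names are placeholders and the token needs the relevant admin scope):
# Organisation-level runners
gh api orgs/<org>/actions/runners --jq '.runners[] | {name, status, busy}'
# Repository-level runners
gh api repos/<owner>/<repo>/actions/runners --jq '.runners[] | {name, status, busy}'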
What happens when you set minRunners in your Helm release to a non-zero value (let's say 10) and then run jobs? Are your jobs being picked up by those pods?
edit: Also, I'm not from GitHub; I'm building ephemeral runners too, and this is not an issue for me, so I'm just trying to help out.
What happens when you set minRunners in your Helm release to a non-zero value (let's say 10) and then run jobs? Are your jobs being picked up by those pods?
Yes they are.
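For reference, a minimal sketch of that workaround, assuming the gha-runner-scale-set chart from the quickstart (release name and value are placeholders): either raise minRunners in the values file, e.g. minRunners: 10, or set it on an existing release:
helm upgrade <runner-scale-set-release> \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --namespace arc-runners \
  --reuse-values \
  --set minRunners=10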
@JohnLBergqvist I believe we're experiencing this as well and it started randomly in the last couple weeks after not touching anything related to this for many months. Did you ever figure out any workaround aside from forcing a pod to always be available?
@andrewbeckman Unfortunately not. I've noticed it seems to happen more if only a single job is queued after a period of relative inactivity. If multiple jobs are queued in a short space of time then there's a higher chance that more of them will schedule correctly - perhaps because the controller's main loop takes longer to finish, giving the runners more breathing room to accept a job?
I believe we are also experiencing this issue. Have not tried the workaround of setting minRunners > 0.
In our case the behavior on the GitHub UI side is that jobs randomly show as "Canceled", despite nobody canceling them and no subsequent pushes to the PR that would cancel the job.
I'm curious about these lines in your log snippet (we are seeing the same):
EphemeralRunner Checking if runner exists in GitHub service ...
EphemeralRunner Runner does not exist in GitHub service
Rather than a race condition between the runner controller and the runner pod, that could point to an issue with the GitHub Actions service API not reporting status correctly; in that case both the runner controller and the runner pod itself are behaving "correctly", in the sense that there's nothing on the GitHub service side for them to act on.
That would also jibe with the fact that this has popped up in the last couple weeks despite no apparent local changes.
In other words, this could be a problem with the GH Actions service, not with the ARC project.
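For reference, one way to see what the controller recorded about a runner around the time it was removed is to inspect the EphemeralRunner resource named in the "Controlled By" line of the describe output above (a sketch; the resource name is a placeholder):
kubectl -n arc-runners get ephemeralrunners
kubectl -n arc-runners describe ephemeralrunner <name>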
In other words, this could be a problem with the GH Actions service, not with the ARC project.
Yes, but ARC should take this into account if this is the new default behaviour for GitHub Actions itself going forward.
This bug is still happening for us as of today. @nikola-jokic, can you provide any update on this? It's preventing us from using the controller effectively.