
EphemeralRunner and its pods left stuck Running after runner OOMKILL

Open kennedy-whytech opened this issue 5 months ago • 11 comments

Checks

  • [x] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
  • [x] I am using charts that are officially provided

Controller Version

v0.12.1

Deployment Method

ArgoCD

Checks

  • [x] This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • [x] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Create an AutoscalingRunnerSet targeting an arm64 runner with limited memory.
2. Trigger a GitHub Actions job from a repository that consumes high memory during startup.
3. The pod will get OOMKilled, but:
   - It remains in a Running state.
   - The controller does not detect the failure.
   - The EphemeralRunner resource is not deleted.
4. Observe that new jobs remain stuck in the queued state due to the zombie runner.

Describe the bug

Ephemeral runner pods that are OOMKilled are not properly cleaned up by the controller. Although the pod is no longer functioning after the OOM kill, it stays in a Running state and the associated EphemeralRunner resource is not removed. This leaves zombie runners that block new job assignments, because the controller believes an active runner is still available.

(In v0.12.0, the killed pod is at least easier to detect, because the EphemeralRunner is left without any pods.)

Describe the expected behavior

When an ephemeral runner pod is OOMKilled, the controller should detect the failure, mark the associated EphemeralRunner resource as failed, clean up the pod, and optionally create a replacement runner if needed. This ensures that no stale resources or zombie runners block new job assignments.
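As a rough illustration of the detection step described above (not the controller's actual code), the sketch below checks a pod's `containerStatuses` for an `OOMKilled` termination. The helper name `oom_killed_containers` and the `terminated_pod` sample are hypothetical; `reported_pod` mirrors the shape of the status shown under "Runner Pod Logs" in this issue.

```python
from typing import Any


def oom_killed_containers(pod: dict[str, Any]) -> list[str]:
    """Return names of containers whose current or last state is an OOM kill."""
    killed = []
    statuses = pod.get("status", {}).get("containerStatuses", [])
    for cs in statuses:
        # Kubernetes records the kill under state.terminated, or under
        # lastState.terminated once the container has been restarted.
        for key in ("state", "lastState"):
            terminated = (cs.get(key) or {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                killed.append(cs["name"])
                break
    return killed


# Hypothetical pod whose runner container was terminated by the OOM killer.
terminated_pod = {"status": {"containerStatuses": [
    {"name": "runner",
     "state": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
     "lastState": {}},
]}}

# Shape of the status reported in this issue: still running, empty lastState.
reported_pod = {"status": {"containerStatuses": [
    {"name": "runner", "restartCount": 0,
     "state": {"running": {"startedAt": "2025-06-27T16:39:21Z"}},
     "lastState": {}},
]}}

print(oom_killed_containers(terminated_pod))
print(oom_killed_containers(reported_pod))
```

The second call returns an empty list, which suggests that a check based only on container status would not flag the pod reported here: its runner container is still listed as running with no recorded termination.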

Additional Context

githubConfigUrl: https://github.com/<REDACTED>

controllerServiceAccount:
  namespace: arc-system
  name: arc-controller-gha-rs-controller

githubConfigSecret:
  github_app_id: <REDACTED>
  github_app_installation_id: <REDACTED>
  github_app_private_key: <REDACTED>

containerMode:
  type: "dind"

minRunners: 4

runnerGroup: "k8s"

template:
  spec:
    serviceAccountName: gha-runner
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        resources:
          requests:
            cpu: "2"
            memory: 8Gi
          limits:
            memory: 16Gi
...

Controller Logs

The controller still reports the runner as healthy after the OOM kill:

2025-06-27T17:44:53Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"version": "0.12.1", "ephemeralrunner": {"name":"2cpu-runner-j4k8q","namespace":"arc-runners"}}
2025-06-27T17:44:53Z	INFO	EphemeralRunner	Updating ephemeral runner status	{"version": "0.12.1", "ephemeralrunner": {"name":"2cpu-runner-j4k8q","namespace":"arc-runners"}, "statusPhase": "Running", "statusReason": "", "statusMessage": "", "ready": true}
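For anyone scanning controller logs for runners in this state, here is a minimal parsing sketch. It assumes the log fields are tab-separated with a trailing JSON object, as the two lines above appear to be; the function name `parse_controller_line` is made up for illustration.

```python
import json


def parse_controller_line(line: str) -> dict:
    """Split one controller log line into its fixed fields and JSON payload."""
    # Assumed layout: timestamp, level, logger, message, JSON fields.
    timestamp, level, logger, message, fields = line.rstrip("\n").split("\t", 4)
    return {
        "timestamp": timestamp,
        "level": level,
        "logger": logger,
        "message": message,
        "fields": json.loads(fields),
    }


# The status-update line from the logs above.
sample = (
    '2025-06-27T17:44:53Z\tINFO\tEphemeralRunner\t'
    'Updating ephemeral runner status\t'
    '{"version": "0.12.1", "ephemeralrunner": '
    '{"name":"2cpu-runner-j4k8q","namespace":"arc-runners"}, '
    '"statusPhase": "Running", "statusReason": "", '
    '"statusMessage": "", "ready": true}'
)

entry = parse_controller_line(sample)
runner = entry["fields"]["ephemeralrunner"]["name"]
phase = entry["fields"]["statusPhase"]
print(runner, phase, entry["fields"]["ready"])
```

Filtering on `statusPhase` and `ready` this way makes it easier to list which runners the controller still considers healthy at a given timestamp.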

Runner Pod Logs

containerStatuses:
    - containerID: >
        containerd://
      image: >
      imageID: >
      lastState: {}
      name: runner
      ready: true
      restartCount: 0
      started: true
      state:
        running:
          startedAt: '2025-06-27T16:39:21Z'


Logs
√ Connected to GitHub
[RUNNER 2025-06-27 16:41:28Z INFO Terminal] WRITE LINE:
[RUNNER 2025-06-27 16:41:28Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-06-27 16:41:28Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-06-27 16:41:28Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
[RUNNER 2025-06-27 16:41:28Z ERR  GitHubActionsService] POST request to https://broker.actions.githubusercontent.com/session failed. HTTP Status: Conflict
[RUNNER 2025-06-27 16:41:28Z ERR  BrokerMessageListener] Catch exception during create session.
[RUNNER 2025-06-27 16:41:28Z ERR  BrokerMessageListener] GitHub.DistributedTask.WebApi.TaskAgentSessionConflictException: Error: Conflict
[RUNNER 2025-06-27 16:41:28Z ERR  BrokerMessageListener]    at GitHub.Actions.RunService.WebApi.BrokerHttpClient.CreateSessionAsync(TaskAgentSession session, CancellationToken cancellationToken)
[RUNNER 2025-06-27 16:41:28Z ERR  BrokerMessageListener]    at GitHub.Runner.Common.BrokerServer.CreateSessionAsync(TaskAgentSession session, CancellationToken cancellationToken)
[RUNNER 2025-06-27 16:41:28Z ERR  BrokerMessageListener]    at GitHub.Runner.Listener.BrokerMessageListener.CreateSessionAsync(CancellationToken token)
[RUNNER 2025-06-27 16:41:28Z INFO BrokerMessageListener] The session for this runner already exists.
[RUNNER 2025-06-27 16:41:28Z ERR  Terminal] WRITE ERROR: A session for this runner already exists.
A session for this runner already exists.
[RUNNER 2025-06-27 16:41:28Z INFO BrokerMessageListener] The session conflict exception haven't reached retry limit.
[RUNNER 2025-06-27 16:41:28Z INFO BrokerMessageListener] Sleeping for 30 seconds before retrying.
[RUNNER 2025-06-27 16:41:58Z INFO BrokerMessageListener] Attempt to create session.
[RUNNER 2025-06-27 16:41:58Z INFO BrokerMessageListener] Connecting to the Broker Server...
[RUNNER 2025-06-27 16:41:58Z INFO ConfigurationStore] HasCredentials()
[RUNNER 2025-06-27 16:41:58Z INFO ConfigurationStore] stored True
[RUNNER 2025-06-27 16:41:58Z INFO CredentialManager] GetCredentialProvider
[RUNNER 2025-06-27 16:41:58Z INFO CredentialManager] Creating type OAuth
[RUNNER 2025-06-27 16:41:58Z INFO CredentialManager] Creating credential type: OAuth
[RUNNER 2025-06-27 16:41:58Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-06-27 16:41:58Z INFO BrokerMessageListener] VssConnection created
[RUNNER 2025-06-27 16:41:58Z INFO BrokerMessageListener] Connecting to the Runner server...
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] EstablishVssConnection
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] Establish connection with 100 seconds timeout.
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] EstablishVssConnection
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] Establish connection with 60 seconds timeout.
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] EstablishVssConnection
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] Establish connection with 60 seconds timeout.
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:59Z INFO BrokerMessageListener] VssConnection created
[RUNNER 2025-06-27 16:41:59Z INFO Terminal] WRITE LINE:
√ Connected to GitHub
[RUNNER 2025-06-27 16:41:59Z INFO Terminal] WRITE LINE:
[RUNNER 2025-06-27 16:41:59Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-06-27 16:41:59Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-06-27 16:41:59Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
[RUNNER 2025-06-27 16:42:00Z INFO BrokerMessageListener] Session created.
[RUNNER 2025-06-27 16:42:00Z INFO Terminal] WRITE LINE: 2025-06-27 16:42:00Z: Runner reconnected.
2025-06-27 16:42:00Z: Runner reconnected.
[RUNNER 2025-06-27 16:42:00Z INFO Terminal] WRITE LINE: Current runner version: '2.325.0'
Current runner version: '2.325.0'
[RUNNER 2025-06-27 16:42:00Z INFO Terminal] WRITE LINE: 2025-06-27 16:42:00Z: Listening for Jobs
2025-06-27 16:42:00Z: Listening for Jobs
[RUNNER 2025-06-27 16:42:00Z INFO JobDispatcher] Set runner/worker IPC timeout to 30 seconds.

kennedy-whytech avatar Jun 27 '25 17:06 kennedy-whytech

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

github-actions[bot] avatar Jun 27 '25 17:06 github-actions[bot]

I guess I won't upgrade for now, then. I'm on v0.12.0 with the cronjob from @nimjor to delete zombie runners. I will follow the issue, thanks for opening!

andresrsanchez avatar Jun 30 '25 06:06 andresrsanchez

I am experiencing the same zombie runner problem on v0.12.0.

starcraft66 avatar Jul 03 '25 15:07 starcraft66

@starcraft66 we also ran into it; check this issue

andresrsanchez avatar Jul 04 '25 06:07 andresrsanchez

Hi, I still face this issue even on version v0.12.1

Tal-E avatar Jul 07 '25 20:07 Tal-E

Yes, same with version v0.12.1

rajesh-dhakad avatar Jul 10 '25 06:07 rajesh-dhakad

I see the same issue in other scenarios as well, listed here: https://github.com/actions/actions-runner-controller/issues/4168#issuecomment-3060721651

rajesh-dhakad avatar Jul 11 '25 06:07 rajesh-dhakad

I think the conditions for pod regeneration should be strict: once a job has been assigned and started, it should not be restarted on a clean runner, because jobs are not necessarily idempotent. Unless regeneration is limited to cases such as pod scheduling failures caused by a lack of nodes, or idle runners being drained, we will end up with a runner pod that does nothing.

air-hand avatar Jul 28 '25 14:07 air-hand

Hi @nikola-jokic

I have reproduced this behavior with k3d and cgroup v2.
Actions: https://github.com/air-hand/k8s-playground/actions/runs/16581762528/job/46899483499
Runner container spec: https://github.com/air-hand/k8s-playground/pull/12/files#diff-1060e9175697db9b5d7b27f35c5eb40453d00303678d8f5791b0f16c3eb0940dL32

$ kubectl get pod -w
NAME                                        READY   STATUS              RESTARTS   AGE
k8s-playground-runners-xmzb2-runner-5ktjm   0/1     ContainerCreating   0          25s
k8s-playground-runners-xmzb2-runner-jcf77   0/1     ContainerCreating   0          24s
k8s-playground-runners-xmzb2-runner-5ktjm   1/1     Running             0          32s
k8s-playground-runners-xmzb2-runner-jcf77   1/1     Running             0          31s
k8s-playground-runners-xmzb2-runner-5ktjm   0/1     OOMKilled           0          70s
k8s-playground-runners-xmzb2-runner-jcf77   0/1     OOMKilled           0          69s
k8s-playground-runners-xmzb2-runner-5ktjm   0/1     Terminating         0          70s
k8s-playground-runners-xmzb2-runner-jcf77   0/1     Terminating         0          69s
k8s-playground-runners-xmzb2-runner-jcf77   0/1     OOMKilled           0          70s
k8s-playground-runners-xmzb2-runner-5ktjm   0/1     OOMKilled           0          71s
k8s-playground-runners-xmzb2-runner-5ktjm   0/1     OOMKilled           0          72s
k8s-playground-runners-xmzb2-runner-5ktjm   0/1     OOMKilled           0          72s
k8s-playground-runners-xmzb2-runner-jcf77   0/1     OOMKilled           0          71s
k8s-playground-runners-xmzb2-runner-jcf77   0/1     OOMKilled           0          71s
k8s-playground-runners-xmzb2-runner-jcf77   0/1     Pending             0          0s
k8s-playground-runners-xmzb2-runner-5ktjm   0/1     Pending             0          0s
k8s-playground-runners-xmzb2-runner-5ktjm   0/1     Pending             0          0s
k8s-playground-runners-xmzb2-runner-jcf77   0/1     Pending             0          0s
k8s-playground-runners-xmzb2-runner-5ktjm   0/1     ContainerCreating   0          0s
k8s-playground-runners-xmzb2-runner-jcf77   0/1     ContainerCreating   0          0s
k8s-playground-runners-xmzb2-runner-jcf77   1/1     Running             0          1s
k8s-playground-runners-xmzb2-runner-5ktjm   1/1     Running             0          1s
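To pick this pattern out of a longer watch, the sketch below scans `kubectl get pod -w` output for pod names that were seen as OOMKilled and later reported Running again. The helper name `pods_restarted_after_oom` is hypothetical, and the column layout is assumed to match the output above; the sample is an abbreviated copy of it.

```python
def pods_restarted_after_oom(watch_output: str) -> list[str]:
    """Return pod names seen as OOMKilled and then Running again, in order."""
    oom_seen = set()
    recovered = []
    for line in watch_output.splitlines():
        cols = line.split()
        # Expect NAME, READY, STATUS, RESTARTS, AGE; skip the header row.
        if len(cols) < 5 or cols[0] == "NAME":
            continue
        name, status = cols[0], cols[2]
        if status == "OOMKilled":
            oom_seen.add(name)
        elif status == "Running" and name in oom_seen and name not in recovered:
            recovered.append(name)
    return recovered


# Abbreviated copy of the watch output above, for a single pod name.
sample = """\
NAME                                        READY   STATUS      RESTARTS   AGE
k8s-playground-runners-xmzb2-runner-5ktjm   1/1     Running     0          32s
k8s-playground-runners-xmzb2-runner-5ktjm   0/1     OOMKilled   0          70s
k8s-playground-runners-xmzb2-runner-5ktjm   0/1     Pending     0          0s
k8s-playground-runners-xmzb2-runner-5ktjm   1/1     Running     0          1s
"""

print(pods_restarted_after_oom(sample))
```

In the output above both pod names match this pattern: each goes OOMKilled, then reappears under the same name with its age reset to 0s.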

air-hand avatar Jul 28 '25 22:07 air-hand

Still present in v0.13.0

jecnua avatar Oct 29 '25 10:10 jecnua

Just trying to get some traction on this PR, #4272, which may resolve your issue. If you run kubectl get ephemeralrunners, do you see a lot of them sitting in a failed state?
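One way to answer that question at a glance is to tally runners by phase from the JSON form of the same listing (`kubectl get ephemeralrunners -o json`). This is only a sketch: it assumes each item exposes its phase at `.status.phase`, and the function name `count_phases`, the "Unknown" fallback label, and the sample items are all made up for illustration.

```python
from collections import Counter


def count_phases(runner_list: dict) -> Counter:
    """Tally EphemeralRunner items by their reported phase."""
    return Counter(
        # Assumption: the phase lives at .status.phase; items without one
        # are grouped under a placeholder "Unknown" label.
        item.get("status", {}).get("phase") or "Unknown"
        for item in runner_list.get("items", [])
    )


# Hypothetical listing shaped like `kubectl get ephemeralrunners -o json`.
sample = {
    "items": [
        {"metadata": {"name": "runner-a"}, "status": {"phase": "Running"}},
        {"metadata": {"name": "runner-b"}, "status": {"phase": "Failed"}},
        {"metadata": {"name": "runner-c"}, "status": {"phase": "Failed"}},
        {"metadata": {"name": "runner-d"}, "status": {}},
    ]
}

counts = count_phases(sample)
print(dict(counts))
```

A large "Failed" count relative to "Running" would match the situation the question is asking about.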

badstreff avatar Nov 08 '25 16:11 badstreff