
EphemeralRunner left stuck Running after node drain/pod termination

Open tyrken opened this issue 6 months ago • 9 comments

Checks

  • [x] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
  • [x] I am using charts that are officially provided

Controller Version

0.12.0

Deployment Method

Helm

Checks

  • [x] This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • [x] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Start a long-running GHA job
2. Run `kubectl drain <node-name>` on the EKS node running the pod for the allocated EphemeralRunner.  (Directly deleting the runner pod with `kubectl delete pod <pod-name>` also has the same effect, but isn't what we normally do/experience.)
3. Observe that the Runner disappears from the GHE list of active runners
4. Observe that the EphemeralRunner in K8s stays in the `Running` state forever (a quick check is shown below)
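
A quick way to spot the stuck state from the CLI is sketched below; it assumes the `gha-runner-scale-set` namespace and reads the same status fields as the workaround script further down:

kubectl get ephemeralrunners -n gha-runner-scale-set \
  -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,READY:.status.ready,JOB_REPO:.status.jobRepositoryName'
# A stuck EphemeralRunner keeps reporting PHASE=Running with READY=false
# long after its pod (and the GitHub-side Runner) is gone.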

Describe the bug

While the Runner (as recorded by the GitHub Actions list of org-attached Runners in the Settings page) goes away, the EphemeralRunner stays allocated forever.

This makes the AutoscalingRunnerSet think it doesn't need to scale up any further, and we observe long wait times for new Runners to be allocated to Jobs. Until this is fixed we have to delete the stuck EphemeralRunners manually with a script like the one below:

#!/usr/bin/env bash

set -euo pipefail

# Select EphemeralRunners that report phase=Running but ready=false and
# already have a job assigned, i.e. runners whose pod has gone away mid-job.
STUCK_RUNNERS=$(kubectl get ephemeralrunners -n gha-runner-scale-set -o json \
  | jq -r '.items[] | select(.status.phase == "Running" and .status.ready == false and .status.jobRepositoryName != null) | .metadata.name' \
  | tr '\n' ' ')

if [ -z "$STUCK_RUNNERS" ]; then
  echo "No stuck EphemeralRunners."
  exit 0
fi

echo "Deleting: $STUCK_RUNNERS"
kubectl delete ephemeralrunners -n gha-runner-scale-set $STUCK_RUNNERS
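
One way to run this periodically while waiting for a proper fix is a simple poll loop; a minimal sketch (the script filename and five-minute interval are placeholders):

# Re-run the cleanup script above every 5 minutes.
while true; do
  ./delete-stuck-ephemeralrunners.sh
  sleep 300
done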

Describe the expected behavior

For the ARC to at least notice the Runner has disappeared & to delete the stuck EphemeralRunner automatically.

The best solution would be for the ARC to resubmit the job for a re-run when it sees this condition, or at least emit a specific K8s event so that we could easily add such automation on top via a custom watcher.
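
Until such an event exists, a rough approximation of that watcher is to stream the EphemeralRunner objects themselves and react when one matches the stuck condition used in the script above (same namespace and status-field assumptions as there):

# Stream EphemeralRunner updates and flag any that look stuck
# (phase=Running, ready=false, job already assigned).
kubectl get ephemeralrunners -n gha-runner-scale-set --watch -o json \
  | jq -r --unbuffered 'select(.status.phase == "Running" and .status.ready == false and .status.jobRepositoryName != null) | .metadata.name' \
  | while read -r name; do
      echo "possible stuck EphemeralRunner: $name"
      # e.g. kubectl delete ephemeralrunners -n gha-runner-scale-set "$name"
    done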

Additional Context

Here's the redacted YAML for the ER itself, complete with status: https://gist.github.com/tyrken/7810a7c51739511585abcced460176ab#file-x86-64-h8696-runner-jzgnk-yaml

Controller Logs

See https://gist.github.com/tyrken/7810a7c51739511585abcced460176ab#file-controller-logs-txt - the node was drained around 19:11 UTC.

See also the listener logs, if they are of any interest: https://gist.github.com/tyrken/7810a7c51739511585abcced460176ab#file-listener-logs-txt

Runner Pod Logs

See https://gist.github.com/tyrken/7810a7c51739511585abcced460176ab#file-runner-logs-tsv

(Note: these were copied from our logging server, since the runner pod itself is deleted during bug reproduction.)

tyrken avatar Jun 26 '25 20:06 tyrken

@nikola-jokic - does this PR also fix the problem stated here? Thanks

luislongom avatar Jun 27 '25 10:06 luislongom

Same question as @luislongom

I haven't tried all scenarios. With 0.12.1, I removed the runner pod manually and the EphemeralRunners were removed along with it afterwards. 👍

kennedy-whytech avatar Jun 27 '25 14:06 kennedy-whytech

Created a separate issue here.

With 0.12.1, however, when a container is OOMKilled, the pod just hangs there.

This might be even worse, since the pod shows as healthy.

Here's the pod status:

  containerStatuses:
    - containerID: >
        containerd://
      image: >
      imageID: >
      lastState: {}
      name: runner
      ready: true
      restartCount: 0
      started: true
      state:
        running:
          startedAt: '2025-06-27T16:39:21Z'
And here's the runner log:

√ Connected to GitHub
[RUNNER 2025-06-27 16:41:28Z INFO Terminal] WRITE LINE:
[RUNNER 2025-06-27 16:41:28Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-06-27 16:41:28Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-06-27 16:41:28Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
[RUNNER 2025-06-27 16:41:28Z ERR  GitHubActionsService] POST request to https://broker.actions.githubusercontent.com/session failed. HTTP Status: Conflict
[RUNNER 2025-06-27 16:41:28Z ERR  BrokerMessageListener] Catch exception during create session.
[RUNNER 2025-06-27 16:41:28Z ERR  BrokerMessageListener] GitHub.DistributedTask.WebApi.TaskAgentSessionConflictException: Error: Conflict
[RUNNER 2025-06-27 16:41:28Z ERR  BrokerMessageListener]    at GitHub.Actions.RunService.WebApi.BrokerHttpClient.CreateSessionAsync(TaskAgentSession session, CancellationToken cancellationToken)
[RUNNER 2025-06-27 16:41:28Z ERR  BrokerMessageListener]    at GitHub.Runner.Common.BrokerServer.CreateSessionAsync(TaskAgentSession session, CancellationToken cancellationToken)
[RUNNER 2025-06-27 16:41:28Z ERR  BrokerMessageListener]    at GitHub.Runner.Listener.BrokerMessageListener.CreateSessionAsync(CancellationToken token)
[RUNNER 2025-06-27 16:41:28Z INFO BrokerMessageListener] The session for this runner already exists.
[RUNNER 2025-06-27 16:41:28Z ERR  Terminal] WRITE ERROR: A session for this runner already exists.
A session for this runner already exists.
[RUNNER 2025-06-27 16:41:28Z INFO BrokerMessageListener] The session conflict exception haven't reached retry limit.
[RUNNER 2025-06-27 16:41:28Z INFO BrokerMessageListener] Sleeping for 30 seconds before retrying.
[RUNNER 2025-06-27 16:41:58Z INFO BrokerMessageListener] Attempt to create session.
[RUNNER 2025-06-27 16:41:58Z INFO BrokerMessageListener] Connecting to the Broker Server...
[RUNNER 2025-06-27 16:41:58Z INFO ConfigurationStore] HasCredentials()
[RUNNER 2025-06-27 16:41:58Z INFO ConfigurationStore] stored True
[RUNNER 2025-06-27 16:41:58Z INFO CredentialManager] GetCredentialProvider
[RUNNER 2025-06-27 16:41:58Z INFO CredentialManager] Creating type OAuth
[RUNNER 2025-06-27 16:41:58Z INFO CredentialManager] Creating credential type: OAuth
[RUNNER 2025-06-27 16:41:58Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-06-27 16:41:58Z INFO BrokerMessageListener] VssConnection created
[RUNNER 2025-06-27 16:41:58Z INFO BrokerMessageListener] Connecting to the Runner server...
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] EstablishVssConnection
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] Establish connection with 100 seconds timeout.
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] EstablishVssConnection
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] Establish connection with 60 seconds timeout.
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] EstablishVssConnection
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] Establish connection with 60 seconds timeout.
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:59Z INFO BrokerMessageListener] VssConnection created
[RUNNER 2025-06-27 16:41:59Z INFO Terminal] WRITE LINE:
√ Connected to GitHub
[RUNNER 2025-06-27 16:41:59Z INFO Terminal] WRITE LINE:
[RUNNER 2025-06-27 16:41:59Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-06-27 16:41:59Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-06-27 16:41:59Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
[RUNNER 2025-06-27 16:42:00Z INFO BrokerMessageListener] Session created.
[RUNNER 2025-06-27 16:42:00Z INFO Terminal] WRITE LINE: 2025-06-27 16:42:00Z: Runner reconnected.
2025-06-27 16:42:00Z: Runner reconnected.
[RUNNER 2025-06-27 16:42:00Z INFO Terminal] WRITE LINE: Current runner version: '2.325.0'
Current runner version: '2.325.0'
[RUNNER 2025-06-27 16:42:00Z INFO Terminal] WRITE LINE: 2025-06-27 16:42:00Z: Listening for Jobs
2025-06-27 16:42:00Z: Listening for Jobs
[RUNNER 2025-06-27 16:42:00Z INFO JobDispatcher] Set runner/worker IPC timeout to 30 seconds.

kennedy-whytech avatar Jun 27 '25 16:06 kennedy-whytech

@kennedy-whytech - were you able to reproduce the same issue with 0.12.0? Thanks

luislongom avatar Jun 30 '25 08:06 luislongom

@luislongom In version 0.12.0 the issue appears slightly different: the pod was permanently killed, but the EphemeralRunners were left behind.

In version 0.12.1 the pod remains in a “healthy” state, but we still see those BrokerMessageListener messages and it is not able to pick up new jobs.

I've also seen similar behavior after Karpenter drains nodes and reschedules the pods, but not every time.

kennedy-whytech avatar Jun 30 '25 14:06 kennedy-whytech

We are experiencing the same issue on 0.12.1

We use Karpenter, but we don't think it causes this issue. Our pods have enough RAM and are never killed because of OOM.

Also, the stuck EphemeralRunners point to workflow files in GitHub, and after checking those workflows we saw no recent failing jobs, so something else must be causing EphemeralRunners to get stuck in Running.

It is hard to reproduce; I will update this comment if I find something else.

EDIT: correction, we had this issue on 0.12.0, not 0.12.1. We upgraded all our clusters (we have 10+ ARC deployments) to 0.12.1 yesterday and have seen no runners failing this way.

A fix merged into 0.12.1 resolved this issue.

Thank you!

alexanderkranga avatar Jul 02 '25 17:07 alexanderkranga

Hi, I still face this issue even on version v0.12.1

Tal-E avatar Jul 07 '25 20:07 Tal-E

I see the same issue with v0.12.1; details here: https://github.com/actions/actions-runner-controller/issues/4168#issuecomment-3060721651

rajesh-dhakad avatar Jul 11 '25 11:07 rajesh-dhakad

Hi,

We're also experiencing an issue with actions-runner-controller version 0.12.1. We're using Karpenter for autoscaling nodes, but this does not appear to be an autoscaler-related problem.

Scenario Description

When a new GitHub Actions workflow is triggered, an ephemeral runner pod is correctly created and begins executing the job. However, if the node is deleted (either manually or by Karpenter), the following behavior occurs:

The runner pod is terminated, and the GitHub runner is correctly deregistered. However, the corresponding EphemeralRunner resource remains in Running state. A new runner pod is then continuously spawned with the same name on a different node. Even though the GitHub workflow fails, the ephemeral runner continues to respawn indefinitely.

Additional Observations

The GitHub job being executed is simply `sleep 5000`, so this is not caused by an OOMKill. We're allocating 5 vCPUs and 17 GB of memory to both the runner and DinD containers. After node deletion, the original runner pod briefly enters a Completed state. The GitHub UI shows the job has failed and there are no active workflow runs, yet the runner continues to respawn.

After Node Deletion:

  • Runner pod goes to Completed
  • EphemeralRunner remains in Running
  • Node deletion proceeds normally

(Screenshots: the GitHub workflow fails; no workflows are running, but the runner keeps respawning.)

Once the node is deleted, the EphemeralRunner controller does not correctly detect the runner pod termination. Despite the job failing and the GitHub runner being deregistered, the EphemeralRunner remains stuck in Running, causing new runner pods to be continuously created.

This leads to orphaned and perpetually recreated runner pods, even though no jobs are pending.

Thank you!

ArdiannS avatar Jul 21 '25 20:07 ArdiannS