EphemeralRunner left stuck in `Running` after node drain / pod termination
Checks
- [x] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- [x] I am using charts that are officially provided
Controller Version
0.12.0
Deployment Method
Helm
Checks
- [x] This isn't a question or user support case (For Q&A and community support, go to Discussions).
- [x] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
1. Start a long-running GHA job
2. Run `kubectl drain <node-name>` on the EKS node running the pod for the allocated EphemeralRunner. (Directly deleting the runner pod with `kubectl delete pod <pod-name>` has the same effect, but isn't what we normally do or experience.)
3. Observe that the runner disappears from the GHE list of active runners
4. Observe that the EphemeralRunner in K8s stays in `Running` state forever (example commands below)
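For reference, a minimal reproduction sketch, assuming the `gha-runner-scale-set` namespace used in the cleanup script below; substitute your own node and runner names:

# Drain the node hosting the runner pod of a busy EphemeralRunner
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# The runner disappears from the GitHub settings page, but the
# EphemeralRunner never leaves the Running phase
kubectl get ephemeralrunners -n gha-runner-scale-set
kubectl get ephemeralrunner <runner-name> -n gha-runner-scale-set -o jsonpath='{.status.phase}{"\n"}'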
Describe the bug
While the runner (as recorded by the GitHub Actions list of org-attached runners in the Settings page) goes away, the EphemeralRunner stays allocated forever.
This leads the AutoscalingRunnerSet to think it doesn't need to scale up any further, and we observe long wait times for new runners to be allocated to jobs. Until this is fixed we have to delete the stuck EphemeralRunners manually with a script like the one below:
#!/usr/bin/env bash
set -euo pipefail

# Find EphemeralRunners that still report phase "Running" but are no longer
# ready and have a job assigned (jobRepositoryName set), i.e. the stuck state
# described above.
STUCK_RUNNERS=$(kubectl get ephemeralrunners -n gha-runner-scale-set -o json \
  | jq -r '.items[] | select(.status.phase == "Running" and .status.ready == false and .status.jobRepositoryName != null) | .metadata.name' \
  | tr '\n' ' ')

if [ -z "$STUCK_RUNNERS" ]; then
  echo "No stuck EphemeralRunners."
  exit 0
fi

echo "Deleting: $STUCK_RUNNERS"
# Intentionally left unquoted so each runner name is passed as a separate argument.
kubectl delete ephemeralrunners -n gha-runner-scale-set $STUCK_RUNNERS
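As an extra safety check before deleting, one can confirm GitHub really no longer lists the runner. A sketch using the GitHub CLI, assuming org-level runner registration and that EphemeralRunner names match the registered runner names (adjust the API host for GHES):

# List the runner names GitHub still considers registered for the org
gh api --paginate /orgs/<org>/actions/runners --jq '.runners[].name'

# Any stuck EphemeralRunner reported by the script above that is absent from
# this list has already been deregistered and is safe to delete.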
Describe the expected behavior
For ARC to at least notice that the runner has disappeared and to delete the stuck EphemeralRunner automatically.
The best solution would be for ARC to resubmit the job for a re-run when it sees this condition, or at least emit a specific K8s event so that we could easily build such automation on top via a custom watcher.
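For illustration, if ARC emitted a dedicated event on the EphemeralRunner when its backing runner vanished, such a watcher could be as simple as the sketch below. The reason name RunnerDisappeared is hypothetical; ARC does not currently emit such an event.

# Watch events attached to EphemeralRunner objects in the scale-set namespace,
# filtering on a (hypothetical) reason emitted when the runner disappears.
kubectl get events -n gha-runner-scale-set --watch \
  --field-selector involvedObject.kind=EphemeralRunner,reason=RunnerDisappeared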
Additional Context
Here's the redacted YAML for the ER itself, complete with status: https://gist.github.com/tyrken/7810a7c51739511585abcced460176ab#file-x86-64-h8696-runner-jzgnk-yaml
Controller Logs
See https://gist.github.com/tyrken/7810a7c51739511585abcced460176ab#file-controller-logs-txt - the node was drained around 19:11 UTC.
See also the listener logs if of any interest: https://gist.github.com/tyrken/7810a7c51739511585abcced460176ab#file-listener-logs-txt
Runner Pod Logs
See https://gist.github.com/tyrken/7810a7c51739511585abcced460176ab#file-runner-logs-tsv
(Note: copied from our logging server, since the runner pod is deleted during bug reproduction.)
@nikola-jokic - does this PR also fix the problem stated here? Thanks
Same question as @luislongom
I haven't tried every scenario, but with 0.12.1 I removed the runner pod manually and the EphemeralRunners were removed along with it afterwards. 👍
Created a separate issue here.
With 0.12.1, however, when a container is OOMKilled, the pod just hangs there.
This might be even worse, since it reports as healthy.
Here's the status:
containerStatuses:
  - containerID: >
      containerd://
    image: >
    imageID: >
    lastState: {}
    name: runner
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: '2025-06-27T16:39:21Z'
√ Connected to GitHub
[RUNNER 2025-06-27 16:41:28Z INFO Terminal] WRITE LINE:
[RUNNER 2025-06-27 16:41:28Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-06-27 16:41:28Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-06-27 16:41:28Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
[RUNNER 2025-06-27 16:41:28Z ERR GitHubActionsService] POST request to https://broker.actions.githubusercontent.com/session failed. HTTP Status: Conflict
[RUNNER 2025-06-27 16:41:28Z ERR BrokerMessageListener] Catch exception during create session.
[RUNNER 2025-06-27 16:41:28Z ERR BrokerMessageListener] GitHub.DistributedTask.WebApi.TaskAgentSessionConflictException: Error: Conflict
[RUNNER 2025-06-27 16:41:28Z ERR BrokerMessageListener] at GitHub.Actions.RunService.WebApi.BrokerHttpClient.CreateSessionAsync(TaskAgentSession session, CancellationToken cancellationToken)
[RUNNER 2025-06-27 16:41:28Z ERR BrokerMessageListener] at GitHub.Runner.Common.BrokerServer.CreateSessionAsync(TaskAgentSession session, CancellationToken cancellationToken)
[RUNNER 2025-06-27 16:41:28Z ERR BrokerMessageListener] at GitHub.Runner.Listener.BrokerMessageListener.CreateSessionAsync(CancellationToken token)
[RUNNER 2025-06-27 16:41:28Z INFO BrokerMessageListener] The session for this runner already exists.
[RUNNER 2025-06-27 16:41:28Z ERR Terminal] WRITE ERROR: A session for this runner already exists.
A session for this runner already exists.
[RUNNER 2025-06-27 16:41:28Z INFO BrokerMessageListener] The session conflict exception haven't reached retry limit.
[RUNNER 2025-06-27 16:41:28Z INFO BrokerMessageListener] Sleeping for 30 seconds before retrying.
[RUNNER 2025-06-27 16:41:58Z INFO BrokerMessageListener] Attempt to create session.
[RUNNER 2025-06-27 16:41:58Z INFO BrokerMessageListener] Connecting to the Broker Server...
[RUNNER 2025-06-27 16:41:58Z INFO ConfigurationStore] HasCredentials()
[RUNNER 2025-06-27 16:41:58Z INFO ConfigurationStore] stored True
[RUNNER 2025-06-27 16:41:58Z INFO CredentialManager] GetCredentialProvider
[RUNNER 2025-06-27 16:41:58Z INFO CredentialManager] Creating type OAuth
[RUNNER 2025-06-27 16:41:58Z INFO CredentialManager] Creating credential type: OAuth
[RUNNER 2025-06-27 16:41:58Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-06-27 16:41:58Z INFO BrokerMessageListener] VssConnection created
[RUNNER 2025-06-27 16:41:58Z INFO BrokerMessageListener] Connecting to the Runner server...
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] EstablishVssConnection
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] Establish connection with 100 seconds timeout.
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] EstablishVssConnection
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] Establish connection with 60 seconds timeout.
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] EstablishVssConnection
[RUNNER 2025-06-27 16:41:58Z INFO RunnerServer] Establish connection with 60 seconds timeout.
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:58Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[RUNNER 2025-06-27 16:41:59Z INFO BrokerMessageListener] VssConnection created
[RUNNER 2025-06-27 16:41:59Z INFO Terminal] WRITE LINE:
√ Connected to GitHub
[RUNNER 2025-06-27 16:41:59Z INFO Terminal] WRITE LINE:
[RUNNER 2025-06-27 16:41:59Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-06-27 16:41:59Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-06-27 16:41:59Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
[RUNNER 2025-06-27 16:42:00Z INFO BrokerMessageListener] Session created.
[RUNNER 2025-06-27 16:42:00Z INFO Terminal] WRITE LINE: 2025-06-27 16:42:00Z: Runner reconnected.
2025-06-27 16:42:00Z: Runner reconnected.
[RUNNER 2025-06-27 16:42:00Z INFO Terminal] WRITE LINE: Current runner version: '2.325.0'
Current runner version: '2.325.0'
[RUNNER 2025-06-27 16:42:00Z INFO Terminal] WRITE LINE: 2025-06-27 16:42:00Z: Listening for Jobs
2025-06-27 16:42:00Z: Listening for Jobs
[RUNNER 2025-06-27 16:42:00Z INFO JobDispatcher] Set runner/worker IPC timeout to 30 seconds.
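A quick way to spot a runner stuck in this state is to grep the runner container logs for the session-conflict message. A sketch, assuming the `gha-runner-scale-set` namespace and the `runner` container name shown in the status above:

# Flag runner pods whose listener keeps hitting the session conflict
for pod in $(kubectl get pods -n gha-runner-scale-set -o name); do
  if kubectl logs -n gha-runner-scale-set "$pod" -c runner 2>/dev/null \
      | grep -q "A session for this runner already exists"; then
    echo "possibly stuck: $pod"
  fi
done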
@kennedy-whytech - were you able to reproduce the same issue with 0.12.0? Thanks
@luislongom In version 0.12.0 the issue looked slightly different: the pod was permanently killed, but the EphemeralRunners were left behind.
In version 0.12.1, the pod remains in a "healthy" state, but we still see those BrokerMessageListener messages and the runner is not able to pick up new jobs.
I've also seen similar behavior after Karpenter drains nodes and reschedules the pods, but not every time.
We are experiencing the same issue on 0.12.1
We use Karpenter, but we don't think it causes this issue. Our pods have enough RAM and are never OOM-killed.
Also, the stuck EphemeralRunners point to workflow files in GitHub, and after checking those workflows we saw no recent failing jobs, so something else must be causing EphemeralRunners to get stuck in Running.
It is hard to reproduce; I will update this comment if I find anything else.
EDIT: correction, we had this issue on 0.12.0, not 0.12.1. We upgraded all our clusters (we have 10+ ARC deployments) to 0.12.1 yesterday and have not seen runners failing this way since.
There was a fix merged into 0.12.1 that fixed this issue
Thank you!
Hi, I still face this issue even on version v0.12.1
I see the same issue with v0.12.1; details here: https://github.com/actions/actions-runner-controller/issues/4168#issuecomment-3060721651
Hi,
We're also experiencing an issue with actions-runner-controller version 0.12.1. We're using Karpenter for autoscaling nodes, but this does not appear to be an autoscaler-related problem.
Scenario Description
When a new GitHub Actions workflow is triggered, an ephemeral runner pod is correctly created and begins executing the job. However, if the node is deleted (either manually or by Karpenter), the following behavior occurs:
- The runner pod is terminated, and the GitHub runner is correctly deregistered.
- However, the corresponding EphemeralRunner resource remains in Running state.
- A new runner pod is then continuously spawned with the same name on a different node.
- Even though the GitHub workflow fails, the ephemeral runner continues to respawn indefinitely.
Additional Observations
- The GitHub job being executed is simply `sleep 5000`, so this is not caused by an OOM kill.
- We're allocating 5 vCPUs and 17 GB of memory to both the runner and DinD containers.
- After node deletion, the original runner pod briefly enters a Completed state.
- The GitHub UI shows the job has failed and there are no active workflow runs, yet the runner continues to respawn.
After Node Deletion:
- Runner pod goes to Completed
- EphemeralRunner remains in Running
- Node deletion proceeds normally
This leads to orphaned and perpetually recreated runner pods, even though no jobs are pending.
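As a stopgap, deleting the stuck EphemeralRunner breaks the respawn loop. A sketch (substitute the stuck resource name and your install namespace; this is the same approach as the cleanup script at the top of the issue):

# Delete the stuck EphemeralRunner; the controller stops recreating its pod
kubectl delete ephemeralrunner <stuck-runner-name> -n <namespace>

# Confirm nothing is left behind
kubectl get ephemeralrunners,pods -n <namespace>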
Thank you!