actions-runner-controller

EphemeralRunners are stuck in failed state after the job succeeds

Open Dawnflash opened this issue 5 months ago • 1 comments

Checks

  • [x] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
  • [x] I am using charts that are officially provided

Controller Version

0.12.0

Deployment Method

ArgoCD

Checks

  • [x] This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • [x] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

Not consistently reproducible, there's a small chance for it to happen whenever a job succeeds.
Just run some workload, then check that no EphemeralRunners are stuck in a failed state.

Describe the bug

After the workload succeeds the controller marks the EphemeralRunner as failed and it doesn't create a new pod, just hangs in a failed state until manually removed. There is always exactly 1 failure with a timestamp: "status": { "failures": {"<uuid>": "<timestamp>"}}.

The runner probably lingers in the GitHub API a bit longer after the pod dies, and the controller treats that as a failure. It calls deletePodAsFailed, which is what's visible in the log excerpt.

After that it goes into backoff but it is never processed again. After the backoff period elapses there are no further logs available referencing the EphemeralRunner and it remains stuck and unmanaged.

For now we are removing these orphans periodically but the orphans seem to negatively impact CI job startup times.

The runners are eventually removed from the GitHub API: I checked manually and they were no longer present in GitHub. Yet the EphemeralRunners remain stuck.
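
One way to spot the affected objects is to list every EphemeralRunner that has a recorded failure. The following is a minimal sketch, not something taken from the controller: the function name `runners_with_failures` and the `gha-runners` namespace are assumptions, and it relies on kubectl printing `<none>` in custom-columns output when `.status.failures` is unset.

```shell
#!/bin/sh
# Print the names of EphemeralRunners that have at least one recorded failure.
# Input (stdin): lines of "NAME PHASE FAILURES", as produced by the kubectl
# command shown in the usage comment below.
# Assumption: kubectl prints "<none>" when .status.failures is not set.
runners_with_failures() {
  awk 'NF >= 3 && $3 != "<none>" { print $1 }'
}

# Usage against a live cluster (the namespace is an assumption):
#   kubectl -n gha-runners get ephemeralrunners --no-headers \
#     -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,FAILURES:.status.failures \
#     | runners_with_failures
```

The function only filters text, so it can be tried on saved output before pointing it at a cluster.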

Describe the expected behavior

The EphemeralRunner is cleanly removed once Github releases it. It should keep reconciling after the backoff period elapses instead of giving up on it silently.

Additional Context

Using a simple runnerset with DinD mode, cloud Github organization installation (via Github app).

Controller Logs

https://gist.github.com/Dawnflash/0a3fc1da0f99dfe67fc17b6987821a53

Runner Pod Logs

Don't have those but the jobs normally succeed. All green in Github.

Dawnflash avatar Jun 18 '25 17:06 Dawnflash

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

github-actions[bot] avatar Jun 18 '25 17:06 github-actions[bot]

We are suffering from the same issue

kaplan-shaked avatar Jun 19 '25 09:06 kaplan-shaked

@nikola-jokic I think this is connected to the change from this PR: https://github.com/actions/actions-runner-controller/pull/4059. We are seeing the same: the pod gets deleted, but the EphemeralRunner object is kept in the Running state, like this:

Status:
  Failures:
    9b0c5e46-bf2c-41cc-90f2-6fcc4f57f599:  2025-06-19T11:29:42Z
  Job Repository Name:                     xxxxshared-workflows-github-actions
  Job Workflow Ref:                        xxxxx.yaml@refs/heads/main
  Phase:                                   Running
  Ready:                                   false
  Runner Id:                               15718317

kyrylomiro avatar Jun 19 '25 11:06 kyrylomiro

Hey, this might not be related to the actual controller change. Looking at the log, we see that the ephemeral runner finishes, but it still exists in the service. That shouldn't happen. After the ephemeral runner is done executing the job, it should self-delete. Therefore, the issue might be on the back-end side. Can you please share the workflow run URL?

nikola-jokic avatar Jun 19 '25 12:06 nikola-jokic

Hey, is anyone running ARC with a version older than 0.12.0 and experiencing this?

nikola-jokic avatar Jun 19 '25 14:06 nikola-jokic

@nikola-jokic yes, I can share the workflow URL, but first I want to show you how bad the situation is. These are all our runners right now: https://gist.github.com/kyrylomiro/64c559e7d3608fd459443f4a25328c12. All of the ones that have errors actually don't have pods, but the state keeps saying Running, and our scheduling time is now reaching n minutes. I'm trying to find the workflow URL now.

kyrylomiro avatar Jun 19 '25 15:06 kyrylomiro

@nikola-jokic this is the URL and the exact job that caused the runner to end up in this state: m-runner-hvlmr-runner-4x2cs Running map[53171800-7476-4af6-8a7b-00f286b15671:2025-06-19T15:13:22Z]

kyrylomiro avatar Jun 19 '25 16:06 kyrylomiro

@nikola-jokic the run is this one where github is showing that workflow is still running but runner is already gone

kyrylomiro avatar Jun 19 '25 18:06 kyrylomiro

I can corroborate this issue, was going to open it myself but didn't get a chance to yet. The EphemeralRunners exist indefinitely with a .status.phase: Running. I'll share the cronjob setup I added to buy me time to continue investigating without blocking our users' job startups:

zombie-runner-cleanup.yaml

(obviously change the namespace as needed for your env)

apiVersion: v1
kind: ServiceAccount
metadata:
  name: zombie-runner-cleanup
  namespace: gha-runners
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: zombie-runner-cleanup
  namespace: gha-runners
rules:
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - get
      - list
  - apiGroups:
      - actions.github.com
    resources:
      - ephemeralrunners
    verbs:
      - get
      - list
      - delete
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: zombie-runner-cleanup
  namespace: gha-runners
subjects:
  - kind: ServiceAccount
    name: zombie-runner-cleanup
    namespace: gha-runners
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: zombie-runner-cleanup
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: zombie-runner-cleanup
  namespace: gha-runners
spec:
  schedule: "*/10 * * * *" # Every 10 minutes
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: zombie-runner-cleanup
          containers:
            - name: zombie-runner-cleanup
              image: alpine:3
              command:
                - /bin/sh
                - -c
                - |
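                  # Install the tools used below; the base alpine image ships neither.
                  apk add kubectl yq
                  # Abort on the first failing command from here on.
                  set -e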

                  echo ""
                  echo "Starting cleanup task..."

                  PODS_FILE="/tmp/pods.txt"
                  RUNNERS_FILE="/tmp/runners.txt"
                  DIFF_FILE="/tmp/runners_diff.txt"
                  NS="gha-runners"
                  SELECTOR="app.kubernetes.io/part-of=gha-runner-scale-set"

                  kubectl -n $NS get pods -l $SELECTOR -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' > $PODS_FILE
                  kubectl -n $NS get ephemeralrunners -o yaml | yq '.items[] | select(.status.phase == "Running") | .metadata.name' > $RUNNERS_FILE

                  ## Subtract Pods from running EphemeralRunners to find runners that no longer have a pod
                  comm -13 <(sort $PODS_FILE) <(sort $RUNNERS_FILE) > $DIFF_FILE

                  echo "Runner pods: $(wc -l $PODS_FILE | awk -F' ' '{print $1}')"
                  echo "Ephemeral runners: $(wc -l $RUNNERS_FILE | awk -F' ' '{print $1}')"
                  echo "Found $(wc -l $DIFF_FILE | awk -F' ' '{print $1}') ephemeral runners without pods"

                  for runner in $(cat $DIFF_FILE); do 
                      kubectl -n $NS delete ephemeralrunner $runner
                  done
                  rm $PODS_FILE $RUNNERS_FILE $DIFF_FILE

                  echo "Done."

          restartPolicy: OnFailure

This issue was not happening to our runners in 0.11.0. And it is not intermittent, in the sense that the overall issue doesn't come and go by the day, it is always affecting some percentage of our runners, but it IS intermittent in the sense that it seems to happen randomly to our jobs, with no discernible difference. It seems to affect roughly 2-20 jobs an hour for us. If we don't clean them up, the controller seems to be counting those as part of the current scale metric so it doesn't think it needs more runners added to meet demand, thus the increasing length of job queue.
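
For anyone who wants to preview what such a cleanup would delete before running it, the set difference can be computed on its own. This is a minimal sketch under assumptions: the function name `runners_without_pods` and the file layout (one name per line) are illustrative, and it uses temporary files instead of process substitution so it does not depend on the shell supporting `<(...)`.

```shell
#!/bin/sh
# Print the names listed in runners_file that are absent from pods_file.
# Both files hold one name per line; input order does not matter.
runners_without_pods() {
  pods_file=$1
  runners_file=$2
  tmp_pods=$(mktemp)
  tmp_runners=$(mktemp)
  # comm requires sorted input, so sort both lists first.
  sort -u "$pods_file" > "$tmp_pods"
  sort -u "$runners_file" > "$tmp_runners"
  # -1 drops lines unique to the pod list, -3 drops lines present in both,
  # leaving only runners that have no pod of the same name.
  comm -13 "$tmp_pods" "$tmp_runners"
  rm -f "$tmp_pods" "$tmp_runners"
}

# Usage: runners_without_pods /tmp/pods.txt /tmp/runners.txt
```

Nothing is deleted here; the output is only the list of candidate names.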

nimjor avatar Jun 19 '25 20:06 nimjor

@nikola-jokic I don't know if there are others on 0.11.0 still who are experiencing this same issue, but I doubt it, based on the fact that immediately after upgrading ours from 0.11.0 to 0.12.0 this issue started. The disappointing part is we eagerly upgraded because of https://github.com/actions/actions-runner-controller/issues/3685, only to get hit with this arguably worse bug.

nimjor avatar Jun 20 '25 12:06 nimjor

@nimjor we upgraded from 0.9.3 also because of that bug, now i don't know which one i prefer lol i can confirm that with your script we didn't have issues this morning, let's see thanks!

andresrsanchez avatar Jun 20 '25 13:06 andresrsanchez

In case it's helpful, I can share the controller logs for one example runner before the upgrade and one after.

We use Loki for logging, so these are the controller logs filtered by the EphemeralRunner name for relevancy, like:

{namespace="gha-runners", pod=~"github-runner-controller-gha-rs-controller-.*"} |= `<EPHEMERALRUNNER_NAME_HERE>`

0.11.0

📗 Controller Logs
2025-06-14T12:01:23Z	INFO	EphemeralRunnerSet	Created new ephemeral runner	{"version": "0.11.0", "ephemeralrunnerset": {"name":"redacted-ubuntu-blue-r8m4f","namespace":"gha-runners"}, "runner": "redacted-ubuntu-blue-r8m4f-runner-ljvb4"}
2025-06-14T12:01:23Z	INFO	EphemeralRunner	Adding finalizer	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:01:23Z	INFO	EphemeralRunner	Successfully added finalizer	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:01:23Z	INFO	EphemeralRunner	Adding runner registration finalizer	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:01:23Z	INFO	EphemeralRunner	Successfully added runner registration finalizer	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:01:23Z	INFO	EphemeralRunner	Creating new ephemeral runner registration and updating status with runner config	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:01:23Z	INFO	EphemeralRunner	Creating ephemeral runner JIT config	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:01:24Z	INFO	EphemeralRunner	Created ephemeral runner JIT config	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}, "runnerId": 10419516}
2025-06-14T12:01:24Z	INFO	EphemeralRunner	Updating ephemeral runner status with runnerId and runnerJITConfig	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:01:24Z	INFO	EphemeralRunner	Updated ephemeral runner status with runnerId and runnerJITConfig	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:01:24Z	INFO	EphemeralRunner	Creating new ephemeral runner secret for jitconfig.	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:01:24Z	INFO	EphemeralRunner	Creating new secret for ephemeral runner	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:01:24Z	INFO	EphemeralRunner	Created new secret spec for ephemeral runner	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:01:24Z	INFO	EphemeralRunner	Created ephemeral runner secret	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}, "secretName": "redacted-ubuntu-blue-r8m4f-runner-ljvb4"}
2025-06-14T12:01:24Z	INFO	EphemeralRunner	Creating new EphemeralRunner pod.	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:01:24Z	INFO	EphemeralRunner	Creating new pod for ephemeral runner	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:01:24Z	INFO	EphemeralRunner	Created new pod spec for ephemeral runner	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:01:24Z	INFO	EphemeralRunner	Created ephemeral runner pod	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}, "runnerScaleSetId": 263, "runnerName": "redacted-ubuntu-blue-r8m4f-runner-ljvb4", "runnerId": 10419516, "configUrl": "https://github.com/redacted", "podName": "redacted-ubuntu-blue-r8m4f-runner-ljvb4"}
2025-06-14T12:01:24Z	INFO	EphemeralRunner	Waiting for runner container status to be available	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:01:24Z	INFO	EphemeralRunner	Waiting for runner container status to be available	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:01:24Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:01:24Z	INFO	EphemeralRunner	Updating ephemeral runner status	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}, "statusPhase": "Pending", "statusReason": "", "statusMessage": "", "ready": false}
2025-06-14T12:01:24Z	INFO	EphemeralRunner	Updated ephemeral runner status	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:01:24Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:01:26Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:01:26Z	INFO	EphemeralRunner	Updating ephemeral runner status	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}, "statusPhase": "Running", "statusReason": "", "statusMessage": "", "ready": true}
2025-06-14T12:01:26Z	INFO	EphemeralRunner	Updated ephemeral runner status	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:01:26Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:03:03Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:03:54Z	INFO	EphemeralRunner	Checking if runner exists in GitHub service	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}, "runnerId": 10419516}
2025-06-14T12:03:54Z	INFO	EphemeralRunner	Runner does not exist in GitHub service	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}, "runnerId": 10419516}
2025-06-14T12:03:54Z	INFO	EphemeralRunner	Ephemeral runner has finished since it does not exist in the service anymore	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:03:54Z	INFO	EphemeralRunner	Updating ephemeral runner status to Finished	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:03:54Z	INFO	EphemeralRunner	EphemeralRunner status is marked as Finished	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:03:54Z	INFO	EphemeralRunner	Cleaning up resources after after ephemeral runner termination	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}, "phase": "Succeeded"}
2025-06-14T12:03:54Z	INFO	EphemeralRunner	Cleaning up the runner pod	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:03:54Z	INFO	EphemeralRunner	Deleting the runner pod	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:03:54Z	INFO	EphemeralRunnerSet	Deleting finished ephemeral runner	{"version": "0.11.0", "ephemeralrunnerset": {"name":"redacted-ubuntu-blue-r8m4f","namespace":"gha-runners"}, "name": "redacted-ubuntu-blue-r8m4f-runner-ljvb4"}
2025-06-14T12:03:54Z	INFO	EphemeralRunner	Deleted the runner pod	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:03:54Z	INFO	EphemeralRunner	Cleaning up the runner jitconfig secret	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:03:54Z	INFO	EphemeralRunner	Deleting the jitconfig secret	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:03:54Z	INFO	EphemeralRunner	Deleted jitconfig secret	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:03:54Z	INFO	EphemeralRunner	EphemeralRunner has already finished. Stopping reconciliation and waiting for EphemeralRunnerSet to clean it up	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}, "phase": "Succeeded"}
2025-06-14T12:03:54Z	INFO	EphemeralRunner	Trying to clean up runner from the service	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:03:54Z	INFO	EphemeralRunner	Removing runner from the service	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}, "runnerId": 10419516}
2025-06-14T12:03:54Z	INFO	EphemeralRunner	Removed runner from the service	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}, "runnerId": 10419516}
2025-06-14T12:03:54Z	INFO	EphemeralRunner	Runner is cleaned up from the service, removing finalizer	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:03:54Z	INFO	EphemeralRunner	Removed finalizer from ephemeral runner	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:03:54Z	INFO	EphemeralRunner	Finalizing ephemeral runner	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:03:54Z	INFO	EphemeralRunner	Cleaning up the runner pod	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:03:54Z	INFO	EphemeralRunner	Pod contains deletion timestamp	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:03:54Z	INFO	EphemeralRunner	Cleaning up the runner jitconfig secret	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:03:54Z	INFO	EphemeralRunner	Runner jitconfig secret is deleted	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:03:54Z	INFO	EphemeralRunner	Removing finalizer	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}
2025-06-14T12:03:54Z	INFO	EphemeralRunner	Successfully removed finalizer after cleanup	{"version": "0.11.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-r8m4f-runner-ljvb4","namespace":"gha-runners"}}

0.12.0

📕 Controller Logs
2025-06-18T14:37:01Z	INFO	EphemeralRunnerSet	Created new ephemeral runner	{"version": "0.12.0", "ephemeralrunnerset": {"name":"redacted-ubuntu-blue-pthw7","namespace":"gha-runners"}, "runner": "redacted-ubuntu-blue-pthw7-runner-r94hn"}
2025-06-18T14:37:01Z	INFO	EphemeralRunner	Adding finalizer	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}}
2025-06-18T14:37:01Z	INFO	EphemeralRunner	Successfully added finalizer	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}}
2025-06-18T14:37:01Z	INFO	EphemeralRunner	Adding runner registration finalizer	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}}
2025-06-18T14:37:01Z	INFO	EphemeralRunner	Successfully added runner registration finalizer	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}}
2025-06-18T14:37:01Z	INFO	EphemeralRunner	Creating new ephemeral runner registration and updating status with runner config	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}}
2025-06-18T14:37:01Z	INFO	EphemeralRunner	Creating ephemeral runner JIT config	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}}
2025-06-18T14:37:02Z	INFO	EphemeralRunner	Created ephemeral runner JIT config	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}, "runnerId": 10485340}
2025-06-18T14:37:02Z	INFO	EphemeralRunner	Updating ephemeral runner status with runnerId and runnerJITConfig	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}}
2025-06-18T14:37:02Z	INFO	EphemeralRunner	Updated ephemeral runner status with runnerId and runnerJITConfig	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}}
2025-06-18T14:37:02Z	INFO	EphemeralRunner	Creating new ephemeral runner secret for jitconfig.	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}}
2025-06-18T14:37:02Z	INFO	EphemeralRunner	Creating new secret for ephemeral runner	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}}
2025-06-18T14:37:02Z	INFO	EphemeralRunner	Created new secret spec for ephemeral runner	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}}
2025-06-18T14:37:02Z	INFO	EphemeralRunner	Created ephemeral runner secret	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}, "secretName": "redacted-ubuntu-blue-pthw7-runner-r94hn"}
2025-06-18T14:37:02Z	INFO	EphemeralRunner	Creating new EphemeralRunner pod.	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}}
2025-06-18T14:37:02Z	INFO	EphemeralRunner	Creating new pod for ephemeral runner	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}}
2025-06-18T14:37:02Z	INFO	EphemeralRunner	Created new pod spec for ephemeral runner	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}}
2025-06-18T14:37:02Z	INFO	EphemeralRunner	Created ephemeral runner pod	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}, "runnerScaleSetId": 321, "runnerName": "redacted-ubuntu-blue-pthw7-runner-r94hn", "runnerId": 10485340, "configUrl": "https://github.com/redacted", "podName": "redacted-ubuntu-blue-pthw7-runner-r94hn"}
2025-06-18T14:37:02Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}}
2025-06-18T14:37:02Z	INFO	EphemeralRunner	Updating ephemeral runner status	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}, "statusPhase": "Pending", "statusReason": "", "statusMessage": "", "ready": false}
2025-06-18T14:37:02Z	INFO	EphemeralRunner	Updated ephemeral runner status	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}}
2025-06-18T14:37:02Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}}
2025-06-18T14:37:03Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}}
2025-06-18T14:37:03Z	INFO	EphemeralRunner	Updating ephemeral runner status	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}, "statusPhase": "Running", "statusReason": "", "statusMessage": "", "ready": true}
2025-06-18T14:37:03Z	INFO	EphemeralRunner	Updated ephemeral runner status	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}}
2025-06-18T14:37:03Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}}
2025-06-18T14:38:09Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}}
2025-06-18T14:38:11Z	INFO	EphemeralRunner	Checking if runner exists in GitHub service	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}, "runnerId": 10485340}
2025-06-18T14:38:11Z	INFO	EphemeralRunner	Runner exists in GitHub service	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}, "runnerId": 10485340}
2025-06-18T14:38:11Z	INFO	EphemeralRunner	Ephemeral runner pod has finished, but the runner still exists in the service. Deleting the pod to restart it.	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}}
2025-06-18T14:38:11Z	INFO	EphemeralRunner	Deleting the ephemeral runner pod	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}, "podId": "d0689dc0-2c50-4f0d-898f-6502ec13d61e"}
2025-06-18T14:38:11Z	INFO	EphemeralRunner	Updating ephemeral runner status to track the failure count	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}}
2025-06-18T14:38:11Z	INFO	EphemeralRunner	EphemeralRunner pod is deleted and status is updated with failure count	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}}
2025-06-18T14:38:11Z	INFO	EphemeralRunner	Backing off the next reconciliation due to failure	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}, "lastFailure": "2025-06-18 14:38:11 +0000 UTC", "nextReconciliation": "2025-06-18T14:38:16Z", "requeueAfter": "4.360690333s"}
2025-06-18T14:38:12Z	INFO	EphemeralRunner	Backing off the next reconciliation due to failure	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}, "lastFailure": "2025-06-18 14:38:11 +0000 UTC", "nextReconciliation": "2025-06-18T14:38:16Z", "requeueAfter": "3.312928408s"}
2025-06-18T14:38:13Z	INFO	EphemeralRunner	Backing off the next reconciliation due to failure	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}, "lastFailure": "2025-06-18 14:38:11 +0000 UTC", "nextReconciliation": "2025-06-18T14:38:16Z", "requeueAfter": "2.396345065s"}
2025-06-18T14:38:13Z	INFO	EphemeralRunner	Backing off the next reconciliation due to failure	{"version": "0.12.0", "ephemeralrunner": {"name":"redacted-ubuntu-blue-pthw7-runner-r94hn","namespace":"gha-runners"}, "lastFailure": "2025-06-18 14:38:11 +0000 UTC", "nextReconciliation": "2025-06-18T14:38:16Z", "requeueAfter": "2.388365807s"}

One interesting thing about the 0.12.0 logs is there is no line like

2025-06-14T12:03:54Z	INFO	EphemeralRunnerSet	Deleting finished ephemeral runner	{"version": "0.11.0", "ephemeralrunnerset": {"name":"redacted-ubuntu-blue-r8m4f","namespace":"gha-runners"}, "name": "redacted-ubuntu-blue-r8m4f-runner-ljvb4"}

Both of these runners finished successfully, per the AutoscalingListener pod:

0.11.0

2025-06-14T12:03:53Z	INFO	listener-app.listener	Job completed message received.	{"RequestId": 0, "Result": "succeeded", "RunnerId": 10419516, "RunnerName": "redacted-ubuntu-blue-r8m4f-runner-ljvb4"}

0.12.0

2025-06-18T14:38:11Z	INFO	listener-app.listener	Job completed message received.	{"RequestId": 0, "Result": "succeeded", "RunnerId": 10485340, "RunnerName": "redacted-ubuntu-blue-pthw7-runner-r94hn"}

nimjor avatar Jun 20 '25 14:06 nimjor

We are running into the same issue after upgrading to 0.12.0. In the previous version, there would be Failed ephemeral runners that wouldn't get cleaned up but at least it was clear what the resolution should be or what to monitor.

With the new release, it even takes that away. The stats for autoscalingrunnersets get into an invalid state, and we only managed to figure it out from the age of the ephemeralrunners. The runner pods do not exist.

muawiakh avatar Jun 23 '25 10:06 muawiakh

We're experiencing a similar issue where Ephemeral Runners get stuck in the "Running" state without corresponding runner pods. To address this, I created a bash script that runs via CronJob to terminate such orphaned runners: https://gist.github.com/dx0x58/b2ae1982b5e9589677c1ddd9e3a6c24a

check-phantom-runners.sh --fix --namespace {{ arc_runners_namespace }} --age 150

The --age parameter specifies the threshold (in seconds) for the creation timestamp of ephemeral runners that should be terminated.
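
An age threshold of this kind can be sketched as follows. This is not the code from the gist above, only an illustration of the idea: the function name `is_older_than` is hypothetical, and it assumes GNU date (for the `-d` option) and a creation timestamp in the RFC 3339 form Kubernetes uses, such as `2025-06-19T11:29:42Z`.

```shell
#!/bin/sh
# Succeed (exit 0) when the given timestamp is more than max_age seconds
# before "now"; fail (exit 1) otherwise. Exit 2 if the timestamp cannot be parsed.
# Assumption: GNU date, which accepts RFC 3339 timestamps via -d.
is_older_than() {
  created=$1
  max_age=$2
  # Optional third argument: "now" as epoch seconds (defaults to the current time).
  now=${3:-$(date -u +%s)}
  created_epoch=$(date -u -d "$created" +%s) || return 2
  [ $((now - created_epoch)) -gt "$max_age" ]
}

# Usage: is_older_than "2025-06-19T11:29:42Z" 150 && echo "old enough to clean up"
```

An age check like this helps avoid deleting a runner that was created moments ago and simply has no pod yet.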

dx0x58 avatar Jun 23 '25 11:06 dx0x58

We are experiencing the same issue. Much like @muawiakh, in the previous version we had issues with ephemeralRunners ending up in the "Failed" state; now they are ending up in the "Running" state, but without any corresponding pods.

mklauber avatar Jun 23 '25 12:06 mklauber

I can corroborate this issue, was going to open it myself but didn't get a chance to yet. The EphemeralRunners exist indefinitely with a .status.phase: Running. I'll share the cronjob setup I added to buy me time to continue investigating without blocking our users' job startups: zombie-runner-cleanup.yaml

This issue was not happening to our runners in 0.11.0. It is not intermittent in the sense of coming and going by the day: it always affects some percentage of our runners. It IS intermittent in the sense that it seems to hit our jobs at random, with no discernible difference between them. It affects roughly 2-20 jobs an hour for us. If we don't clean them up, the controller seems to count them as part of the current scale metric, so it doesn't think it needs to add more runners to meet demand, hence the growing job queue.

Thanks for the script, this appears to be working for us.

mklauber avatar Jun 23 '25 13:06 mklauber

Hey everyone, I just wanted to let you all know that we identified the problem.

The check to see if the runner exists within the service can sometimes return a false positive. Even though this will be fixed on the back-end, PR #4142 should also resolve the issue, since we don't need this check.

As long as the runner image is properly built (i.e. the entrypoint will return the exit code of the runner), the check we are doing right now is not necessary. Therefore, we will remove it.
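
To illustrate what "return the exit code of the runner" means for a custom image, here is a minimal sketch of an entrypoint wrapper. The function name `run_and_propagate` is hypothetical, and this is not how the official image is built. The point is only that the wrapper must hand the child's exit code back unchanged rather than always exiting 0.

```python
import subprocess
import sys


def run_and_propagate(cmd):
    """Run the runner command and return the child's return code as-is.

    cmd is the runner command line as a list of arguments (an assumption
    for this sketch; a real entrypoint decides this itself).
    """
    completed = subprocess.run(cmd)
    return completed.returncode
```

A wrapper script would end with `sys.exit(run_and_propagate(sys.argv[1:]))` so that the container's exit status matches the runner's.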

nikola-jokic avatar Jun 24 '25 11:06 nikola-jokic

We are running into an issue similar to the one @kyrylomiro mentioned. But for me, the pod is also still running even though the workflow has completed, and this is causing high queue times for workflows.

Controller Version: 0.11.0

Status:
  Failures:
    0edacef2-b018-45e3-a863-d8ff765e6e63:  true
  Job Repository Name:                     xxxx
  Job Workflow Ref:                        xxxx.yml@refs/pull/763/merge
  Phase:                                   Running
  Ready:                                   true
  Runner Id:                               1207620

shivansh-ptr avatar Jun 24 '25 12:06 shivansh-ptr

@shivansh-ptr that sounds like a separate type of problem and belongs in a separate issue

nimjor avatar Jun 24 '25 12:06 nimjor

Hey @shivansh-ptr,

That is exactly the root of the problem. After the workflow is done, we check if the runner exists. Since it does (in this case), we mark the ephemeral runner as failed, which creates this entry in Failures. It would then start the crash loop (since at that point, the runner registration is invalid), and would cause the ephemeral runner to reach the failed state.
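
The sequence described above can be sketched as a toy decision function. This is an illustration only, not the controller's code: the function name, the action strings, and the `skip_service_check` flag (standing in for the behavior after the check is removed) are all hypothetical, and treating a non-zero exit code as a failure is an assumption of the sketch.

```python
def handle_finished_pod(exit_code, runner_exists_in_service, skip_service_check):
    """Return the action taken for a finished runner pod in this toy model."""
    if exit_code != 0:
        # Assumption: a non-zero exit code counts as a failure.
        return "record_failure_and_restart"
    if not skip_service_check and runner_exists_in_service:
        # Behavior described above: the job succeeded, but the service still
        # reports the runner, so a failure entry is recorded and the pod is
        # restarted with a registration that is no longer valid.
        return "record_failure_and_restart"
    return "mark_finished"
```

With `skip_service_check=False`, a false positive from the service turns a successful job into a failure entry. With `skip_service_check=True`, the outcome depends only on the exit code.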

nikola-jokic avatar Jun 24 '25 12:06 nikola-jokic

Hi, I'm using version 0.12.0 of both the controller and gha-runner-scale-set and experiencing something very similar, with the difference that my workflow doesn't get to run even once. From previous comments, my understanding is that the issue affects subsequent runs after at least one successful execution. In my case the EphemeralRunner stays in Pending status, and if I describe it I get the same failure as shown above: "status": { "failures": {"<uuid>": "<timestamp>"}}

Some useful excerpts of the controller's log are:

EphemeralRunner	Backing off the next reconciliation due to failure	{"version": "0.12.0", "ephemeralrunner": {"name":"eks-cluster-dev-9d5w4-runner-2qjpx","namespace":"github-actions-runners"}, "lastFailure": "2025-06-24 14:49:09 +0000 UTC", "nextReconciliation": "2025-06-24T14:49:14Z", "requeueAfter": "4.490815905s"}
......
EphemeralRunner	EphemeralRunner pod is deleted and status is updated with failure count	{"version": "0.12.0", "ephemeralrunner": {"name":"eks-cluster-dev-9d5w4-runner-2qjpx","namespace":"github-actions-runners"}}
......
EphemeralRunner	Updating ephemeral runner status to track the failure count	{"version": "0.12.0", "ephemeralrunner": {"name":"eks-cluster-dev-9d5w4-runner-2qjpx","namespace":"github-actions-runners"}}
......
EphemeralRunner	Ephemeral runner pod has finished, but the runner still exists in the service. Deleting the pod to restart it.	{"version": "0.12.0", "ephemeralrunner": {"name":"eks-cluster-dev-9d5w4-runner-2qjpx","namespace":"github-actions-runners"}}
......

@nikola-jokic I wonder if #4142 fixes the issue for my situation also, because from what was said before I believe it might only fix the scenario for certain runners (i.e. Runner Id N where N>1, but not for N=1)

Thanks!

mgs-garcia avatar Jun 24 '25 15:06 mgs-garcia

Any timeline on when the fix for this will be rolled out as part of a new release?

avadhanij avatar Jun 25 '25 03:06 avadhanij

Hey everyone, just to let you all know, we are targeting Monday for the next patch release that will include this fix.

nikola-jokic avatar Jun 26 '25 15:06 nikola-jokic

FYI, we still have a similar bug with 0.12.0 that others might be experiencing, though it isn't quite the same (our stuck runners stay forever in Running state with failures in status): https://github.com/actions/actions-runner-controller/issues/4148

tyrken avatar Jun 26 '25 20:06 tyrken

Hey everyone, we decided to publish a new release today! 0.12.1 is out! 😄

nikola-jokic avatar Jun 27 '25 12:06 nikola-jokic

@nikola-jokic Hi, I still face this issue even on version 0.12.1

Tal-E avatar Jul 07 '25 20:07 Tal-E

Hi @nikola-jokic, I still face the same issue on version 0.12.1, as mentioned here. The workflow is completed, but the pod is stuck in the Running state and an entry in Failures is created.

shivansh-ptr avatar Aug 01 '25 08:08 shivansh-ptr

Can confirm that I've experienced the same issue on 0.12.1.

[RUNNER 2025-09-30 23:53:10Z ERR  GitHubActionsService] POST request to https://run-actions-3-azure-eastus.actions.githubusercontent.com/1/acquirejob failed. HTTP Status: Conflict
[RUNNER 2025-09-30 23:53:10Z INFO Runner] Skipping message Job. Job message already acquired 'bb518439-0dcf-5bcf-b669-ee40b36f9a00'. job assignment is invalid: MissingKey

ivan-chepurin-immutable avatar Oct 02 '25 04:10 ivan-chepurin-immutable