Runners not terminating after job completion – blocked queue due to token expiry (v0.12.1)
Checks
- [x] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- [x] I am using charts that are officially provided
Controller Version
0.12.1
Deployment Method
Helm
Checks
- [x] This isn't a question or user support case (For Q&A and community support, go to Discussions).
- [x] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
Project: Actions-Runner-Controller (not Summerwind)
Version: 0.12.1
Deployment Method: Helm
Kubernetes Version: v1.32.5
---
1. Deploy Actions-Runner-Controller version 0.11.0 via Helm on a Kubernetes cluster (v1.32.5) with GitHub Enterprise integration.
2. Verify that runners operate correctly under normal load.
3. Upgrade to version 0.12.1 by fully removing all ARC-related resources, including CustomResourceDefinitions (CRDs), and perform a clean installation using Helm.
4. Reconfigure and deploy runners as before.
5. Execute various GitHub Actions workflows across multiple repositories.
6. After some time, observe that:
• Certain jobs appear completed or failed on GitHub Enterprise.
• Some runner pods remain active indefinitely and do not exit.
• Logs within those pods show repeated registration failures with messages like:
"Registration was not found or is not medium trust."
• The issue affects different runners at different times and across various repositories and workflows, with no identifiable pattern (see the commands after this list for how we spot the affected pods).
7. As a result, the runner pool becomes blocked, and new jobs are not executed until affected pods are manually terminated.
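The commands below are roughly what we use to spot affected pods by hand (the label and namespace come from our scale-set configuration further down; adjust them to your setup):

# Runner pods that should have exited but are still Running
kubectl get pods -n enterprise-gpr-m-02 -l app.kubernetes.io/component=runner

# Tail the listener output of a suspect pod; affected pods loop roughly every
# 30 seconds on "Registration was not found or is not medium trust."
POD=enterprise-gpr-m-runner-xxxxx   # placeholder pod name
kubectl logs "$POD" -c runner -n enterprise-gpr-m-02 --tail=50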
Describe the bug
Hello team,
After upgrading to ARC version 0.11.0, we noticed that some runners enter a state where they run indefinitely and block new jobs from being picked up. Inside the runner containers, we observed registration failures due to expired tokens.
On the GitHub Enterprise side, those jobs appear to have already completed or failed, but the corresponding runners keep running. It seems that the listener is unable to properly clean up the runner after a job finishes and continuously attempts to re-register it with GitHub.
We were hoping this issue would be resolved in version 0.12.1, but unfortunately, it still persists. In one instance, a pod even ended up in an evicted state.
As a temporary workaround to prevent the job queue from stalling, we’ve implemented a cron job that monitors runner logs and forcefully terminates any pod whose log contains: "Registration was not found or is not medium trust." This keeps the runners processing jobs but doesn’t address the root cause.
Is this a known issue, and do you have any recommendations or a potential fix?
Describe the expected behavior
Runners should terminate properly after job completion or failure. They should not attempt to re-register if the job has already ended and the registration token has expired. Additionally, the controller should ensure that expired or stuck runners are cleaned up automatically to avoid blocking the job queue.
Additional Context
githubConfigUrl: "https://github.enterprise.example.com/enterprises/***"
githubConfigSecret: "github-token"
proxy:
http:
url: http://**********
https:
url: http://**********
noProxy:
- localhost
- 127.0.0.1
- 10.0.0.0/8
- 172.16.0.0/12
- 192.168.0.0/16
maxRunners: 5
minRunners: 1
runnerGroup: "enterprise-gpr-m-02"
runnerScaleSetName: "enterprise-gpr-m"
labels:
group: enterprise-runners
githubServerTLS:
certificateFrom:
configMapKeyRef:
name: ca
key: ca.crt
runnerMountPath: /usr/local/share/ca-certificates/
template:
spec:
initContainers:
- name: init-dind-externals
image: actions/actions-runner/full:2.322.1
imagePullPolicy: Always
command: ["cp", "-r", "-v", "/home/runner/externals/.", "/home/runner/tmpDir/"]
volumeMounts:
- name: dind-externals
mountPath: /home/runner/tmpDir
resources:
requests:
cpu: "50m"
memory: "200Mi"
limits:
memory: "250Mi"
- name: init-dind-rootless
image: docker:27.3.1-dind-rootless
imagePullPolicy: IfNotPresent
command:
- sh
- -c
- |
set -x
cp -a /etc/. /dind-etc/
echo 'runner:x:1001:1001:runner:/home/runner:/bin/ash' >> /dind-etc/passwd
echo 'runner:x:1001:' >> /dind-etc/group
echo 'runner:100000:65536' >> /dind-etc/subgid
echo 'runner:100000:65536' >> /dind-etc/subuid
chmod 755 /dind-etc;
chmod u=rwx,g=rx+s,o=rx /dind-home
chown 1001:1001 /dind-home
mkdir -p /var/lib/docker
chmod u=rwx,g=rx+s,o=rx /var/lib/docker
chown -R 1001:1001 /var/lib/docker
securityContext:
runAsUser: 0
volumeMounts:
- mountPath: /dind-etc
name: dind-etc
- mountPath: /dind-home
name: dind-home
- name: docker-data-root
mountPath: /var/lib/docker
resources:
requests:
cpu: "50m"
memory: "200Mi"
limits:
memory: "250Mi"
- name: init-qemu-registrar
image: tonistiigi/binfmt:latest
command: [ "/usr/bin/binfmt", "--install", "all" ]
imagePullPolicy: Always
securityContext:
runAsUser: 0
privileged: true
resources:
requests:
cpu: "25m"
memory: "50Mi"
limits:
memory: "100Mi"
containers:
- name: runner
image: actions/actions-runner/full:2.322.1
imagePullPolicy: Always
command: ["/home/runner/run.sh"]
env:
- name: DOCKER_HOST
value: unix:///run/user/1001/docker.sock
volumeMounts:
- name: work
mountPath: /home/runner/_work
- name: dind-sock
mountPath: /var/run
- mountPath: /tmp
name: tmpdir
- name: sysfs
mountPath: /sys
readOnly: false
resources:
requests:
cpu: "100m"
memory: "500Mi"
limits:
memory: "500Mi"
securityContext:
capabilities:
add:
- SYS_ADMIN
- SYS_PTRACE
- DAC_OVERRIDE
- FOWNER
- CHOWN
- SETUID
- SETGID
runAsUser: 1001
runAsGroup: 1001
privileged: false
- name: dind
image: docker:27.3.1-dind-rootless
imagePullPolicy: IfNotPresent
args:
- dockerd
- --config-file=/etc/docker/daemon.json
securityContext:
privileged: true
runAsUser: 1001
runAsGroup: 1001
capabilities:
add:
- SYS_ADMIN
- MKNOD
- CHOWN
- SETUID
- SETGID
resources:
requests:
cpu: "200m"
memory: "650Mi"
limits:
memory: "3346Mi"
volumeMounts:
- name: work
mountPath: /home/runner/_work
- name: dind-sock
mountPath: /var/run
- name: dind-externals
mountPath: /home/runner/externals
- name: dind-etc
mountPath: /etc
- name: dind-home
mountPath: /home/runner
- name: docker-data-root
mountPath: /var/lib/docker
- name: sysfs
mountPath: /sys
readOnly: false
volumes:
- name: work
emptyDir: {}
- name: dind-externals
emptyDir: {}
- name: dind-sock
emptyDir: {}
- name: dind-etc
emptyDir: {}
- name: dind-home
emptyDir: {}
- name: tmpdir
emptyDir: {}
- name: docker-data-root
emptyDir: {}
- name: sysfs
hostPath:
path: /sys
type: Directory
Controller Logs
-
Runner Pod Logs
√ Connected to GitHub
[RUNNER 2025-07-16 07:45:34Z INFO Terminal] WRITE LINE:
[RUNNER 2025-07-16 07:45:34Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-07-16 07:45:35Z ERR GitHubActionsService] POST request to https://github.enterprise.example.com/_services/vstoken/_apis/oauth2/token/eb530d92-6032-4cac-8ece-acf7fa59845f failed. HTTP Status: BadRequest
[RUNNER 2025-07-16 07:45:35Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
[RUNNER 2025-07-16 07:45:35Z ERR MessageListener] Catch exception during create session.
[RUNNER 2025-07-16 07:45:35Z ERR MessageListener] GitHub.Services.OAuth.VssOAuthTokenRequestException: Registration was not found or is not medium trust. ClientType:
[RUNNER 2025-07-16 07:45:35Z ERR MessageListener] at GitHub.Services.OAuth.VssOAuthTokenProvider.OnGetTokenAsync(IssuedToken failedToken, CancellationToken cancellationToken)
[RUNNER 2025-07-16 07:45:35Z ERR MessageListener] at GitHub.Services.Common.IssuedTokenProvider.GetTokenOperation.GetTokenAsync(VssTraceActivity traceActivity)
[RUNNER 2025-07-16 07:45:35Z ERR MessageListener] at GitHub.Services.Common.IssuedTokenProvider.GetTokenAsync(IssuedToken failedToken, CancellationToken cancellationToken)
[RUNNER 2025-07-16 07:45:35Z ERR MessageListener] at GitHub.Services.Common.VssHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
[RUNNER 2025-07-16 07:45:35Z ERR MessageListener] at GitHub.Services.Common.VssHttpRetryMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
[RUNNER 2025-07-16 07:45:35Z ERR MessageListener] at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
[RUNNER 2025-07-16 07:45:35Z ERR MessageListener] at GitHub.Services.WebApi.VssHttpClientBase.SendAsync(HttpRequestMessage message, HttpCompletionOption completionOption, Object userState, CancellationToken cancellationToken)
[RUNNER 2025-07-16 07:45:35Z ERR MessageListener] at GitHub.Services.WebApi.VssHttpClientBase.SendAsync[T](HttpRequestMessage message, Object userState, CancellationToken cancellationToken)
[RUNNER 2025-07-16 07:45:35Z ERR MessageListener] at GitHub.Services.WebApi.VssHttpClientBase.SendAsync[T](HttpMethod method, IEnumerable`1 additionalHeaders, Guid locationId, Object routeValues, ApiResourceVersion version, HttpContent content, IEnumerable`1 queryParameters, Object userState, CancellationToken cancellationToken)
[RUNNER 2025-07-16 07:45:35Z ERR MessageListener] at GitHub.Runner.Listener.MessageListener.CreateSessionAsync(CancellationToken token)
[RUNNER 2025-07-16 07:45:35Z ERR MessageListener] Test oauth app registration.
[RUNNER 2025-07-16 07:45:35Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-07-16 07:45:35Z ERR GitHubActionsService] POST request to https://github.enterprise.example.com/_services/vstoken/_apis/oauth2/token/eb530d92-6032-4cac-8ece-acf7fa59845f failed. HTTP Status: BadRequest
[RUNNER 2025-07-16 07:45:35Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
[RUNNER 2025-07-16 07:45:35Z INFO MessageListener] Retriable exception: Registration was not found or is not medium trust. ClientType:
[RUNNER 2025-07-16 07:45:35Z INFO MessageListener] Sleeping for 30 seconds before retrying.
We are seeing this issue as well. It blocks other jobs from starting. 🥲
Hey @kpinarci,
We really need the controller log in this case in order to investigate the issue. I noticed that the image is custom-built. If the exit code for the runner container is not 0, the controller will restart the pod. Otherwise, it will simply remove the runner without restarting it.
The fact that 0.12.1 is restarting the pod indicates that the exit code is not 0. Please submit the controller log so we can search it for the affected pod.
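Something along these lines is usually enough (the namespace and deployment name here are assumptions; use whatever your gha-runner-scale-set-controller release is actually called, and substitute the name of an affected runner pod):

# Dump the controller log and filter it for the affected runner pod.
kubectl logs -n arc-systems deploy/arc-gha-rs-controller --since=24h \
  | grep "enterprise-gpr-m-runner-xxxxx"   # placeholder pod name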
Hey @nikola-jokic,
thanks for your response. I’m uploading the controller logs to my Gist — perhaps you could let me know what exactly you’re looking for in the logs so I can exclude the irrelevant lines up front.
I’ve also added our custom actions image for your reference. We try to keep the image as minimal as possible.
Looking forward to your feedback! If anything’s missing, just let me know — happy to provide more. Thanks in advance!
@kpinarci Do you mind sharing what your cronjob looks like? We have been having issues with blocked or delayed job queues as well. From our logs, as best we can tell, we are also seeing this token failure, but we're unsure whether it's the actual root cause.
Hey @surgiie,
No problem — I checked the runner container logs and saw this message: “Registration was not found or is not medium trust.” Looks like the pods weren’t being cleaned up, and that’s what the cronjob is handling now.
Let me know if you want more details – sharing is caring 😄
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: runner-cleaner
namespace: enterprise-gpr-m-02
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: runner-cleaner
namespace: enterprise-dependabot-m-02
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: runner-cleaner
namespace: enterprise-mkdocs-m-02
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: runner-cleaner
namespace: enterprise-gpr-generic-m-02
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: runner-cleaner-clusterrolebinding
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: runner-cleaner-clusterrole
subjects:
- kind: ServiceAccount
name: runner-cleaner
namespace: enterprise-gpr-m-02
- kind: ServiceAccount
name: runner-cleaner
namespace: enterprise-dependabot-m-02
- kind: ServiceAccount
name: runner-cleaner
namespace: enterprise-mkdocs-m-02
- kind: ServiceAccount
name: runner-cleaner
namespace: enterprise-gpr-generic-m-02
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: runner-cleaner-clusterrole
rules:
- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "list", "delete"]
- apiGroups: ["actions.github.com"]
resources: ["ephemeralrunners", "ephemeralrunners/status"]
verbs: ["get", "list", "delete", "watch"]
- apiGroups: ["management.cattle.io"]
resources: ["clusters"]
verbs: ["get", "list"]
- apiGroups: [""]
resources: ["pods/log", "pods"]
verbs: ["get", "list", "watch"]
---
kind: ConfigMap
apiVersion: v1
metadata:
name: runner-cleaner-failed-registration-runner
namespace: enterprise-gpr-m-02
data:
remove_failed_registration_runners.sh: |
#!/usr/bin/env bash
set -euo pipefail
# Namespace list
NAMESPACES=("enterprise-gpr-m-02" "enterprise-dependabot-m-02" "enterprise-mkdocs-m-02" "enterprise-gpr-generic-m-02")
for NAMESPACE in "${NAMESPACES[@]}"; do
echo "Checking namespace: $NAMESPACE"
PODS=$(kubectl get pods --namespace "$NAMESPACE" -l app.kubernetes.io/component=runner -o jsonpath='{.items[*].metadata.name}')
for pod in $PODS; do
if kubectl logs "$pod" --container runner --namespace "$NAMESPACE" | grep -E "Registration was not found or is not medium( trust)?\.?"; then
echo "🔥Pod $pod in namespace $NAMESPACE has a registration error - will be deleted"
kubectl delete pod "$pod" --namespace "$NAMESPACE"
else
echo "Error message not found in pod $pod"
fi
done
done
---
apiVersion: batch/v1
kind: CronJob
metadata:
name: cronjob-remove-failed-registration-runner
namespace: enterprise-gpr-m-02
labels:
app: arc-cleanup-registration-cronjob
env: prod
spec:
schedule: "*/5 * * * *"
successfulJobsHistoryLimit: 1
failedJobsHistoryLimit: 1
jobTemplate:
spec:
template:
spec:
serviceAccountName: runner-cleaner
containers:
- name: cronjob-remove-failed-registration-runner
image: ******/helm-kubectl/full:3.18.3
imagePullPolicy: IfNotPresent
command: ["/bin/bash", "/scripts/remove_failed_registration_runners.sh"]
resources:
requests:
cpu: "50m"
memory: "50Mi"
limits:
memory: "100Mi"
volumeMounts:
- name: scripts-volume
mountPath: "/scripts"
restartPolicy: OnFailure
volumes:
- name: scripts-volume
configMap:
name: runner-cleaner-failed-registration-runner
defaultMode: 0700
We're also encountering this. We have a cronjob which deletes runners older than one hour, but the "resolution" then only takes effect after an hour 🙈 Going to use this cronjob now; it's much better (faster) than ours ❤️
I observed this behavior as well. In my case, the runner registration in the GitHub organization is removed, but the ephemeral runner resource is still present. So even if a custom image is used and the exit code is not as expected, there is an inconsistency between the ephemeral runner resource and the runner registration in the GitHub organization.
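A quick way to see the mismatch from the cluster side (the namespace here is just an example; use your scale-set namespace):

# EphemeralRunner resources that ARC still tracks in this namespace.
# One that GitHub already reports as finished but that still shows up here
# (typically with its pod still Running) is exactly that inconsistency.
kubectl get ephemeralrunners -n enterprise-gpr-m-02
kubectl describe ephemeralrunner some-runner-name -n enterprise-gpr-m-02   # placeholder name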
Should the cronjob delete the backing ephemeralrunner? If you delete the pod, the ephemeralrunner seems to just spawn it again.
That's what our cron(s) do. We delete the ephemeralrunner, yes.
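For reference, a minimal sketch of that step, assuming the stuck pod is owned directly by its EphemeralRunner (which is how the ownerReference looks in our clusters); pod and namespace names below are placeholders:

# Resolve the owning EphemeralRunner from the stuck pod and delete it,
# so the controller does not simply respawn the same pod.
NAMESPACE="enterprise-gpr-m-02"
POD="enterprise-gpr-m-runner-xxxxx"

ER=$(kubectl get pod "$POD" -n "$NAMESPACE" \
     -o jsonpath='{.metadata.ownerReferences[0].name}')
kubectl delete ephemeralrunner "$ER" -n "$NAMESPACE"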
I can confirm that it works very well as a workaround.