
Runners not terminating after job completion – blocked queue due to token expiry (v0.12.1)

Open kpinarci opened this issue 5 months ago • 9 comments

Checks

  • [x] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
  • [x] I am using charts that are officially provided

Controller Version

0.12.1

Deployment Method

Helm

Checks

  • [x] This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • [x] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

Project: Actions-Runner-Controller (not Summerwind)
Version: 0.12.1
Deployment Method: Helm
Kubernetes Version: v1.32.5
---
1.	Deploy Actions-Runner-Controller version 0.11.0 via Helm on a Kubernetes cluster (v1.32.5) with GitHub Enterprise integration.
2.	Verify that runners operate correctly under normal load.
3.	Upgrade to version 0.12.1 by fully removing all ARC-related resources, including CustomResourceDefinitions (CRDs), and perform a clean installation using Helm.
4.	Reconfigure and deploy runners as before.
5.	Execute various GitHub Actions workflows across multiple repositories.
6.	After some time, observe that:
	•	Certain jobs appear completed or failed on GitHub Enterprise.
	•	Some runner pods remain active indefinitely and do not exit.
	•	Logs within those pods show repeated registration failures with messages like:
"Registration was not found or is not medium trust."
	•	The issue affects different runners at different times with no identifiable pattern (i.e., across various repos and workflows).
7.	As a result, the runner pool becomes blocked, and new jobs are not executed until affected pods are manually terminated.
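For reference, affected pods can be spotted with a quick shell loop like the sketch below. This is only an illustration; the namespace is a placeholder and the label selector assumes ARC's default runner-pod labels.

```shell
#!/usr/bin/env bash
# Sketch: list runner pods whose logs contain the token-expiry error.
# Namespace and label selector are assumptions; adjust to your setup.
find_stuck_runners() {
  local ns="$1"
  local pods
  pods=$(kubectl get pods -n "$ns" \
    -l app.kubernetes.io/component=runner \
    -o jsonpath='{.items[*].metadata.name}')
  for pod in $pods; do
    if kubectl logs "$pod" -c runner -n "$ns" \
        | grep -q "Registration was not found or is not medium trust"; then
      echo "stuck: $pod"
    fi
  done
}
# usage: find_stuck_runners enterprise-gpr-m-02
```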

Describe the bug

Hello team,

After upgrading to ARC version 0.11.0, we noticed that some runners enter a state where they run indefinitely and block new jobs from being picked up. Inside the runner containers, we observed registration failures due to expired tokens.

On the GitHub Enterprise side, those jobs appear to have already completed or failed, but the corresponding runners keep running. It seems that the listener is unable to properly clean up the runner after a job finishes and continuously attempts to re-register it with GitHub.

We were hoping this issue would be resolved in version 0.12.1, but unfortunately, it still persists. In one instance, a pod even ended up in an evicted state.

As a temporary workaround to prevent the job queue from stalling, we’ve implemented a cron job that monitors runner logs and forcefully terminates any pod whose log contains: "Registration was not found or is not medium trust." This keeps the runners processing jobs but doesn’t address the root cause.

Is this a known issue, and do you have any recommendations or a potential fix?

Describe the expected behavior

Runners should terminate properly after job completion or failure. They should not attempt to re-register if the job has already ended and the registration token has expired. Additionally, the controller should ensure that expired or stuck runners are cleaned up automatically to avoid blocking the job queue.

Additional Context

githubConfigUrl: "https://github.enterprise.example.com/enterprises/***"
githubConfigSecret: "github-token"
proxy:
  http:
    url: http://**********
  https:
    url: http://**********
  noProxy:
    - localhost
    - 127.0.0.1
    - 10.0.0.0/8
    - 172.16.0.0/12
    - 192.168.0.0/16
maxRunners: 5
minRunners: 1
runnerGroup: "enterprise-gpr-m-02"
runnerScaleSetName: "enterprise-gpr-m"
labels:
  group: enterprise-runners
githubServerTLS:
  certificateFrom:
    configMapKeyRef:
      name: ca
      key: ca.crt
  runnerMountPath: /usr/local/share/ca-certificates/
template:
  spec:
    initContainers:
      - name: init-dind-externals
        image: actions/actions-runner/full:2.322.1
        imagePullPolicy: Always
        command: ["cp", "-r", "-v", "/home/runner/externals/.", "/home/runner/tmpDir/"]
        volumeMounts:
          - name: dind-externals
            mountPath: /home/runner/tmpDir
        resources:
          requests:
            cpu: "50m"
            memory: "200Mi"
          limits:
            memory: "250Mi"
      - name: init-dind-rootless
        image: docker:27.3.1-dind-rootless
        imagePullPolicy: IfNotPresent
        command:
          - sh
          - -c
          - |
            set -x
            cp -a /etc/. /dind-etc/
            echo 'runner:x:1001:1001:runner:/home/runner:/bin/ash' >> /dind-etc/passwd
            echo 'runner:x:1001:' >> /dind-etc/group
            echo 'runner:100000:65536' >> /dind-etc/subgid
            echo 'runner:100000:65536' >> /dind-etc/subuid
            chmod 755 /dind-etc
            chmod u=rwx,g=rx+s,o=rx /dind-home
            chown 1001:1001 /dind-home
            mkdir -p /var/lib/docker
            chmod u=rwx,g=rx+s,o=rx /var/lib/docker
            chown -R 1001:1001 /var/lib/docker
        securityContext:
          runAsUser: 0
        volumeMounts:
          - mountPath: /dind-etc
            name: dind-etc
          - mountPath: /dind-home
            name: dind-home
          - name: docker-data-root
            mountPath: /var/lib/docker
        resources:
          requests:
            cpu: "50m"
            memory: "200Mi"
          limits:
            memory: "250Mi"
      - name: init-qemu-registrar
        image: tonistiigi/binfmt:latest
        command: [ "/usr/bin/binfmt", "--install", "all" ]
        imagePullPolicy: Always
        securityContext:
          runAsUser: 0
          privileged: true
        resources:
          requests:
            cpu: "25m"
            memory: "50Mi"
          limits:
            memory: "100Mi"
    containers:
      - name: runner
        image: actions/actions-runner/full:2.322.1
        imagePullPolicy: Always
        command: ["/home/runner/run.sh"]
        env:
          - name: DOCKER_HOST
            value: unix:///run/user/1001/docker.sock
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /var/run
          - mountPath: /tmp
            name: tmpdir
          - name: sysfs
            mountPath: /sys
            readOnly: false
        resources:
          requests:
            cpu: "100m"
            memory: "500Mi"
          limits:
            memory: "500Mi"
        securityContext:
          capabilities:
            add:
              - SYS_ADMIN
              - SYS_PTRACE
              - DAC_OVERRIDE
              - FOWNER
              - CHOWN
              - SETUID
              - SETGID
          runAsUser: 1001
          runAsGroup: 1001
          privileged: false
      - name: dind
        image: docker:27.3.1-dind-rootless
        imagePullPolicy: IfNotPresent
        args:
          - dockerd
          - --config-file=/etc/docker/daemon.json
        securityContext:
          privileged: true
          runAsUser: 1001
          runAsGroup: 1001
          capabilities:
            add:
              - SYS_ADMIN
              - MKNOD
              - CHOWN
              - SETUID
              - SETGID
        resources:
          requests:
            cpu: "200m"
            memory: "650Mi"
          limits:
            memory: "3346Mi"
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /var/run
          - name: dind-externals
            mountPath: /home/runner/externals
          - name: dind-etc
            mountPath: /etc
          - name: dind-home
            mountPath: /home/runner
          - name: docker-data-root
            mountPath: /var/lib/docker
          - name: sysfs
            mountPath: /sys
            readOnly: false
    volumes:
      - name: work
        emptyDir: {}
      - name: dind-externals
        emptyDir: {}
      - name: dind-sock
        emptyDir: {}
      - name: dind-etc
        emptyDir: {}
      - name: dind-home
        emptyDir: {}
      - name: tmpdir
        emptyDir: {}
      - name: docker-data-root
        emptyDir: {}
      - name: sysfs
        hostPath:
          path: /sys
          type: Directory

Controller Logs

-

Runner Pod Logs

√ Connected to GitHub
[RUNNER 2025-07-16 07:45:34Z INFO Terminal] WRITE LINE: 

[RUNNER 2025-07-16 07:45:34Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-07-16 07:45:35Z ERR  GitHubActionsService] POST request to https://github.enterprise.example.com/_services/vstoken/_apis/oauth2/token/eb530d92-6032-4cac-8ece-acf7fa59845f failed. HTTP Status: BadRequest
[RUNNER 2025-07-16 07:45:35Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
[RUNNER 2025-07-16 07:45:35Z ERR  MessageListener] Catch exception during create session.
[RUNNER 2025-07-16 07:45:35Z ERR  MessageListener] GitHub.Services.OAuth.VssOAuthTokenRequestException: Registration was not found or is not medium trust. ClientType: 
[RUNNER 2025-07-16 07:45:35Z ERR  MessageListener]    at GitHub.Services.OAuth.VssOAuthTokenProvider.OnGetTokenAsync(IssuedToken failedToken, CancellationToken cancellationToken)
[RUNNER 2025-07-16 07:45:35Z ERR  MessageListener]    at GitHub.Services.Common.IssuedTokenProvider.GetTokenOperation.GetTokenAsync(VssTraceActivity traceActivity)
[RUNNER 2025-07-16 07:45:35Z ERR  MessageListener]    at GitHub.Services.Common.IssuedTokenProvider.GetTokenAsync(IssuedToken failedToken, CancellationToken cancellationToken)
[RUNNER 2025-07-16 07:45:35Z ERR  MessageListener]    at GitHub.Services.Common.VssHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
[RUNNER 2025-07-16 07:45:35Z ERR  MessageListener]    at GitHub.Services.Common.VssHttpRetryMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
[RUNNER 2025-07-16 07:45:35Z ERR  MessageListener]    at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
[RUNNER 2025-07-16 07:45:35Z ERR  MessageListener]    at GitHub.Services.WebApi.VssHttpClientBase.SendAsync(HttpRequestMessage message, HttpCompletionOption completionOption, Object userState, CancellationToken cancellationToken)
[RUNNER 2025-07-16 07:45:35Z ERR  MessageListener]    at GitHub.Services.WebApi.VssHttpClientBase.SendAsync[T](HttpRequestMessage message, Object userState, CancellationToken cancellationToken)
[RUNNER 2025-07-16 07:45:35Z ERR  MessageListener]    at GitHub.Services.WebApi.VssHttpClientBase.SendAsync[T](HttpMethod method, IEnumerable`1 additionalHeaders, Guid locationId, Object routeValues, ApiResourceVersion version, HttpContent content, IEnumerable`1 queryParameters, Object userState, CancellationToken cancellationToken)
[RUNNER 2025-07-16 07:45:35Z ERR  MessageListener]    at GitHub.Runner.Listener.MessageListener.CreateSessionAsync(CancellationToken token)
[RUNNER 2025-07-16 07:45:35Z ERR  MessageListener] Test oauth app registration.
[RUNNER 2025-07-16 07:45:35Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-07-16 07:45:35Z ERR  GitHubActionsService] POST request to https://github.enterprise.example.com/_services/vstoken/_apis/oauth2/token/eb530d92-6032-4cac-8ece-acf7fa59845f failed. HTTP Status: BadRequest
[RUNNER 2025-07-16 07:45:35Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
[RUNNER 2025-07-16 07:45:35Z INFO MessageListener] Retriable exception: Registration was not found or is not medium trust. ClientType: 
[RUNNER 2025-07-16 07:45:35Z INFO MessageListener] Sleeping for 30 seconds before retrying.

kpinarci avatar Jul 21 '25 15:07 kpinarci

We are seeing this issue as well. It blocks other jobs from starting. 🥲

surgiie avatar Jul 23 '25 10:07 surgiie

Hey @kpinarci,

We really need the controller log in this case in order to investigate the issue. I noticed that the image is custom-built. If the exit code for the runner container is not 0, the controller will restart the pod. Otherwise, it will simply remove the runner without restarting it.

The fact that 0.12.1 is restarting the pod indicates that the exit code is not 0. Please submit the controller log so we can search for the pod affected by this.
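(For reference, the runner container's last exit code can be read off the pod status with something like the sketch below; namespace and pod name are placeholders.)

```shell
#!/usr/bin/env bash
# Sketch: print the runner container's last terminated exit code.
# A non-zero code is what makes the controller restart the pod instead
# of simply removing the runner.
runner_exit_code() {
  local ns="$1" pod="$2"
  kubectl get pod "$pod" -n "$ns" -o \
    jsonpath='{.status.containerStatuses[?(@.name=="runner")].lastState.terminated.exitCode}'
}
# usage: runner_exit_code enterprise-gpr-m-02 <runner-pod-name>
```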

nikola-jokic avatar Jul 23 '25 20:07 nikola-jokic

Hey @nikola-jokic,

thanks for your response. I’m uploading the controller logs to my Gist — perhaps you could let me know what exactly you’re looking for in the logs so I can exclude the irrelevant lines up front.

I’ve also added our custom actions image for your reference. We try to keep the image as minimal as possible.

Looking forward to your feedback! If anything’s missing, just let me know — happy to provide more. Thanks in advance!

kpinarci avatar Jul 24 '25 09:07 kpinarci

@kpinarci Do you mind sharing what your cronjob looks like? We have been having issues with blocked or delayed queues for jobs; from our logs, best we can tell, we are seeing this token failure as well, but we're unsure if that's the actual root cause.

surgiie avatar Jul 28 '25 18:07 surgiie

Hey @surgiie,

No problem — I checked the runner container logs and saw this message: “Registration was not found or is not medium trust.” Looks like the pods weren’t being cleaned up, and that’s what the cronjob is handling now.

Let me know if you want more details – sharing is caring 😄

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: runner-cleaner
  namespace: enterprise-gpr-m-02
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: runner-cleaner
  namespace: enterprise-dependabot-m-02
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: runner-cleaner
  namespace: enterprise-mkdocs-m-02
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: runner-cleaner
  namespace: enterprise-gpr-generic-m-02
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: runner-cleaner-clusterrolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: runner-cleaner-clusterrole
subjects:
- kind: ServiceAccount
  name: runner-cleaner
  namespace: enterprise-gpr-m-02
- kind: ServiceAccount
  name: runner-cleaner
  namespace: enterprise-dependabot-m-02
- kind: ServiceAccount
  name: runner-cleaner
  namespace: enterprise-mkdocs-m-02
- kind: ServiceAccount
  name: runner-cleaner
  namespace: enterprise-gpr-generic-m-02
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: runner-cleaner-clusterrole
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "delete"]
- apiGroups: ["actions.github.com"]
  resources: ["ephemeralrunners", "ephemeralrunners/status"]
  verbs: ["get", "list", "delete", "watch"]
- apiGroups: ["management.cattle.io"]
  resources: ["clusters"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/log", "pods"]
  verbs: ["get", "list", "watch"]
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: runner-cleaner-failed-registration-runner
  namespace: enterprise-gpr-m-02
data:
  remove_failed_registration_runners.sh: |
    #!/usr/bin/env bash

    set -euo pipefail

    # Namespace list
    NAMESPACES=("enterprise-gpr-m-02" "enterprise-dependabot-m-02" "enterprise-mkdocs-m-02" "enterprise-gpr-generic-m-02")

    for NAMESPACE in "${NAMESPACES[@]}"; do
      echo "Checking namespace: $NAMESPACE"

      PODS=$(kubectl get pods --namespace "$NAMESPACE" -l app.kubernetes.io/component=runner -o jsonpath='{.items[*].metadata.name}')
      
      for pod in $PODS; do
        if kubectl logs "$pod" --container runner --namespace "$NAMESPACE" | grep -E "Registration was not found or is not medium( trust)?\.?"; then
          echo "🔥Pod $pod in namespace $NAMESPACE has a registration error - will be deleted"
          kubectl delete pod "$pod" --namespace "$NAMESPACE"
        else
          echo "Error message not found in pod $pod"
        fi
      done
    done
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cronjob-remove-failed-registration-runner
  namespace: enterprise-gpr-m-02
  labels:
    app: arc-cleanup-registration-cronjob
    env: prod
spec:
  schedule: "*/5 * * * *"
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: runner-cleaner
          containers:
          - name: cronjob-remove-failed-registration-runner
            image: ******/helm-kubectl/full:3.18.3
            imagePullPolicy: IfNotPresent
            command: ["/bin/bash", "/scripts/remove_failed_registration_runners.sh"]
            resources:
              requests:
                cpu: "50m"
                memory: "50Mi"
              limits:
                memory: "100Mi"
            volumeMounts:
              - name: scripts-volume
                mountPath: "/scripts"
          restartPolicy: OnFailure
          volumes:
            - name: scripts-volume
              configMap:
                name: runner-cleaner-failed-registration-runner
                defaultMode: 0700

kpinarci avatar Jul 29 '25 08:07 kpinarci

We're also encountering this. We have a cronjob which deletes runners older than 1 hour, but that "resolution" only kicks in after an hour 🙈 Going to use this cronjob now, it's much much better (faster) than ours ❤️

genisd avatar Aug 11 '25 09:08 genisd

I observed this behavior as well. My observation is that the runner registration in the GitHub organization is removed, but the ephemeral runner resource is still present. So I think even if a custom image is used and the exit code is not as expected, there is an inconsistency between the ephemeral runner resource and the runner registration in the GitHub organization.
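One way to check for that inconsistency is to diff the EphemeralRunner resources against the runners GitHub still knows about. A rough sketch (namespace and enterprise slug are placeholders; assumes `gh` is authenticated against the GHES instance):

```shell
#!/usr/bin/env bash
# Sketch: print EphemeralRunner resources that have no matching runner
# registration on the GitHub side. Both inputs are sorted name lists;
# comm -23 keeps lines unique to the first list (the k8s side).
list_orphaned_ephemeralrunners() {
  local ns="$1" enterprise="$2"
  comm -23 \
    <(kubectl get ephemeralrunners -n "$ns" \
        -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n' | sort) \
    <(gh api "enterprises/${enterprise}/actions/runners" \
        --paginate --jq '.runners[].name' | sort)
}
# usage: list_orphaned_ephemeralrunners enterprise-gpr-m-02 <enterprise-slug>
```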

RaphyFischer avatar Aug 21 '25 07:08 RaphyFischer

Should the cronjob delete the backing ephemeralrunner? If you delete the pod, the ephemeralrunner seems to just spawn it again.

engnatha avatar Sep 10 '25 21:09 engnatha

That's what our cron(s) do: we delete the ephemeralrunner, yes. I can affirm that it works very well as a workaround.
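In sketch form, the deletion step looks like this. It assumes the EphemeralRunner and its pod share a name, which matches ARC's naming today but is worth verifying in your cluster:

```shell
#!/usr/bin/env bash
# Sketch: delete the owning EphemeralRunner instead of the bare pod, so the
# controller doesn't just respawn the same broken runner.
delete_stuck_ephemeralrunner() {
  local ns="$1" name="$2"
  kubectl delete ephemeralrunner "$name" -n "$ns"
}
# usage: delete_stuck_ephemeralrunner enterprise-gpr-m-02 <runner-name>
```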

genisd avatar Sep 10 '25 22:09 genisd