
Pods stuck in Terminating state

Open · maggisha opened this issue 2 years ago · 20 comments

Checks

  • [X] I've already read https://github.com/actions/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I'm sure my issue is not covered in the troubleshooting guide.
  • [X] I'm not using a custom entrypoint in my runner image

Controller Version

summerwind/actions-runner-controller:v0.27.3

Helm Chart Version

actions-runner-controller-0.23.2

CertManager Version

v1.8.0

Deployment Method

Helm

cert-manager installation

Yes

Checks

  • [X] This isn't a question or user support case (for Q&A and community support, go to Discussions; it might also be a good idea to contract with any of the contributors and maintainers if your business is critical and you need priority support)
  • [X] I've read the release notes before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes
  • [X] My actions-runner-controller version (v0.x.y) does support the feature
  • [X] I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
  • [X] I've migrated to the workflow job webhook event (if you are using webhook-driven scaling)

Resource Definitions

apiVersion: v1
kind: Pod
metadata:
  annotations:
    actions-runner-controller/token-expires-at: "2023-10-17T17:42:53Z"
    actions-runner/github-api-creds-secret: softrams-github-secret
    actions-runner/id: "34087"
    actions-runner/runner-completion-wait-start-timestamp: "2023-10-17T17:02:42Z"
    actions-runner/unregistration-failure-message: Bad request - Runner "softrams-2m2v8-2m456"
      is still running a job"
    actions-runner/unregistration-start-timestamp: "2023-10-17T17:02:42Z"
    kubernetes.io/psp: eks.privileged
    sync-time: "2023-10-17T16:42:53Z"
  creationTimestamp: "2023-10-17T16:42:53Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2023-10-17T17:02:39Z"
  finalizers:
  - actions.summerwind.dev/runner-pod
  labels:
    actions-runner: ""
    actions-runner-controller/inject-registration-token: "true"
    pod-template-hash: 684c9c4dcf
    runner-deployment-name: softrams
    runner-template-hash: 695ddbd496
  name: softrams-2m2v8-2m456
  namespace: actions-runner-system
  ownerReferences:
  - apiVersion: actions.summerwind.dev/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Runner
    name: softrams-2m2v8-2m456
    uid: b35ffec9-0539-491f-a290-6c3ea16556aa
  resourceVersion: "227053726"
  uid: ace9801d-458f-4f71-b768-c71176fd56fd
spec:
  containers:
  - env:
    - name: RUNNER_ORG
      value: softrams
    - name: RUNNER_REPO
    - name: RUNNER_ENTERPRISE
    - name: RUNNER_LABELS
      value: self-hosted,linux,ubuntu-latest,ubuntu-18.04,ubuntu-20.04
    - name: RUNNER_GROUP
    - name: DOCKER_ENABLED
      value: "true"
    - name: DOCKERD_IN_RUNNER
      value: "false"
    - name: GITHUB_URL
      value: https://github.com/
    - name: RUNNER_WORKDIR
      value: /runner/_work
    - name: RUNNER_EPHEMERAL
      value: "true"
    - name: RUNNER_STATUS_UPDATE_HOOK
      value: "false"
    - name: GITHUB_ACTIONS_RUNNER_EXTRA_USER_AGENT
      value: actions-runner-controller/v0.27.3
    - name: DOCKER_HOST
      value: unix:///run/docker/docker.sock
    - name: RUNNER_NAME
      value: softrams-2m2v8-2m456
    - name: RUNNER_TOKEN
      value: A2IJC4U7WPPUATJAJ7L77FTFF3DZ3AVPNFXHG5DBNRWGC5DJN5XF62LEZYA2PBTSWFUW443UMFWGYYLUNFXW4X3UPFYGLN2JNZ2GKZ3SMF2GS33OJFXHG5DBNRWGC5DJN5XA
    image: summerwind/actions-runner:v2.303.0-ubuntu-22.04
    imagePullPolicy: Always
    name: runner
    resources: {}
    securityContext: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /runner
      name: runner
    - mountPath: /runner/_work
      name: work
    - mountPath: /run/docker
      name: docker-sock
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-nzwml
      readOnly: true
  - args:
    - dockerd
    - --host=unix:///run/docker/docker.sock
    - --group=$(DOCKER_GROUP_GID)
    - --registry-mirror=http://docker-registry.docker-registry:5000
    env:
    - name: DOCKER_GROUP_GID
      value: "121"
    image: docker:23.0.5-dind
    imagePullPolicy: IfNotPresent
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - timeout "${RUNNER_GRACEFUL_STOP_TIMEOUT:-15}" /bin/sh -c "echo 'Prestop
            hook started'; while [ -f /runner/.runner ]; do sleep 1; done; echo 'Waiting
            for dockerd to start'; while ! pgrep -x dockerd; do sleep 1; done; echo
            'Prestop hook stopped'" >/proc/1/fd/1 2>&1
    name: docker
    resources: {}
    securityContext:
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /runner
      name: runner
    - mountPath: /run/docker
      name: docker-sock
    - mountPath: /runner/_work
      name: work
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-nzwml
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: regcred
  nodeName: ip-172-16-2-67.ec2.internal
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - emptyDir: {}
    name: runner
  - emptyDir: {}
    name: work
  - emptyDir:
      medium: Memory
      sizeLimit: 1M
    name: docker-sock
  - name: kube-api-access-nzwml
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-10-17T16:42:53Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-10-17T16:43:09Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-10-17T16:43:09Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-10-17T16:42:53Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://6b20ac8c94f013a8820d46e5391729ff8206731aaf5d60c90fd344d3f20cc4ff
    image: docker.io/library/docker:23.0.5-dind
    imageID: docker.io/library/docker@sha256:f23f0a4013f184f6af3a3892dd12eba74bdbc5988d2a54ae468a8a6a44078434
    lastState: {}
    name: docker
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2023-10-17T16:43:08Z"
  - containerID: containerd://3a42625e2e9dd94b0c5a3f597bee4a953c37382e1601a6fa3332d04bf4cb90ac
    image: docker.io/summerwind/actions-runner:v2.303.0-ubuntu-22.04
    imageID: docker.io/summerwind/actions-runner@sha256:90bac4a220c9a5b501d822a1e59b22f842a32c4e4d72b82ff9ea955a2ef7fbe2
    lastState: {}
    name: runner
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2023-10-17T16:43:07Z"
  hostIP: 172.16.2.67
  phase: Running
  podIP: 172.16.2.72
  podIPs:
  - ip: 172.16.2.72
  qosClass: BestEffort
  startTime: "2023-10-17T16:42:53Z"

To Reproduce

Terminate a node that is running the pod.
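For example, on EKS, something along these lines (the instance ID is a placeholder):

# Find the node running the runner pod
kubectl get pod softrams-2m2v8-2m456 -n actions-runner-system -o jsonpath='{.spec.nodeName}'
# Terminate the backing EC2 instance out from under the cluster
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0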

Describe the bug

When the node is removed/terminated, the runner pod gets stuck in the Terminating state. The pod's restartPolicy is set to Never. Each container inside the pod is still reported as running even though the node has been removed completely, so the finalizer assumes the runner container is still running and never removes the runner pod, which stays stuck in Terminating.
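
A quick way to list the affected pods (illustrative; assumes jq is installed):

kubectl get pods -n actions-runner-system -o json \
  | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | "\(.metadata.name) \(.spec.nodeName)"'

Every pod listed here whose node no longer appears in kubectl get nodes is in this state.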

Describe the expected behavior

Pods stuck in the Terminating state should be cleaned up gracefully.

Whole Controller Logs

2023-10-20T12:04:13Z    INFO    runnerpod       Runner pod is annotated to wait for completion, and the runner container is not restarting      {"runnerpod": "actions-runner-system/softrams-2m2v8-9t6zs"}
2023-10-20T12:04:13Z    INFO    runnerpod       Runner pod is annotated to wait for completion, and the runner container is not restarting      {"runnerpod": "actions-runner-system/softrams-2m2v8-q2t9w"}
2023-10-20T12:04:13Z    INFO    runnerpod       Runner pod is annotated to wait for completion, and the runner container is not restarting      {"runnerpod": "actions-runner-system/softrams-2m2v8-zkcp5"}
2023-10-20T12:04:13Z    INFO    runnerpod       Runner pod is annotated to wait for completion, and the runner container is not restarting      {"runnerpod": "actions-runner-system/softrams-2m2v8-75r9n"}
2023-10-20T12:04:13Z    INFO    runnerpod       Runner pod is annotated to wait for completion, and the runner container is not restarting      {"runnerpod": "actions-runner-system/softrams-2m2v8-8mgzs"}
2023-10-20T12:04:13Z    INFO    runnerpod       Runner pod is annotated to wait for completion, and the runner container is not restarting      {"runnerpod": "actions-runner-system/softrams-2m2v8-r446l"}
2023-10-20T12:04:14Z    INFO    runnerpod       Runner pod is annotated to wait for completion, and the runner container is not restarting      {"runnerpod": "actions-runner-system/softrams-2m2v8-9fbpz"}
2023-10-20T12:04:14Z    INFO    runnerpod       Runner pod is annotated to wait for completion, and the runner container is not restarting      {"runnerpod": "actions-runner-system/softrams-2m2v8-fmfcf"}
2023-10-20T12:04:14Z    INFO    runnerpod       Runner pod is annotated to wait for completion, and the runner container is not restarting      {"runnerpod": "actions-runner-system/softrams-2m2v8-8l7cn"}
2023-10-20T12:04:14Z    INFO    runnerpod       Runner pod is annotated to wait for completion, and the runner container is not restarting      {"runnerpod": "actions-runner-system/softrams-2m2v8-ptdd9"}
2023-10-20T12:04:14Z    INFO    runnerpod       Runner pod is annotated to wait for completion, and the runner container is not restarting      {"runnerpod": "actions-runner-system/softrams-2m2v8-j8v8s"}
2023-10-20T12:04:14Z    INFO    runnerpod       Runner pod is annotated to wait for completion, and the runner container is not restarting      {"runnerpod": "actions-runner-system/softrams-2m2v8-f9lpf"}
2023-10-20T12:04:14Z    INFO    runnerpod       Runner pod is annotated to wait for completion, and the runner container is not restarting      {"runnerpod": "actions-runner-system/softrams-2m2v8-6l8bc"}
2023-10-20T12:04:14Z    INFO    runnerpod       Runner pod is annotated to wait for completion, and the runner container is not restarting      {"runnerpod": "actions-runner-system/softrams-2m2v8-rz595"}
2023-10-20T12:04:14Z    INFO    runnerpod       Runner pod is annotated to wait for completion, and the runner container is not restarting      {"runnerpod": "actions-runner-system/softrams-2m2v8-plq4r"}
2023-10-20T12:04:14Z    INFO    runnerpod       Runner pod is annotated to wait for completion, and the runner container is not restarting      {"runnerpod": "actions-runner-system/softrams-2m2v8-qf4hd"}
2023-10-20T12:04:14Z    INFO    runnerpod       Runner pod is annotated to wait for completion, and the runner container is not restarting      {"runnerpod": "actions-runner-system/softrams-2m2v8-9t6zs"}
2023-10-20T12:04:14Z    INFO    runnerpod       Runner pod is annotated to wait for completion, and the runner container is not restarting      {"runnerpod": "actions-runner-system/softrams-2m2v8-9l58l"}
2023-10-20T12:04:14Z    INFO    runnerpod       Runner pod is annotated to wait for completion, and the runner container is not restarting      {"runnerpod": "actions-runner-system/softrams-2m2v8-2m456"}
2023-10-20T12:04:14Z    INFO    runnerpod       Runner pod is annotated to wait for completion, and the runner container is not restarting      {"runnerpod": "actions-runner-system/softrams-2m2v8-znjzc"}
2023-10-20T12:04:14Z    INFO    runnerpod       Runner pod is annotated to wait for completion, and the runner container is not restarting      {"runnerpod": "actions-runner-system/softrams-2m2v8-lvfvk"}
2023-10-20T12:04:14Z    INFO    runnerpod       Runner pod is annotated to wait for completion, and the runner container is not restarting      {"runnerpod": "actions-runner-system/softrams-2m2v8-q2t9w"}
2023-10-20T12:04:14Z    INFO    runnerpod       Runner pod is annotated to wait for completion, and the runner container is not restarting      {"runnerpod": "actions-runner-system/softrams-2m2v8-zkcp5"}
2023-10-20T12:04:14Z    INFO    runnerpod       Runner pod is annotated to wait for completion, and the runner container is not restarting      {"runnerpod": "actions-runner-system/softrams-2m2v8-75r9n"}
2023-10-20T12:04:14Z    INFO    runnerpod       Runner pod is annotated to wait for completion, and the runner container is not restarting      {"runnerpod": "actions-runner-system/softrams-2m2v8-8mgzs"}
2023-10-20T12:04:14Z    INFO    runnerpod       Runner pod is annotated to wait for completion, and the runner container is not restarting      {"runnerpod": "actions-runner-system/softrams-2m2v8-r446l"}
2023-10-20T12:04:15Z    INFO    runnerpod       Runner pod is annotated to wait for completion, and the runner container is not restarting      {"runnerpod": "actions-runner-system/softrams-2m2v8-8l7cn"}

Whole Runner Pod Logs

Defaulted container "runner" out of: runner, docker
Error from server (NotFound): pods "ip-172-16-2-67.ec2.internal" not found

Additional Context

No response

maggisha avatar Oct 20 '23 14:10 maggisha

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

github-actions[bot] avatar Oct 20 '23 14:10 github-actions[bot]

I have just observed the same behaviour :)

I have the runner set configured with minRunners: 3 and maxRunners: 27.

I have a custom "homelab" k8s cluster.

One of the nodes went offline so the pod was left "hanging".

NAME                                READY   STATUS        RESTARTS   AGE
arc-runner-set-k4t2m-runner-dhvfm   1/1     Terminating   0          4h26m
arc-runner-set-k4t2m-runner-mr7qx   1/1     Running       0          12m
arc-runner-set-k4t2m-runner-v2958   1/1     Running       0          9m33s

The ARC controller will consider the "hanging" pod as active and will not schedule a new runner.

In my case, I have 3 active GHA jobs. 2 of the jobs are executed by GHA, but the 3rd job is waiting for an available runner... and ARC will not add a new runner :)

Maybe if a pod is in "Terminating" state, it should not count as available/active/running.

adiroiban avatar Oct 20 '23 23:10 adiroiban

I tried a manual / force delete of the pod


$ kubectl delete pod arc-runner-set-k4t2m-runner-dhvfm --grace-period=0 --force -n arc-runners

... but the controller is still stuck

2023-10-21T13:20:02Z INFO EphemeralRunner Waiting for ephemeral runner owned resources to be deleted {"ephemeralrunner": "arc-runners/arc-runner-set-k4t2m-runner-dhvfm"}


When I was checking the ephemeralrunners.actions.github.com CRDs, I could see that the one for my stuck pod was missing... and there were other definitions for pods that no longer exist.

I manually removed the extra CRD objects and ARC works again.


Somehow they got out of sync.

I have only just started using ARC.

Not sure how often this can happen.

I was expecting ARC to have some sort of "self-healing" process that runs every once in a while and checks that there are no CRD objects older than 5 minutes for pods that don't exist.
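
Something along these lines would probably be enough as a stop-gap cron job (illustrative sketch; namespace assumed to be arc-runners, jq required):

# Force-delete Terminating runner pods whose node no longer exists
for pod in $(kubectl get pods -n arc-runners -o json \
    | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | .metadata.name'); do
  node=$(kubectl get pod "$pod" -n arc-runners -o jsonpath='{.spec.nodeName}')
  if ! kubectl get node "$node" >/dev/null 2>&1; then
    kubectl delete pod "$pod" -n arc-runners --grace-period=0 --force
  fi
done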

adiroiban avatar Oct 21 '23 13:10 adiroiban

I still see the pods stuck in Terminating state - 53 of them - some of them have been stuck for more than a week! The nodes to which these pods were attached are all gone!


maggisha avatar Oct 23 '23 18:10 maggisha

If you want to recover this cluster, I think you will need to manually force-delete the pods.

And then look for the ephemeralrunners.actions.github.com CRD objects and also delete them.
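
Something like this (the object name is illustrative):

kubectl get ephemeralrunners.actions.github.com -n arc-runners
kubectl delete ephemeralrunners.actions.github.com <orphaned-name> -n arc-runners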

adiroiban avatar Oct 23 '23 22:10 adiroiban

Well, we cleaned it up last time and it still keeps popping up.

Also, I don't see any ephemeralrunners.actions.github.com CRD objects.


maggisha avatar Oct 26 '23 13:10 maggisha

We are having the same problem. Issues #1369 and #236 were both opened for similar problems, and #1369 kind of tapered off -- we see pods stuck in Terminating for days. Removing the finalizer doesn't always fix it; we still have to force-delete them sometimes. There are no ephemeralrunners.actions.github.com CRDs.

stockmaj avatar Nov 09 '23 20:11 stockmaj

Seeing this issue as well, with hundreds of Terminating pods. Removing the finalizers and force-deleting the pods removes them, but why is this happening?

mosheavni avatar Mar 17 '24 09:03 mosheavni

Same here: pods stuck in Terminating with phase: Running, and the nodes don't even exist anymore (spot instances).

Controller log in a loop: 2024-05-09T20:23:03Z INFO runnerpod Unregistration started before runner obtains ID. Waiting for the registration timeout to elapse, or the runner to obtain ID, or the runner pod to stop {"runnerpod": "actions-runner-controller/myrunnerpod-name-shr-rd-ggdrs-dqqwx", "registrationTimeout": "10m0s"}

to "fix" this issue, at least kill the pod, it's necessary to remove the finalizer. as it is in terminating state, it supress the autoscaling and doesn't run queued jobs.

luizbossoi avatar May 09 '24 20:05 luizbossoi

I'm still seeing this issue.

joaoluiznaufel avatar May 15 '24 13:05 joaoluiznaufel

Ditto

stockmaj avatar Jun 05 '24 15:06 stockmaj

any update on this?

shichengripple001 avatar Jul 11 '24 02:07 shichengripple001

For me, patching the finalizer off the ephemeralrunners.actions.github.com objects fixed the stuck Terminating pods in the arc-runners namespace:

kubectl get ephemeralrunners.actions.github.com -n arc-runners -o name \
  | xargs -I {} kubectl patch {} -n arc-runners \
  -p '{"metadata":{"finalizers":null}}' --type=merge

I think it wasn't finalizing for this reason:

Failed to create the pod: pods "arc-runner-set-2pkbf-runner-h8h7g" is forbidden: violates PodSecurity "baseline:latest": privileged (container "dind" must not set securityContext.privileged=true)

I was trying to nuke the namespace and rebuild it with the permissions it needed. A repro might be to have an unprivileged namespace and:

# https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/deploying-runner-scale-sets-with-actions-runner-controller#using-docker-in-docker-or-kubernetes-mode-for-containers
containerMode:
  type: "dind" # type can be set to dind or kubernetes

willbush avatar Jul 11 '24 03:07 willbush

For me it was necessary to remove the runner finalizer:

kubectl patch runner <runner-name> -p '{"metadata": {"finalizers": null}}' --type merge

PS: without --type merge I got

error: application/strategic-merge-patch+json is not supported by actions.summerwind.dev/v1alpha1, Kind=Runner: the body of the request was in an unknown format - accepted media types include: application/json-patch+json, application/merge-patch+json, application/apply-patch+yaml

augustovictor avatar Jul 24 '24 09:07 augustovictor