actions-runner-controller
Pods stuck in Terminating state
Checks
- [X] I've already read https://github.com/actions/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I'm sure my issue is not covered in the troubleshooting guide.
- [X] I'm not using a custom entrypoint in my runner image
Controller Version
summerwind/actions-runner-controller:v0.27.3
Helm Chart Version
actions-runner-controller-0.23.2
CertManager Version
v1.8.0
Deployment Method
Helm
cert-manager installation
Yes
Checks
- [X] This isn't a question or user support case (for Q&A and community support, go to Discussions; it might also be a good idea to contract with any of the contributors and maintainers if your business is critical and you need priority support)
- [X] I've read the release notes before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
- [X] My actions-runner-controller version (v0.x.y) does support the feature
- [X] I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
- [X] I've migrated to the workflow job webhook event (if you are using webhook-driven scaling)
Resource Definitions
apiVersion: v1
kind: Pod
metadata:
  annotations:
    actions-runner-controller/token-expires-at: "2023-10-17T17:42:53Z"
    actions-runner/github-api-creds-secret: softrams-github-secret
    actions-runner/id: "34087"
    actions-runner/runner-completion-wait-start-timestamp: "2023-10-17T17:02:42Z"
    actions-runner/unregistration-failure-message: Bad request - Runner "softrams-2m2v8-2m456" is still running a job"
    actions-runner/unregistration-start-timestamp: "2023-10-17T17:02:42Z"
    kubernetes.io/psp: eks.privileged
    sync-time: "2023-10-17T16:42:53Z"
  creationTimestamp: "2023-10-17T16:42:53Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2023-10-17T17:02:39Z"
  finalizers:
  - actions.summerwind.dev/runner-pod
  labels:
    actions-runner: ""
    actions-runner-controller/inject-registration-token: "true"
    pod-template-hash: 684c9c4dcf
    runner-deployment-name: softrams
    runner-template-hash: 695ddbd496
  name: softrams-2m2v8-2m456
  namespace: actions-runner-system
  ownerReferences:
  - apiVersion: actions.summerwind.dev/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Runner
    name: softrams-2m2v8-2m456
    uid: b35ffec9-0539-491f-a290-6c3ea16556aa
  resourceVersion: "227053726"
  uid: ace9801d-458f-4f71-b768-c71176fd56fd
spec:
  containers:
  - env:
    - name: RUNNER_ORG
      value: softrams
    - name: RUNNER_REPO
    - name: RUNNER_ENTERPRISE
    - name: RUNNER_LABELS
      value: self-hosted,linux,ubuntu-latest,ubuntu-18.04,ubuntu-20.04
    - name: RUNNER_GROUP
    - name: DOCKER_ENABLED
      value: "true"
    - name: DOCKERD_IN_RUNNER
      value: "false"
    - name: GITHUB_URL
      value: https://github.com/
    - name: RUNNER_WORKDIR
      value: /runner/_work
    - name: RUNNER_EPHEMERAL
      value: "true"
    - name: RUNNER_STATUS_UPDATE_HOOK
      value: "false"
    - name: GITHUB_ACTIONS_RUNNER_EXTRA_USER_AGENT
      value: actions-runner-controller/v0.27.3
    - name: DOCKER_HOST
      value: unix:///run/docker/docker.sock
    - name: RUNNER_NAME
      value: softrams-2m2v8-2m456
    - name: RUNNER_TOKEN
      value: A2IJC4U7WPPUATJAJ7L77FTFF3DZ3AVPNFXHG5DBNRWGC5DJN5XF62LEZYA2PBTSWFUW443UMFWGYYLUNFXW4X3UPFYGLN2JNZ2GKZ3SMF2GS33OJFXHG5DBNRWGC5DJN5XA
    image: summerwind/actions-runner:v2.303.0-ubuntu-22.04
    imagePullPolicy: Always
    name: runner
    resources: {}
    securityContext: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /runner
      name: runner
    - mountPath: /runner/_work
      name: work
    - mountPath: /run/docker
      name: docker-sock
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-nzwml
      readOnly: true
  - args:
    - dockerd
    - --host=unix:///run/docker/docker.sock
    - --group=$(DOCKER_GROUP_GID)
    - --registry-mirror=http://docker-registry.docker-registry:5000
    env:
    - name: DOCKER_GROUP_GID
      value: "121"
    image: docker:23.0.5-dind
    imagePullPolicy: IfNotPresent
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - timeout "${RUNNER_GRACEFUL_STOP_TIMEOUT:-15}" /bin/sh -c "echo 'Prestop
            hook started'; while [ -f /runner/.runner ]; do sleep 1; done; echo 'Waiting
            for dockerd to start'; while ! pgrep -x dockerd; do sleep 1; done; echo
            'Prestop hook stopped'" >/proc/1/fd/1 2>&1
    name: docker
    resources: {}
    securityContext:
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /runner
      name: runner
    - mountPath: /run/docker
      name: docker-sock
    - mountPath: /runner/_work
      name: work
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-nzwml
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: regcred
  nodeName: ip-172-16-2-67.ec2.internal
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - emptyDir: {}
    name: runner
  - emptyDir: {}
    name: work
  - emptyDir:
      medium: Memory
      sizeLimit: 1M
    name: docker-sock
  - name: kube-api-access-nzwml
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-10-17T16:42:53Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-10-17T16:43:09Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-10-17T16:43:09Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-10-17T16:42:53Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://6b20ac8c94f013a8820d46e5391729ff8206731aaf5d60c90fd344d3f20cc4ff
    image: docker.io/library/docker:23.0.5-dind
    imageID: docker.io/library/docker@sha256:f23f0a4013f184f6af3a3892dd12eba74bdbc5988d2a54ae468a8a6a44078434
    lastState: {}
    name: docker
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2023-10-17T16:43:08Z"
  - containerID: containerd://3a42625e2e9dd94b0c5a3f597bee4a953c37382e1601a6fa3332d04bf4cb90ac
    image: docker.io/summerwind/actions-runner:v2.303.0-ubuntu-22.04
    imageID: docker.io/summerwind/actions-runner@sha256:90bac4a220c9a5b501d822a1e59b22f842a32c4e4d72b82ff9ea955a2ef7fbe2
    lastState: {}
    name: runner
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2023-10-17T16:43:07Z"
  hostIP: 172.16.2.67
  phase: Running
  podIP: 172.16.2.72
  podIPs:
  - ip: 172.16.2.72
  qosClass: BestEffort
  startTime: "2023-10-17T16:42:53Z"
To Reproduce
Terminate a node that is running the pod.
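For example, on EKS this can be triggered by terminating the EC2 instance under a busy runner, or roughly equivalently from the cluster side (the instance ID below is a placeholder; the node name is the one from the manifest above):
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0
kubectl delete node ip-172-16-2-67.ec2.internal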
Describe the bug
When the node gets removed/terminated, the runner pod is stuck in the Terminating state. The pod's restartPolicy is set to Never. Each container inside the pod still reports a running state even though the node has been removed completely, so the finalizer logic assumes the runner is still running and never removes the runner pod, leaving it stuck in Terminating.
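For reference, a quick way to see that the pod is blocked on the finalizer rather than on a grace period (a sketch using the pod and namespace from the manifest above):
kubectl get pod softrams-2m2v8-2m456 -n actions-runner-system \
  -o jsonpath='{.metadata.deletionTimestamp} {.metadata.finalizers} {.status.phase}{"\n"}'
It prints the deletionTimestamp and the actions.summerwind.dev/runner-pod finalizer while the phase is still reported as Running.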
Describe the expected behavior
Pods that are stuck in the Terminating state should be cleaned up gracefully.
Whole Controller Logs
2023-10-20T12:04:13Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/softrams-2m2v8-9t6zs"}
2023-10-20T12:04:13Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/softrams-2m2v8-q2t9w"}
2023-10-20T12:04:13Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/softrams-2m2v8-zkcp5"}
2023-10-20T12:04:13Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/softrams-2m2v8-75r9n"}
2023-10-20T12:04:13Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/softrams-2m2v8-8mgzs"}
2023-10-20T12:04:13Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/softrams-2m2v8-r446l"}
2023-10-20T12:04:14Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/softrams-2m2v8-9fbpz"}
2023-10-20T12:04:14Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/softrams-2m2v8-fmfcf"}
2023-10-20T12:04:14Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/softrams-2m2v8-8l7cn"}
2023-10-20T12:04:14Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/softrams-2m2v8-ptdd9"}
2023-10-20T12:04:14Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/softrams-2m2v8-j8v8s"}
2023-10-20T12:04:14Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/softrams-2m2v8-f9lpf"}
2023-10-20T12:04:14Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/softrams-2m2v8-6l8bc"}
2023-10-20T12:04:14Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/softrams-2m2v8-rz595"}
2023-10-20T12:04:14Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/softrams-2m2v8-plq4r"}
2023-10-20T12:04:14Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/softrams-2m2v8-qf4hd"}
2023-10-20T12:04:14Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/softrams-2m2v8-9t6zs"}
2023-10-20T12:04:14Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/softrams-2m2v8-9l58l"}
2023-10-20T12:04:14Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/softrams-2m2v8-2m456"}
2023-10-20T12:04:14Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/softrams-2m2v8-znjzc"}
2023-10-20T12:04:14Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/softrams-2m2v8-lvfvk"}
2023-10-20T12:04:14Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/softrams-2m2v8-q2t9w"}
2023-10-20T12:04:14Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/softrams-2m2v8-zkcp5"}
2023-10-20T12:04:14Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/softrams-2m2v8-75r9n"}
2023-10-20T12:04:14Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/softrams-2m2v8-8mgzs"}
2023-10-20T12:04:14Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/softrams-2m2v8-r446l"}
2023-10-20T12:04:15Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/softrams-2m2v8-8l7cn"}
Whole Runner Pod Logs
Defaulted container "runner" out of: runner, docker
Error from server (NotFound): pods "ip-172-16-2-67.ec2.internal" not found
Additional Context
No response
Hello! Thank you for filing an issue.
The maintainers will triage your issue shortly.
In the meantime, please take a look at the troubleshooting guide for bug reports.
If this is a feature request, please review our contribution guidelines.
I have just observed the same behaviour :)
I have the runner set configured with minRunners: 3 and maxRunners: 27.
I have a custom "homelab" k8s cluster.
One of the nodes went offline so the pod was left "hanging".
NAME                                READY   STATUS        RESTARTS   AGE
arc-runner-set-k4t2m-runner-dhvfm   1/1     Terminating   0          4h26m
arc-runner-set-k4t2m-runner-mr7qx   1/1     Running       0          12m
arc-runner-set-k4t2m-runner-v2958   1/1     Running       0          9m33s
The ARC controller will consider the "hanging" pod as active and will not schedule a new runner.
In my case, I have 3 active GHA jobs. 2 of the jobs are executed by GHA, but the 3rd job is waiting for an available runner... and ARC will not add a new runner :)
Maybe a pod in the "Terminating" state should not count as available/active/running.
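A rough way to list pods in that state (deletionTimestamp set but phase still Running); this is just a sketch and assumes jq is available:
kubectl get pods -n arc-runners -o json \
  | jq -r '.items[] | select(.metadata.deletionTimestamp != null and .status.phase == "Running") | .metadata.name'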
I tried a manual / force delete of the pod
$ kubectl delete pod arc-runner-set-k4t2m-runner-dhvfm --grace-period=0 --force -n arc-runners
... but the controller is still stuck
2023-10-21T13:20:02Z INFO EphemeralRunner Waiting for ephemeral runner owned resources to be deleted {"ephemeralrunner": "arc-runners/arc-runner-set-k4t2m-runner-dhvfm"}
When I was checking the ephemeralrunners.actions.github.com resources I could see that the one for my stuck pod was missing... and there were other entries for pods that no longer exist.
I manually removed the extra ephemeralrunner objects and ARC works again.
Somehow they got out of sync.
I have only just started using ARC.
Not sure how often this can happen.
I was expecting ARC to have some sort of "self-healing" process that runs every once in a while and checks that there are no ephemeralrunner objects, created more than 5 minutes ago, whose pods no longer exist.
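A check along these lines could flag the orphans; it's only a sketch and assumes each EphemeralRunner shares its name with its runner pod (as the arc-runner-set-k4t2m-runner-* names above suggest):
for er in $(kubectl get ephemeralrunners.actions.github.com -n arc-runners -o name); do
  name=${er#*/}   # strip the resource/ prefix from the object name
  kubectl get pod "$name" -n arc-runners >/dev/null 2>&1 || echo "possibly orphaned: $name"
done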
I still see the pods stuck in Terminating state - 53 of them - some have been stuck for more than a week! The nodes to which these pods were attached are all gone!
If you want to recover this cluster, I think you will need to manually force-delete the pods, and then look for the matching ephemeralrunners.actions.github.com objects and delete them as well.
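Roughly, with <name> as a placeholder for the stuck pod/runner and assuming the arc-runners namespace:
kubectl delete pod <name> -n arc-runners --grace-period=0 --force
kubectl delete ephemeralrunners.actions.github.com <name> -n arc-runners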
Well, we cleaned it up last time and it still keeps popping up.
Also, we don't see any ephemeralrunners.actions.github.com objects.
We are having the same problem. Issues #1369 and #236 were both opened for similar problems, and #1369 kind of tapered off. We see pods stuck in Terminating for days. Removing the finalizer doesn't always fix it; we still have to force-delete them sometimes. There are no ephemeralrunners.actions.github.com objects.
Seeing this issue as well, with hundreds of Terminating pods. Removing the finalizers and force-deleting the pods removes them, but why is this happening?
Same here: pods stuck in Terminating with phase: Running.
The nodes don't even exist anymore (spot instances).
Controller log in a loop:
2024-05-09T20:23:03Z INFO runnerpod Unregistration started before runner obtains ID. Waiting for the registration timeout to elapse, or the runner to obtain ID, or the runner pod to stop {"runnerpod": "actions-runner-controller/myrunnerpod-name-shr-rd-ggdrs-dqqwx", "registrationTimeout": "10m0s"}
to "fix" this issue, at least kill the pod, it's necessary to remove the finalizer. as it is in terminating state, it supress the autoscaling and doesn't run queued jobs.
I'm still seeing this issue.
Ditto
Any update on this?
For me, patching the ephemeralrunners.actions.github.com finalizers fixed the stuck Terminating pods in the arc-runners namespace:
kubectl get ephemeralrunners.actions.github.com -n arc-runners -o name \
| xargs -I {} kubectl patch {} -n arc-runners \
-p '{"metadata":{"finalizers":null}}' --type=merge
I think it wasn't finalizing for this reason:
Failed to create the pod: pods "arc-runner-set-2pkbf-runner-h8h7g" is forbidden: violates PodSecurity "baseline:latest": privileged (container "dind" must not set securityContext.privileged=true)
I was trying to nuke the namespace and rebuild it with the permissions it needed. A repro might be to have an unprivileged namespace and:
# https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/deploying-runner-scale-sets-with-actions-runner-controller#using-docker-in-docker-or-kubernetes-mode-for-containers
containerMode:
  type: "dind"  # type can be set to dind or kubernetes
For me it was necessary to remove the runner finalizer:
kubectl patch runner <runner-name> -p '{"metadata": {"finalizers": null}}' --type merge
PS: without --type merge I got
error: application/strategic-merge-patch+json is not supported by actions.summerwind.dev/v1alpha1, Kind=Runner: the body of the request was in an unknown format - accepted media types include: application/json-patch+json, application/merge-patch+json, application/apply-patch+yaml
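A JSON patch should work as well, since application/json-patch+json is in the accepted list from that error (the runner name is a placeholder):
kubectl patch runner <runner-name> --type json \
  -p '[{"op":"remove","path":"/metadata/finalizers"}]'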