Job pod failed to start on GKE Autopilot with container hooks (kubernetes mode)
Checks
- [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- [X] I am using charts that are officially provided
Controller Version
0.8.3
Deployment Method
Helm
Checks
- [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
- [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
runner-scale-set-values.yaml

```yaml
githubConfigUrl: "https://github.com/my/repo"
githubConfigSecret: github-token
runnerScaleSetName: "gke-autopilot"
maxRunners: 2
minRunners: 0
template:
  spec:
    securityContext:
      fsGroup: 1001
    serviceAccountName: gke-autopilot-gha-rs-kube-mode
    volumes:
      - name: work
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  storage: 4Gi
      - name: pod-templates
        configMap:
          name: pod-templates
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command:
          - /home/runner/run.sh
        env:
          - name: ACTIONS_RUNNER_CONTAINER_HOOKS
            value: /home/runner/k8s/index.js
          - name: ACTIONS_RUNNER_POD_NAME
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: metadata.name
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/pod-templates/default.yaml
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "true"
          - name: GITHUB_ACTIONS_RUNNER_EXTRA_USER_AGENT
            value: actions-runner-controller/0.8.3
        resources:
          requests:
            cpu: 250m
            memory: 1Gi
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: pod-templates
            mountPath: /home/runner/pod-templates
            readOnly: true
```
pod-template.yaml

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-templates
data:
  default.yaml: |
    ---
    apiVersion: v1
    kind: PodTemplate
    metadata:
      annotations:
        annotated-by: "extension"
      labels:
        labeled-by: "extension"
    spec:
      serviceAccountName: gke-autopilot-gha-rs-kube-mode
      securityContext:
        fsGroup: 1001
      containers:
        - name: $job # overwrites job container
          resources:
            requests:
              cpu: "3800m"
              memory: "4500"
```
rbac.yaml

```yaml
---
# Source: gha-runner-scale-set/templates/kube_mode_serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gke-autopilot-gha-rs-kube-mode
  namespace: actions
---
# Source: gha-runner-scale-set/templates/kube_mode_role.yaml
# default permission for runner pod service account in kubernetes mode (container hook)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gke-autopilot-gha-rs-kube-mode
  namespace: actions
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "create", "delete"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["get", "create"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "create", "delete"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "create", "delete"]
---
# Source: gha-runner-scale-set/templates/kube_mode_role_binding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gke-autopilot-gha-rs-kube-mode
  namespace: actions
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: gke-autopilot-gha-rs-kube-mode
subjects:
  - kind: ServiceAccount
    name: gke-autopilot-gha-rs-kube-mode
    namespace: actions
```
Describe the bug
I can see that a runner pod is created, but it fails to create the job pod with the message `Error: pod failed to come online with error: Error: Pod gke-autopilot-4vvrh-runner-74czb-workflow is unhealthy with phase status Failed`
Describe the expected behavior
I expected it to create a job pod.
Additional Context
It works if I don't try to customize the job pod, i.e. if I use a config like the one below. But I want to give more resources to the pod that actually runs the job, so I need to use pod templates to customize it.
```yaml
githubConfigUrl: "https://github.com/my/org"
githubConfigSecret: github-token
runnerScaleSetName: "gke-autopilot"
maxRunners: 2
minRunners: 0
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    resources:
      requests:
        storage: 4Gi
template:
  spec:
    securityContext:
      fsGroup: 1001
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
controllerServiceAccount:
  namespace: actions
  name: gha-runner-scale-set-controller-gha-rs-controller
```
Controller Logs
No errors, just regular logs. I can provide them if required.
Runner Pod Logs
```
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] Publish step telemetry for current step {
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "action": "Pre Job Hook",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "type": "runner",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "stage": "Pre",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "stepId": "06f9adc3-e79d-405b-91eb-a7f72f1e56c4",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "stepContextName": "06f9adc3e79d405b91eba7f72f1e56c4",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "result": "failed",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "errorMessages": [
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "Error: pod failed to come online with error: Error: Pod gke-autopilot-4vvrh-runner-74czb-workflow is unhealthy with phase status Failed",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "Process completed with exit code 1.",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "Executing the custom container implementation failed. Please contact your self hosted runner administrator."
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] ],
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "executionTimeInSeconds": 42,
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "startTime": "2024-03-27T15:18:57.1056563Z",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "finishTime": "2024-03-27T15:19:38.206926Z",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "containerHookData": "{\"hookScriptPath\":\"/home/runner/k8s/index.js\"}"
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] }.
[WORKER 2024-03-27 15:19:38Z INFO StepsRunner] Update job result with current step result 'Failed'.
```
Hello! Thank you for filing an issue.
The maintainers will triage your issue shortly.
In the meantime, please take a look at the troubleshooting guide for bug reports.
If this is a feature request, please review our contribution guidelines.
I also tried the same config with GKE standard cluster and I'm running into https://github.com/actions/actions-runner-controller/issues/3132.
Hey @knkarthik,
I'm not sure that you are using the right service account. You should not use the controller's service account, but rather the kube-mode service account with the permissions you posted.
Thanks for the reply and sorry to confuse you @nikola-jokic.
I'm indeed using gke-autopilot-gha-rs-kube-mode, which afaik has the necessary permissions, as the service account.
The following is actually commented out in my values file, but in my post it was not. I've now removed it from my original post to make that clear.

```yaml
controllerServiceAccount:
  namespace: actions
  name: gha-runner-scale-set-controller-gha-rs-controller
```
Can you please monitor the cluster and run `kubectl describe` when the workflow pod is created?
@nikola-jokic I did some digging and, unfortunately, the pod only exists for < 1s, so I'm not able to describe it. However, when I run `kubectl events`, I get an OutOfcpu warning for the -workflow pod. So this seems to be the same issue as https://github.com/actions/actions-runner-controller/discussions/2527 and https://github.com/kubernetes/kubernetes/issues/115325.
```
> kubectl get events -n actions
LAST SEEN  TYPE     REASON                  OBJECT                                                        MESSAGE
9m4s       Normal   WaitForPodScheduled     persistentvolumeclaim/gke-autopilot-c4pk8-runner-hqz89-work   waiting for pod gke-autopilot-c4pk8-runner-hqz89 to be scheduled
9m3s       Normal   WaitForFirstConsumer    persistentvolumeclaim/gke-autopilot-c4pk8-runner-hqz89-work   waiting for first consumer to be created before binding
9m4s       Warning  FailedScheduling        pod/gke-autopilot-c4pk8-runner-hqz89                          0/2 nodes are available: waiting for ephemeral volume controller to create the persistentvolumeclaim "gke-autopilot-c4pk8-runner-hqz89-work". preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling..
12m        Normal   WaitForPodScheduled     persistentvolumeclaim/gke-autopilot-c4pk8-runner-lxzqj-work   waiting for pod gke-autopilot-c4pk8-runner-lxzqj to be scheduled
11m        Normal   ExternalProvisioning    persistentvolumeclaim/gke-autopilot-c4pk8-runner-lxzqj-work   waiting for a volume to be created, either by external provisioner "pd.csi.storage.gke.io" or manually created by system administrator
12m        Normal   Provisioning            persistentvolumeclaim/gke-autopilot-c4pk8-runner-lxzqj-work   External provisioner is provisioning volume for claim "actions/gke-autopilot-c4pk8-runner-lxzqj-work"
11m        Normal   Provisioning            persistentvolumeclaim/gke-autopilot-c4pk8-runner-lxzqj-work   External provisioner is provisioning volume for claim "actions/gke-autopilot-c4pk8-runner-lxzqj-work"
11m        Normal   Provisioning            persistentvolumeclaim/gke-autopilot-c4pk8-runner-lxzqj-work   External provisioner is provisioning volume for claim "actions/gke-autopilot-c4pk8-runner-lxzqj-work"
11m        Normal   Provisioning            persistentvolumeclaim/gke-autopilot-c4pk8-runner-lxzqj-work   External provisioner is provisioning volume for claim "actions/gke-autopilot-c4pk8-runner-lxzqj-work"
11m        Normal   ProvisioningSucceeded   persistentvolumeclaim/gke-autopilot-c4pk8-runner-lxzqj-work   Successfully provisioned volume pvc-91216e22-4299-422f-977b-51f3fcb219e1
9m15s      Warning  OutOfcpu                pod/gke-autopilot-c4pk8-runner-lxzqj-workflow                 Node didn't have enough resource: cpu, requested: 4000, used: 1849, capacity: 1930
11m        Normal   Scheduled               pod/gke-autopilot-c4pk8-runner-lxzqj                          Successfully assigned actions/gke-autopilot-c4pk8-runner-lxzqj to gk3-autopilot-pov-pool-2-3bb9a724-7q2p
10m        Warning  FailedMount             pod/gke-autopilot-c4pk8-runner-lxzqj                          MountVolume.SetUp failed for volume "pod-templates" : configmap "pod-templates" not found
11m        Normal   SuccessfulAttachVolume  pod/gke-autopilot-c4pk8-runner-lxzqj                          AttachVolume.Attach succeeded for volume "pvc-91216e22-4299-422f-977b-51f3fcb219e1"
10m        Normal   Pulling                 pod/gke-autopilot-c4pk8-runner-lxzqj                          Pulling image "ghcr.io/actions/actions-runner:latest"
10m        Normal   Pulled                  pod/gke-autopilot-c4pk8-runner-lxzqj                          Successfully pulled image "ghcr.io/actions/actions-runner:latest" in 238.11642ms (238.134258ms including waiting)
10m        Normal   Created                 pod/gke-autopilot-c4pk8-runner-lxzqj                          Created container runner
10m        Normal   Started                 pod/gke-autopilot-c4pk8-runner-lxzqj                          Started container runner
```
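The OutOfcpu event spells out the arithmetic: the workflow pod requests 4000 millicores against a node with 1930m capacity, of which 1849m is already in use. As a minimal sketch (not Kubernetes client code; the helper names are mine), the fit check the kubelet is effectively performing looks like:

```python
def parse_cpu_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity ("3800m" or "4") to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)

def fits(capacity_m: int, used_m: int, requested_m: int) -> bool:
    """True if a pod requesting requested_m millicores fits in the node's free CPU."""
    return requested_m <= capacity_m - used_m

# Numbers from the event: requested: 4000, used: 1849, capacity: 1930
print(fits(capacity_m=1930, used_m=1849, requested_m=parse_cpu_millicores("4000m")))
# -> False: only 81m is free, so the kubelet rejects the pod with OutOfcpu
```

If the linked kubernetes/kubernetes#115325 applies here, the pod is admitted to a node that cannot actually hold the request and is rejected by the kubelet immediately, which would also explain why the workflow pod disappears before `kubectl describe` can catch it.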
@knkarthik, not sure if it is just that, but I managed to pass resources for a GPU job with a ConfigMap very similar to yours, just without the comment on the `$job` name line. I don't know if you added that only in your post here, but it might be worth trying without it.
Mine looks like this:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-templates
data:
  default.yaml: |
    ---
    apiVersion: v1
    kind: PodTemplate
    metadata:
      annotations:
        annotated-by: "extension"
      labels:
        labeled-by: "extension"
    spec:
      containers:
        - name: $job
          resources:
            limits:
              nvidia.com/gpu: "1"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
```