runner-container-hooks
Workflow pods take 3 minutes to start after the runner pod on RWX & containerMode: kubernetes
Checks
- [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- [X] I am using charts that are officially provided
Controller Version
0.9.3
Deployment Method
Helm
Checks
- [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
- [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
Set up an ARC runner scale set with containerMode: kubernetes (a minimal values sketch follows this list; the full configuration is in Additional Context).
Use an NFS-based StorageClass (Azure Files) to back the work volumes.
Build a Docker image via GitHub Actions using kaniko.
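For reference, this is roughly the values stanza involved (a minimal sketch; the storage class name and size are placeholders, the complete values are shown under Additional Context):

containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteMany"]
    storageClassName: azurefile-nfs-rwx   # placeholder: the NFS-backed Azure Files StorageClass
    resources:
      requests:
        storage: 10Gi                      # placeholder size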
Describe the bug
After the runner pod initializes (which is fairly immediate), the GitHub Actions jobs (6 of them) seem to get stuck polling for 2-3 minutes before the workflow pod spins up to continue the job.
The runner pod logs show a job polling every 5-10 seconds for 2-3 minutes before the container hook is called and the workflow pod is spun up.
See lines 6-52 in the scale set logs gist below; this line is logged every few seconds:
[WORKER 2024-12-03 19:21:58Z INFO HostContext] Well known directory 'Root': '/home/runner'
This bug started occurring when we switched to a new RWX storage class backed by NFS-based Azure Files. I suppose it might be the slowness of provisioning a PVC with Azure Files versus our previous disk-backed RWO setup.
Describe the expected behavior
After the runner pod initializes for a new GitHub Actions job, the workflow pod should spin up almost immediately to process the Docker build for each GHA job.
Additional Context
Here is the ARC runner scale set configuration:
initContainers:
  - name: kube-init
    image: ghcr.io/actions/actions-runner:latest
    command: ["/bin/sh", "-c"]
    args:
      - |
        sudo chown -R ${local.github_runner_user_gid}:123 /home/runner/_work
    volumeMounts:
      - name: work
        mountPath: /home/runner/_work
securityContext:
  fsGroup: 123 ## needed to resolve permission issues with mounted volume. https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors#error-access-to-the-path-homerunner_work_tool-is-denied
containers:
  - name: runner
    image: ghcr.io/actions/actions-runner:latest
    command: ["/home/runner/run.sh"]
    env:
      - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
        value: /home/runner/pod-templates/default.yml
      - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
        value: "false" ## allow jobs without a job container to run; this instructs the runner to disable that check
      - name: ACTIONS_RUNNER_USE_KUBE_SCHEDULER # flag enables separate scheduling for workflow pods
        value: "true"
    volumeMounts:
      - name: pod-templates
        mountPath: /home/runner/pod-templates
        readOnly: true
volumes:
  - name: work
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteMany"]
          storageClassName: ${local.storage_class_name}
          resources:
            requests:
              storage: ${local.volume_claim_size}
  - name: pod-templates
    configMap:
      name: "runner-pod-template"
containerMode:
  type: "kubernetes" ## type can be set to dind or kubernetes
  ## the following is required when containerMode.type=kubernetes
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteMany"]
    storageClassName: ${local.storage_class_name}
    resources:
      requests:
        storage: ${local.volume_claim_size}
EOF
]
}
locals {
  job_template_name = "runner-pod-template"
}

resource "kubernetes_config_map" "job_template" {
  metadata {
    name      = local.job_template_name
    namespace = local.gha_runner_namespace
  }
  data = {
    "default.yml" = yamlencode({
      apiVersion = "v1"
      kind       = "PodTemplate"
      metadata = {
        name = local.job_template_name
      }
      spec = {
        containers = [
          {
            name = "$job"
            resources = {
              requests = {
                cpu = "3000m"
              }
              limits = {
                cpu = "3000m"
              }
            }
          }
        ]
      }
    })
  }
}
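For clarity, the default.yml this ConfigMap renders (the file ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE points at) is roughly the following hook pod template; the entry named "$job" is what the container hook is expected to merge into the workflow pod's job container:

apiVersion: v1
kind: PodTemplate
metadata:
  name: runner-pod-template
spec:
  containers:
    - name: "$job"        # matched by the container hook and merged into the job container
      resources:
        requests:
          cpu: 3000m
        limits:
          cpu: 3000m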
# GHA job
/kaniko/executor --dockerfile=".Dockerfilehere" \
--context="${{ github.repositoryUrl }}#${{ github.ref }}#${{ github.sha }}" \
--destination="randomcontainerregistry:taghere" \
--use-new-run \
--snapshot-mode=redo \
--compressed-caching=false \
--registry-mirror=mirror.gcr.io \
--cache=true --cache-copy-layers=false --cache-ttl=500h \
--push-retry 5
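For context, this executor call runs as a container job on the scale set. A rough sketch of the workflow wiring, assuming a hypothetical scale set name and reusing the placeholder registry/tag from above (the kaniko :debug tag is used because it ships a shell for run steps):

jobs:
  build:
    runs-on: arc-runner-set                        # hypothetical runner scale set name
    container:
      image: gcr.io/kaniko-project/executor:debug  # :debug includes a busybox shell for run steps
    steps:
      - name: Build and push image with kaniko
        run: |
          /kaniko/executor --dockerfile=".Dockerfilehere" \
            --context="${{ github.repositoryUrl }}#${{ github.ref }}#${{ github.sha }}" \
            --destination="randomcontainerregistry:taghere" \
            --cache=true --push-retry 5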
# Storage class
resource "kubernetes_manifest" "csi_storage_class" {
manifest = {
apiVersion = "storage.k8s.io/v1"
kind = "StorageClass"
metadata = {
name = "storageclassawesome"
}
provisioner = "file.csi.azure.com"
allowVolumeExpansion = true
parameters = {
resourceGroup = "yup"
storageAccount = "yup"
skuName = "Premium_LRS"
location = "sdfsf"
server = "test.net"
}
reclaimPolicy = "Delete"
volumeBindingMode = "Immediate"
mountOptions = [
"dir_mode=0777",
"file_mode=0777",
"uid=1000",
"gid=1000",
"mfsymlinks",
"cache=strict",
"nosharesock",
"actimeo=30"
]
Controller Logs
ARC Controller & Scaleset Logs: https://gist.github.com/jonathan-fileread/fd0978bef66784e20d6b50bce50cd3b9
Runner Pod Logs
ARC Controller & Scaleset Logs: https://gist.github.com/jonathan-fileread/fd0978bef66784e20d6b50bce50cd3b9
@alexgaganashvili @nikola-jokic Hey Nikola & Alex - I've seen y'all run into similar issues before, so let me know if you see something! Deeply appreciated.
I don't think it's slowness in PV provisioning, since it's the same PV shared between a runner and a workflow pod. Maybe K8s is trying to find a node that fits your resource requests (ACTIONS_RUNNER_USE_KUBE_SCHEDULER=true)? Also check the kube-scheduler logs.
Hey @alexgaganashvili - thanks for the comment. I checked the kube-scheduler logs and kubectl get events; nothing revealing so far.
The node has room for the workflow pod (5000m CPU still allocatable to be requested), with one workflow pod requesting 3000m CPU.
I feel it has something to do with this polling process.
If you look at the timestamps, it's stuck for a minute repeating the same pod logs. I wonder what the best way to debug this further would be.
Sorry, hard to tell what's causing it. I have not personally run into this issue. I'd suggest you also ask in the Discussions.
@jonathan-fileread, I have switched to NFS-based storage with RWX and noticed that the Initialize step (for workflow pods) takes much longer than it does with RWO block storage. There's also slowness when checking out code even with just a runner pod. That's probably to be expected from NFS, I guess. On the other hand, at least jobs won't fail the way they can with RWO and workflow pods. Still, I'd like to know the reason behind the slowness.
cc: @Link- , @nikola-jokic
Hey everyone,
I transferred this issue here since it is related to the container hook, not ARC. Most likely the latency comes from Kubernetes itself: the NFS volume is slow to mount across multiple nodes. We need to find a better way to let workflow pods land on different nodes without relying on a shared volume. The runner and the workflow pod have to share some files, but we can probably find a solution that does not rely on RWX volumes.
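To illustrate the coupling described above (a rough sketch only; names and the mount path are illustrative, not the hook's exact output): the workflow pod created by the hook mounts the same work volume claim that backs the runner's /home/runner/_work, which is why the claim must be RWX for the two pods to land on different nodes:

spec:
  containers:
    - name: job
      volumeMounts:
        - name: work
          mountPath: /__w                # shared work directory (path illustrative)
  volumes:
    - name: work
      persistentVolumeClaim:
        claimName: runner-pod-work       # illustrative; the hook reuses the runner's work volume claim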
Thanks, @nikola-jokic. Just to add another piece of info: during these two or so minutes (in my case) at the "Initialize container" step, the workflow pod does not show up in Kubernetes at all. It appears that mounting the same volume on a second node affects the scheduling time of the corresponding workflow pod (creating the PVC and initially attaching it to the runner pod is fast).
Hi @nikola-jokic. Is there an official plan and ETA to move away from RWX volumes when ACTIONS_RUNNER_USE_KUBE_SCHEDULER is enabled?
The current pairing of runner and job/workflow pods makes it problematic to schedule pods when they have resource requests/limits set. For example, we would get into a situation where the runner pod fits on a node, but the job/workflow pod can't fit on the same node due to its resource requests. In that case the job fails because the workflow pod is never scheduled. We use Karpenter, which exacerbates the issue even further since it keeps node utilization pretty high (>80%), so most job/workflow pods would fail to schedule on the same node.
Then we tried ACTIONS_RUNNER_USE_KUBE_SCHEDULER with an RWX volume, but had to give up on it due to the abysmal performance of EFS, which is especially bad with a large repo containing a lot of small files.
It really feels like there is no good way to get resource requests/limits honored with ARC.
@jonathan-fileread @alexgaganashvili @zarko-a I'm struggling with the exact same issues with ARC Kubernetes mode. To manage resource requests/limits on the workflow pod, I tried switching to an RWX volume for sharing the workspace directory. However, I never reached the state where my two pods, the runner and the workflow pod, could schedule independently on different nodes, because for some reason ARC adds this kind of nodeAffinity to the workflow pod:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - <node-name-where-runner-pod-is-scheduled>
How did you manage to make it work properly?
@LeonoreMangold Besides having RWX volumes, you need to set ACTIONS_RUNNER_USE_KUBE_SCHEDULER to true in the runner scale set values file (assuming it's installed via the Helm chart), roughly as in the sketch below.
That worked for us with an NFS-backed RWX volume, but it was way too slow.
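A minimal sketch of the relevant values (assuming the gha-runner-scale-set chart's template/containerMode layout; storage class name and size are placeholders):

template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          - name: ACTIONS_RUNNER_USE_KUBE_SCHEDULER   # let the workflow pod schedule on any node
            value: "true"
    volumes:
      - name: work
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes: ["ReadWriteMany"]          # RWX so runner and workflow pods can sit on different nodes
              storageClassName: nfs-rwx               # placeholder
              resources:
                requests:
                  storage: 10Gi                       # placeholder
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteMany"]
    storageClassName: nfs-rwx                         # placeholder
    resources:
      requests:
        storage: 10Gi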
Alternatively, they just merged a different mode that doesn't need shared volumes at all. It hasn't made it into a release yet, but it may be worth waiting and trying it out instead.
I stand corrected, it looks like it was released 5 hours ago. https://github.com/actions/runner-container-hooks/releases/tag/v0.8.0
@zarko-a thanks for your answer! In the meantime I found the "feature" responsible for adding this node affinity that hinders scheduling on different nodes: see https://github.com/actions/runner-container-hooks/issues/201.
I was also eagerly waiting for an alternative to shared volumes, so I'll look at that right away - thanks for the heads-up!