runner-container-hooks
Enforce same node when turning K8s Scheduler on for Workflow Pods
What would you like added?
As described in https://github.com/actions/runner-container-hooks/pull/111#issue-1956578120, a workflow pod can currently be placed on any node of the given K8s cluster. I want to enforce that the pod is scheduled by the K8s scheduler onto the same node as the runner pod.
Why is this needed?
Currently, there are two options, each with downsides:
- Bypass the K8s scheduler (the default behaviour), which can lead to pods failing with "OutOfCpu", for example: https://github.com/actions/runner-container-hooks/issues/112
- Enable the "useK8sScheduler" option: https://github.com/actions/runner-container-hooks/pull/111. The downside is that the pods can be placed on any node in the cluster, which in turn requires the consumer to provide read-write-many (RWX) volumes.
I suggest extending the "useK8sScheduler" option to generate pod affinities as part of this functionality. It should be rather straightforward to force the K8s scheduler to place the pod on the same node. The benefit is that if there are not enough resources on this node at the moment of pod creation, the pod will go into a Pending state instead of failing.
Additional context
This could easily be achieved via a node affinity on the "kubernetes.io/hostname" label, which should be available by default on all K8s distributions.
I would be open to contributing something similar to the snippet below to https://github.com/actions/runner-container-hooks/blob/main/packages/k8s/src/k8s/index.ts#L98
if (useKubeScheduler()) {
  // Let the K8s scheduler place the pod, but constrain it to the runner's node
  // via a required node affinity on the well-known hostname label.
  const affinity = new k8s.V1Affinity()
  affinity.nodeAffinity = new k8s.V1NodeAffinity()
  const nodeSelector = new k8s.V1NodeSelector()
  const term = new k8s.V1NodeSelectorTerm()
  const req = new k8s.V1NodeSelectorRequirement()
  req.key = 'kubernetes.io/hostname'
  req.values = [await getCurrentNodeName()]
  req.operator = 'In' // node selector operators are case-sensitive ('In', not 'IN')
  term.matchExpressions = [req]
  nodeSelector.nodeSelectorTerms = [term]
  affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution = nodeSelector
  appPod.spec.affinity = affinity
} else {
  // current default: bypass the scheduler by pinning the pod to this node
  appPod.spec.nodeName = await getCurrentNodeName()
}
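For readers more used to pod manifests than the client builder classes, the same constraint can be written as a single object literal, which makes the shape that ends up in the workflow pod spec visible at a glance. This is only a sketch: it assumes the @kubernetes/client-node model types accept structurally compatible plain objects, and it reuses appPod and getCurrentNodeName() from the hook code above.

// Same-node constraint as above, as one object literal
appPod.spec.affinity = {
  nodeAffinity: {
    requiredDuringSchedulingIgnoredDuringExecution: {
      nodeSelectorTerms: [
        {
          matchExpressions: [
            {
              // well-known label set by the kubelet on every node
              key: 'kubernetes.io/hostname',
              operator: 'In',
              values: [await getCurrentNodeName()]
            }
          ]
        }
      ]
    }
  }
}

With this constraint in place, a workflow pod that does not fit on the runner's node stays Pending instead of being rejected with "OutOfCpu".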
Please share some feedback or ideas on this problem. Thanks a lot in advance!
@mzagozen / @pje / @chrispat / @DanRigby / @joshmgross : may I kindly ask you for feedback on this Feature Request?
I think this caused a regression with our runners. We have container jobs that need to run on GPU nodes. The -runner pod gets scheduled on CPU nodes, so adding this affinity causes the -workflow pod (which has GPU requests) to also attempt to be scheduled on the same CPU node.
Let me know if this makes sense.
Our setup:
- ACTIONS_RUNNER_USE_KUBE_SCHEDULER=true
- ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE points to a YAML with [1]
- runner template uses containerMode: kubernetes
- runner template uses RWX volume [2]
[1]
resources:
  limits:
    cpu: "16"
    memory: 300Gi
    nvidia.com/gpu: "2"
  requests:
    cpu: "16"
    memory: 300Gi
    nvidia.com/gpu: "2"
[2]
volumes:
  - name: work
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes: [ "ReadWriteMany" ]
          storageClassName: "filestore-gha"
          resources:
            requests:
              storage: 1Gi
GPU is a scarce resource: we have many hosts with fully occupied GPUs and only a few hosts with free GPUs. Runner pods can be scheduled on any node and will accept workloads, and if the workflow pod is then restricted to the same node, it will never run. Disaggregating runner and workflow pods allowed us to solve this issue. We should at least have an option to go back to the previous behavior.
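To make that concrete, a minimal sketch of such an opt-out in the hook code could look like the following. The environment variable ACTIONS_RUNNER_REQUIRE_SAME_NODE and the buildSameNodeAffinity helper are invented names for illustration only, not existing hook settings.

// Hypothetical opt-out sketch; names are illustrative only.
function requireSameNode(): boolean {
  // default to the current (same-node) behaviour unless explicitly disabled
  return process.env.ACTIONS_RUNNER_REQUIRE_SAME_NODE !== 'false'
}

if (useKubeScheduler()) {
  if (requireSameNode()) {
    // pin the workflow pod to the runner's node, as in the affinity snippet above
    appPod.spec.affinity = buildSameNodeAffinity(await getCurrentNodeName())
  }
  // otherwise: let the scheduler place the workflow pod on any eligible node,
  // which needs an RWX work volume but frees GPU jobs from the runner's node
} else {
  appPod.spec.nodeName = await getCurrentNodeName()
}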
Bump, any suggestions? Can we roll this back or fix it?
I've created #235 to be able to disable this.
I'm also impacted by the affinity that forces the workflow pod onto the same node as the runner. We are using an RWX volume to be able to schedule both pods on different nodes with suitable CPU/memory resources. When will a fix for this be released?