runner-container-hooks

Enforce same node when turning K8s Scheduler on for Workflow Pods

Open BerndFarkaDyna opened this issue 10 months ago • 2 comments

What would you like added?

As described here: https://github.com/actions/runner-container-hooks/pull/111#issue-1956578120, a workflow pod can currently be placed on any node of the given K8s cluster. I want the workflow pod to be scheduled via the K8s Scheduler onto the same node as the runner pod.

Why is this needed?

Currently, there are two options, each with downsides:

  • Bypassing the K8s Scheduler (default behaviour), which can lead to pods failing with "OutOfCpu", as seen for example in https://github.com/actions/runner-container-hooks/issues/112
  • Enabling the "useK8sScheduler" option: https://github.com/actions/runner-container-hooks/pull/111. The downside of this feature is that the workflow pod can be placed on any node in the cluster, which in turn requires the consumer to provide read-write-many volumes (enabling the option is sketched below).
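
For context, the "useK8sScheduler" option is toggled via an environment variable on the runner container. A minimal sketch of enabling it, assuming an ARC gha-runner-scale-set style values file (the image, command, and hook-template path are illustrative assumptions, not prescriptive values):

    containerMode:
      type: "kubernetes"
    template:
      spec:
        containers:
          - name: runner
            image: ghcr.io/actions/actions-runner:latest
            command: ["/home/runner/run.sh"]
            env:
              # Let the container hook schedule workflow pods via the K8s scheduler
              - name: ACTIONS_RUNNER_USE_KUBE_SCHEDULER
                value: "true"
              # Optional custom workflow pod template (path is illustrative)
              - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
                value: /home/runner/pod-templates/default.yaml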

I suggest extending the "useK8sScheduler" option so that it also generates pod affinities. It should be rather straightforward to force the K8s Scheduler to place the workflow pod on the same node. The benefit is that if the node does not have enough resources at the moment of pod creation, the pod goes into a Pending state instead of failing.

Additional context

This could easily be achieved via an affinity on the kubernetes.io/hostname label, which should be available by default on all K8s distributions.
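
Roughly, the hook would inject something like the following into the workflow pod spec (a sketch of the intended result, equivalent to the code snippet proposed further down; the node name is a placeholder):

    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                    - <node-running-the-runner-pod>   # example placeholder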

BerndFarkaDyna avatar Jan 22 '25 15:01 BerndFarkaDyna

I would be open to contributing something like the following snippet to https://github.com/actions/runner-container-hooks/blob/main/packages/k8s/src/k8s/index.ts#L98:

  if (useKubeScheduler()) {
    // Let the K8s scheduler place the workflow pod, but require it to land on
    // the same node as the runner via a node affinity on kubernetes.io/hostname.
    const affinity = new k8s.V1Affinity()
    affinity.nodeAffinity = new k8s.V1NodeAffinity()
    const nodeSelector = new k8s.V1NodeSelector()

    const term = new k8s.V1NodeSelectorTerm()
    const req = new k8s.V1NodeSelectorRequirement()
    req.key = 'kubernetes.io/hostname'
    req.values = [await getCurrentNodeName()]
    req.operator = 'In' // operator values are case-sensitive: 'In', not 'IN'
    term.matchExpressions = [req]
    nodeSelector.nodeSelectorTerms = [term]
    affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution =
      nodeSelector
    appPod.spec.affinity = affinity
  } else {
    // Current default: bypass the scheduler by pinning the pod directly to the node.
    appPod.spec.nodeName = await getCurrentNodeName()
  }

Please share some feedback or ideas on this problem; thanks a lot in advance.

BerndFarkaDyna avatar Jan 22 '25 18:01 BerndFarkaDyna

@mzagozen / @pje / @chrispat / @DanRigby / @joshmgross : may I kindly ask you for feedback on this Feature Request?

BerndFarkaDyna avatar Jan 27 '25 15:01 BerndFarkaDyna

I think this caused a regression with our runners. We have container jobs that need to run on GPU nodes. The -runner pod gets scheduled on CPU nodes, so adding this affinity causes the -workflow pod (which has GPU requests) to also attempt to be scheduled on the same CPU node. Let me know if this makes sense.

Our setup:

  • ACTIONS_RUNNER_USE_KUBE_SCHEDULER=true
  • ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE points to a YAML with the resources in [1]
  • runner template uses containerMode: kubernetes
  • runner template uses an RWX volume ([2])

[1]

        resources:
          limits:
            cpu: "16"
            memory: 300Gi
            nvidia.com/gpu: "2"
          requests:
            cpu: "16"
            memory: 300Gi
            nvidia.com/gpu: "2"

[2]

    volumes: 
      - name: work
        ephemeral:
          volumeClaimTemplate: 
            spec:
              accessModes: [ "ReadWriteMany" ] 
              storageClassName: "filestore-gha" 
              resources:
                requests:
                  storage: 1Gi

zchenyu avatar May 13 '25 18:05 zchenyu

GPU is a scarce resource. We have many hosts with fully occupied GPUs and only a few hosts with free GPUs. Runner pods can be scheduled on any node, and they will accept workloads; if the workflow pod is restricted to the same node, it will never run. Disaggregating runner and workflow pods allowed us to solve this issue. We should at least have an option to go back to the previous behavior.

divchenko avatar May 13 '25 19:05 divchenko
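
For illustration, one way to restore the previous behaviour would be to gate the same-node affinity behind an opt-out flag in the hook. A hypothetical sketch only: the ACTIONS_RUNNER_REQUIRE_SAME_NODE variable and the buildSameNodeAffinity helper are assumptions for this example, not existing hook APIs.

    // Hypothetical opt-out flag; not an existing hook setting.
    function requireSameNodeAsRunner(): boolean {
      return process.env.ACTIONS_RUNNER_REQUIRE_SAME_NODE !== 'false'
    }

    if (useKubeScheduler()) {
      if (requireSameNodeAsRunner()) {
        // Pin the workflow pod to the runner's node via node affinity
        // (buildSameNodeAffinity would wrap the affinity construction shown above).
        appPod.spec.affinity = buildSameNodeAffinity(await getCurrentNodeName())
      }
      // else: let the K8s scheduler place the pod on any suitable node,
      // which requires a ReadWriteMany work volume.
    } else {
      appPod.spec.nodeName = await getCurrentNodeName()
    }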

Bump, any suggestions? Can we roll this back / fix?

zchenyu avatar Jul 14 '25 16:07 zchenyu

I've created #235 to be able to disable this.

Wielewout avatar Jul 16 '25 09:07 Wielewout

I'm also impacted by the pod affinity that forces the workflow pod onto the same node as the runner. We are using an RWX volume to be able to schedule both pods on different nodes with appropriate CPU/memory resources. When will a fix for this be released?

LeonoreMangold avatar Oct 06 '25 14:10 LeonoreMangold