actions-runner-controller
Cannot set resources Requests and Limits for workflow pods
Checks
- [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- [X] I am using charts that are officially provided
Controller Version
0.9.2
Deployment Method
Helm
Checks
- [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
- [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
1. Deploy the gha-runner-scale-set-controller first with the command below.
helm install arc . -f values.yaml -n arc-systems
2. Deploy the gha-runner-scale-set with Kubernetes mode enabled.
helm install arc-runner-set . -f values-kubernetes.yaml -n arc-runners
Ideal scenario: the workflow pods that come up should have the requested resources and limits set.
Describe the bug
The pods whose names end with "workflow" should have the specified CPU and memory resource requests and limits when they are created.
## resources:
##   requests:
##     memory: "4Gi"
##     cpu: "2"
##   limits:
##     memory: "6Gi"
##     cpu: "4"
Describe the expected behavior
The workflow pod that is created during the pipeline execution should have specific CPU and memory limits and requests set. However, it is not starting with the specified resources and limits.
Additionally, an extra pod is being created when the pipeline runs, alongside the existing runner pods. We need to understand the purpose of the existing runner pod if a new pod is also being initiated. Added the detail of the extra pod in the screenshot below.
Additional Context
Adding the values.yaml file for the gha-runner-scale-set below.
## githubConfigUrl is the GitHub url for where you want to configure runners
## ex: https://github.com/myorg/myrepo or https://github.com/myorg
githubConfigUrl: "https://github.com/curefit"
## githubConfigSecret is the k8s secrets to use when auth with GitHub API.
## You can choose to use GitHub App or a PAT token
githubConfigSecret:
  ### GitHub Apps Configuration
  ## NOTE: IDs MUST be strings, use quotes
  #github_app_id: ""
  #github_app_installation_id: ""
  #github_app_private_key: |
  ### GitHub PAT Configuration
  github_token: ""
## If you have a pre-define Kubernetes secret in the same namespace the gha-runner-scale-set is going to deploy,
## you can also reference it via `githubConfigSecret: pre-defined-secret`.
## You need to make sure your predefined secret has all the required secret data set properly.
## For a pre-defined secret using GitHub PAT, the secret needs to be created like this:
## > kubectl create secret generic pre-defined-secret --namespace=my_namespace --from-literal=github_token='ghp_your_pat'
## For a pre-defined secret using GitHub App, the secret needs to be created like this:
## > kubectl create secret generic pre-defined-secret --namespace=my_namespace --from-literal=github_app_id=123456 --from-literal=github_app_installation_id=654321 --from-literal=github_app_private_key='-----BEGIN CERTIFICATE-----*******'
# githubConfigSecret: pre-defined-secret
## proxy can be used to define proxy settings that will be used by the
## controller, the listener and the runner of this scale set.
#
# proxy:
# http:
# url: http://proxy.com:1234
# credentialSecretRef: proxy-auth # a secret with `username` and `password` keys
# https:
# url: http://proxy.com:1234
# credentialSecretRef: proxy-auth # a secret with `username` and `password` keys
# noProxy:
# - example.com
# - example.org
# maxRunners is the max number of runners the autoscaling runner set will scale up to.
# maxRunners: 5
# minRunners is the min number of idle runners. The target number of runners created will be
# calculated as a sum of minRunners and the number of jobs assigned to the scale set.
minRunners: 3
runnerGroup: "arc-runner-kubernetes-ci-arm-large"
# ## name of the runner scale set to create. Defaults to the helm release name
runnerScaleSetName: "arc-runner-kubernetes-ci-arm-large"
## A self-signed CA certificate for communication with the GitHub server can be
## provided using a config map key selector. If `runnerMountPath` is set, for
## each runner pod ARC will:
## - create a `github-server-tls-cert` volume containing the certificate
## specified in `certificateFrom`
## - mount that volume on path `runnerMountPath`/{certificate name}
## - set NODE_EXTRA_CA_CERTS environment variable to that same path
## - set RUNNER_UPDATE_CA_CERTS environment variable to "1" (as of version
## 2.303.0 this will instruct the runner to reload certificates on the host)
##
## If any of the above had already been set by the user in the runner pod
## template, ARC will observe those and not overwrite them.
## Example configuration:
#
# githubServerTLS:
# certificateFrom:
# configMapKeyRef:
# name: config-map-name
# key: ca.crt
# runnerMountPath: /usr/local/share/ca-certificates/
## Container mode is an object that provides out-of-box configuration
## for dind and kubernetes mode. Template will be modified as documented under the
## template object.
##
## If any customization is required for dind or kubernetes mode, containerMode should remain
## empty, and configuration should be applied to the template.
containerMode:
  type: "kubernetes" ## type can be set to dind or kubernetes
  ## the following is required when containerMode.type=kubernetes
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    # For local testing, use https://github.com/openebs/dynamic-localpv-provisioner/blob/develop/docs/quickstart.md to provide dynamic provision volume with storageClassName: openebs-hostpath
    storageClassName: "gp3"
    resources:
      requests:
        storage: 5Gi
# kubernetesModeServiceAccount:
# annotations:
## listenerTemplate is the PodSpec for each listener Pod
## For reference: https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#PodSpec
# listenerTemplate:
# spec:
# containers:
# # Use this section to append additional configuration to the listener container.
# # If you change the name of the container, the configuration will not be applied to the listener,
# # and it will be treated as a side-car container.
# - name: listener
# securityContext:
# runAsUser: 1000
# # Use this section to add the configuration of a side-car container.
# # Comment it out or remove it if you don't need it.
# # Spec for this container will be applied as is without any modifications.
# - name: side-car
# image: example-sidecar
## template is the PodSpec for each runner Pod
## For reference: https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#PodSpec
template:
## template.spec will be modified if you change the container mode
## with containerMode.type=dind, we will populate the template.spec with following pod spec
## template:
## spec:
## initContainers:
## - name: init-dind-externals
## image: ghcr.io/actions/actions-runner:latest
## command: ["cp", "-r", "-v", "/home/runner/externals/.", "/home/runner/tmpDir/"]
## volumeMounts:
## - name: dind-externals
## mountPath: /home/runner/tmpDir
## containers:
## - name: runner
## image: ghcr.io/actions/actions-runner:latest
## command: ["/home/runner/run.sh"]
## env:
## - name: DOCKER_HOST
## value: unix:///var/run/docker.sock
## volumeMounts:
## - name: work
## mountPath: /home/runner/_work
## - name: dind-sock
## mountPath: /var/run
## - name: dind
## image: docker:dind
## args:
## - dockerd
## - --host=unix:///var/run/docker.sock
## - --group=$(DOCKER_GROUP_GID)
## env:
## - name: DOCKER_GROUP_GID
## value: "123"
## securityContext:
## privileged: true
## volumeMounts:
## - name: work
## mountPath: /home/runner/_work
## - name: dind-sock
## mountPath: /var/run
## - name: dind-externals
## mountPath: /home/runner/externals
## volumes:
## - name: work
## emptyDir: {}
## - name: dind-sock
## emptyDir: {}
## - name: dind-externals
## emptyDir: {}
######################################################################################################
## with containerMode.type=kubernetes, we will populate the template.spec with following pod spec
## template:
## spec:
## containers:
## - name: runner
## image: ghcr.io/actions/actions-runner:latest
## command: ["/home/runner/run.sh"]
## resources:
## requests:
## memory: "4Gi"
## cpu: "2"
## limits:
## memory: "6Gi"
## cpu: "4"
## env:
## - name: ACTIONS_RUNNER_CONTAINER_HOOKS
## value: /home/runner/k8s/index.js
## - name: ACTIONS_RUNNER_POD_NAME
## valueFrom:
## fieldRef:
## fieldPath: metadata.name
## - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
## value: "true"
## volumeMounts:
## - name: work
## mountPath: /home/runner/_work
## volumes:
## - name: work
## ephemeral:
## volumeClaimTemplate:
## spec:
## accessModes: [ "ReadWriteOnce" ]
## storageClassName: "local-path"
## resources:
## requests:
## storage: 1Gi
  spec:
    securityContext:
      fsGroup: 1001
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "false"
    nodeSelector:
      purpose: github-actions-arm-large
    tolerations:
      - key: purpose
        operator: Equal
        value: github-actions-arm-large
        effect: NoSchedule
## Optional controller service account that needs to have required Role and RoleBinding
## to operate this gha-runner-scale-set installation.
## The helm chart will try to find the controller deployment and its service account at installation time.
## In case the helm chart can't find the right service account, you can explicitly pass in the following value
## to help it finish RoleBinding with the right service account.
## Note: if your controller is installed to only watch a single namespace, you have to pass these values explicitly.
# controllerServiceAccount:
# namespace: arc-system
# name: test-arc-gha-runner-scale-set-controller
And I have specifically mentioned the resources in the kubernetes section:
## resources:
##   requests:
##     memory: "4Gi"
##     cpu: "2"
##   limits:
##     memory: "6Gi"
##     cpu: "4"
Controller Logs
https://gist.github.com/kanakaraju17/31a15aa0a1b5a04fb7eaab6996c02d40
[this is not related to the resource request constraint for the runner pods]
Runner Pod Logs
https://gist.github.com/kanakaraju17/c33c0012f80a48a1e4504bd241c278cc
You need to define those in your pod template, after declaring the pod template YAML in the scale-set runner values.yaml. (Terraform below, btw)
- name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
  value: /home/runner/pod-templates/default.yml
Hey @jonathan-fileread, is there a way to configure this in the default values.yaml file provided with the gha-runner-scale-set charts?
@kanakaraju17 Hey Kanaka, unfortunately not. You need to create a separate pod template in order to define the workflow pod, as the values.yaml only defines the runner pod settings.
@jonathan-fileread, any idea why the file is not getting mounted in the runner pods? I'm using the following configuration and encountering the error below:
## template is the PodSpec for each runner Pod
## For reference: https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#PodSpec
template:
  # with containerMode.type=kubernetes, we will populate the template.spec with following pod spec
  template:
    spec:
      securityContext:
        fsGroup: 123
      containers:
        - name: runner
          image: ghcr.io/actions/actions-runner:latest
          command: ["/home/runner/run.sh"]
          env:
            - name: ACTIONS_RUNNER_CONTAINER_HOOKS
              value: /home/runner/pod-templates/default.yml
            - name: ACTIONS_RUNNER_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
              value: "false"
          volumeMounts:
            - name: work
              mountPath: /home/runner/_work
            - name: pod-templates
              mountPath: /home/runner/pod-templates
              readOnly: true
      volumes:
        - name: work
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes: [ "ReadWriteOnce" ]
                storageClassName: "gp3"
                resources:
                  requests:
                    storage: 1Gi
        - name: pod-templates
          configMap:
            name: runner-pod-template
ConfigMap Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: runner-pod-template
data:
  default.yml: |
    apiVersion: v1
    kind: PodTemplate
    metadata:
      name: runner-pod-template
    spec:
      containers:
        - name: "$job"
          resources:
            limits:
              cpu: "3000m"
            requests:
              cpu: "3000m"
The pods fail and end up with the below error:
Error: Error: ENOENT: no such file or directory, open '/home/runner/pod-templates/default.yml'
Error: Process completed with exit code 1.
Have you tried recreating it in your environment? Have you come across this error before? It seems to be a mounting issue where the file is not found.
@kanakaraju17 You can follow the official guide which worked for me at least :)
https://docs.github.com/en/[email protected]/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/deploying-runner-scale-sets-with-actions-runner-controller#understanding-runner-container-hooks
In your case that would be something like:
ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: hook-extension
data:
  content: |
    spec:
      containers:
        - name: "$job"
          resources:
            limits:
              cpu: "3000m"
            requests:
              cpu: "3000m"
Usage:
template:
  spec:
    containers:
      - name: runner
        ...
        env:
          ...
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/pod-template/content
        volumeMounts:
          ...
          - name: pod-template
            mountPath: /home/runner/pod-template
            readOnly: true
    volumes:
      ...
      - name: pod-template
        configMap:
          name: hook-extension
Hey @georgblumenschein, deploying the gha-runner-scale-set with the env variables below added doesn't seem to take effect.
template:
  template:
    spec:
      containers:
        - name: runner
          image: ghcr.io/actions/actions-runner:latest
          command: ["/home/runner/run.sh"]
          env:
            - name: ACTIONS_RUNNER_CONTAINER_HOOKS
              value: /home/runner/k8s/index.js
            - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
              value: /home/runner/pod-template/content
            - name: ACTIONS_RUNNER_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
              value: "true"
Additional ENV Variable Added:
- name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
  value: /home/runner/pod-template/content
The workflow pods should include the ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE environment variable and the volume mount, but neither appears when describing the pods; the output is currently missing this variable.
Expected Result:
The ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE environment variable and the volume mounts in the workflow pods should be present.
Below is the values.yaml template used to append the environment variable:
template:
  template:
    spec:
      containers:
        - name: runner
          image: ghcr.io/actions/actions-runner:latest
          command: ["/home/runner/run.sh"]
          env:
            - name: ACTIONS_RUNNER_CONTAINER_HOOKS
              value: /home/runner/k8s/index.js
            - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
              value: /home/runner/pod-template/content
            - name: ACTIONS_RUNNER_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
              value: "true"
          volumeMounts:
            - name: work
              mountPath: /home/runner/_work
            - name: pod-template
              mountPath: /home/runner/pod-template
              readOnly: true
      volumes:
        - name: work
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes: [ "ReadWriteOnce" ]
                storageClassName: "local-path"
                resources:
                  requests:
                    storage: 1Gi
        - name: pod-template
          configMap:
            name: hook-extension
Problem: The pods should have the volumes mounted with the config map and the specified environment variables added. However, this is not happening as expected.
Current Output:
Describing the AutoscalingRunnerSet doesn't show the added ENV variables either.
Name:         arc-runner-kubernetes-ci-arm-large
Namespace:    arc-runners-kubernetes-arm
Labels:       actions.github.com/organization=curefit
              actions.github.com/scale-set-name=arc-runner-kubernetes-ci-arm-large
              actions.github.com/scale-set-namespace=arc-runners-kubernetes-arm
              app.kubernetes.io/component=autoscaling-runner-set
              app.kubernetes.io/instance=arc-runner-kubernetes-ci-arm-large
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=arc-runner-kubernetes-ci-arm-large
              app.kubernetes.io/part-of=gha-rs
              app.kubernetes.io/version=0.9.3
              helm.sh/chart=gha-rs-0.9.3
Annotations:  actions.github.com/cleanup-kubernetes-mode-role-binding-name: arc-runner-kubernetes-ci-arm-large-gha-rs-kube-mode
              actions.github.com/cleanup-kubernetes-mode-role-name: arc-runner-kubernetes-ci-arm-large-gha-rs-kube-mode
              actions.github.com/cleanup-kubernetes-mode-service-account-name: arc-runner-kubernetes-ci-arm-large-gha-rs-kube-mode
              actions.github.com/cleanup-manager-role-binding: arc-runner-kubernetes-ci-arm-large-gha-rs-manager
              actions.github.com/cleanup-manager-role-name: arc-runner-kubernetes-ci-arm-large-gha-rs-manager
              actions.github.com/runner-group-name: arc-runner-kubernetes-ci-arm-large
              actions.github.com/runner-scale-set-name: arc-runner-kubernetes-ci-arm-large
              actions.github.com/values-hash: 8b5caae634d958cc7d295b3166c151d036c7896d2b6165bf908a6a24aec5320
              meta.helm.sh/release-name: arc-runner-set-kubernetes-arm-large
              meta.helm.sh/release-namespace: arc-runners-kubernetes-arm
              runner-scale-set-id: 76
API Version:  actions.github.com/v1alpha1
Kind:         AutoscalingRunnerSet
Metadata:
  Creation Timestamp:  2024-07-16T09:49:56Z
  Finalizers:
    autoscalingrunnerset.actions.github.com/finalizer
  Generation:        1
  Resource Version:  577760766
  UID:               165f74f7-875c-4b8f-a214-96948ec38467
Spec:
  Github Config Secret:  github-token
  Github Config URL:     https://github.com/curefit
  Listener Template:
    Spec:
      Containers:
        Name:  listener
        Resources:
          Limits:
            Cpu:     500m
            Memory:  500Mi
          Requests:
            Cpu:     250m
            Memory:  250Mi
      Node Selector:
        Purpose:  github-actions
      Tolerations:
        Effect:    NoSchedule
        Key:       purpose
        Operator:  Equal
        Value:     github-actions
  Min Runners:            2
  Runner Group:           arc-runner-kubernetes-ci-arm-large
  Runner Scale Set Name:  arc-runner-kubernetes-ci-arm-large
  Template:
    Spec:
      Containers:
        Command:
          /home/runner/run.sh
        Env:
          Name:   ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
          Value:  false
          Name:   ACTIONS_RUNNER_CONTAINER_HOOKS
          Value:  /home/runner/k8s/index.js
          Name:   ACTIONS_RUNNER_POD_NAME
          Value From:
            Field Ref:
              Field Path:  metadata.name
        Image:  ghcr.io/actions/actions-runner:latest
        Name:   runner
        Volume Mounts:
          Mount Path:  /home/runner/_work
          Name:        work
      Node Selector:
        Purpose:  github-actions
      Restart Policy:  Never
      Security Context:
        Fs Group:            1001
      Service Account Name:  arc-runner-kubernetes-ci-arm-large-gha-rs-kube-mode
      Tolerations:
        Effect:    NoSchedule
        Key:       purpose
        Operator:  Equal
        Value:     github-actions
      Volumes:
        Ephemeral:
          Volume Claim Template:
            Spec:
              Access Modes:
                ReadWriteOnce
              Resources:
                Requests:
                  Storage:         5Gi
              Storage Class Name:  gp3
        Name:  work
Status:
  Current Runners:            2
  Pending Ephemeral Runners:  2
Events:  <none>
Below is the configmap file which is being used:
apiVersion: v1
kind: ConfigMap
metadata:
  name: hook-extension
  namespace: arc-runners-kubernetes-arm
data:
  content: |
    spec:
      containers:
        - name: "$job"
          resources:
            limits:
              cpu: "3000m"
            requests:
              cpu: "3000m"
Expected behavior: the ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE ENV variable and the volume mounts get added to the pods that come up.
Hey @kanakaraju17 ,
After 2 days of trial and error I managed to get a working scenario with resource limits applied. Funny thing is we were overcomplicating it using the "hook-extensions". All we need to do is add it in the template.spec.containers[0].resources.requests/limits section.
Below is a snippet of the values to pass into Helm (although I am using a HelmRelease with FluxCD, the principle still applies):
values:
  containerMode:
    type: "kubernetes"
    kubernetesModeWorkVolumeClaim:
      accessModes: ["ReadWriteOnce"]
      storageClassName: "standard"
      resources:
        requests:
          storage: 10Gi
  githubConfigSecret: gh-secret
  githubConfigUrl: "https://github.com/<Organisation>"
  runnerGroup: "k8s-nonprod"
  runnerScaleSetName: "self-hosted-k8s" # used as a runner label
  minRunners: 1
  maxRunners: 10
  template:
    spec:
      securityContext:
        fsGroup: 1001
      imagePullSecrets:
        - name: cr-secret
      containers:
        - name: runner
          image: ghcr.io/actions/actions-runner:latest
          command: ["/home/runner/run.sh"]
          resources:
            limits:
              cpu: "2000m"
              memory: "5Gi"
            requests:
              cpu: "200m"
              memory: "512Mi"
I have confirmed that this has been working for me with some CodeQL workflows failing due to "insufficient RAM" lol.
Hope it helps.
@marcomarques-bt, I assume the above configuration works only for the runner pods and not for the pods where the workflow actually runs, i.e. the workflow pods.
Refer to the image below: the configuration applies to the first pod but not to the second pod, where the actual job runs.
It seems that, similar to the issue mentioned earlier, toleration cannot be configured either.
:wave: Hey, thanks for opening this topic.
I have managed to get this going, but we have some large runners and we ran into an issue where, if there are no resources available on the node, the workflow pod fails to schedule...
Error: Error: pod failed to come online with error: Error: Pod lendable-large-x64-linux-dev-h8727-runner-thwgr-workflow is unhealthy with phase status Failed
and it needs to be scheduled on the same node as the runner because of the PVC. This whole thing doesn't make much sense. We want people to be able to specify, for example, a large runner in Kubernetes mode, and in the end we get an idle pod that just tries to spin up a new pod.
@kanakaraju17 thanks for opening this issue. Did you ever find a mechanism to enforce resource limits?
For @cboettig and those following this thread and the interaction between @kanakaraju17 and @georgblumenschein, I have made it work with the following configuration. I am sharing it as JSON since it makes it clearer that the ConfigMap is properly formatted:
ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: hook-extension
data:
  content: |
    {
      "spec": {
        "containers": [
          {
            "name": "$job",
            "resources": {
              "requests": {
                "cpu": "1000m",
                "memory": "1Gi"
              },
              "limits": {
                "cpu": "2000m",
                "memory": "2Gi"
              }
            }
          }
        ]
      }
    }
runner-scaleset values:
template:
  spec:
    containers:
      - name: runner
        env:
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/pod-template/content
        volumeMounts:
          - name: pod-template
            mountPath: /home/runner/pod-template
            readOnly: true
    volumes:
      - name: pod-template
        configMap:
          name: hook-extension
This will add the resource requests and limits only for the workflow pods, without wasting resources on runner pods.
That is all really good, but the moment you set resources on the workflow pod and there is no space on the node hosting the controller pod, you are out of luck... It won't wait for resources to become available, it will just fail. We are in the process of evaluating the option to use the kube scheduler, but that requires changing the PVC to RWX, which is expensive and has its limitations. We are in AWS and have tried EFS and IO2, but neither works well.
GitHub should really implement this properly as it is really handicapped at the moment.
@velkovb you are right, by setting requests on the workflow pod but not on the controller pod, we quickly ran into that issue: the controller pod always has room in the node, but the whole action fails if there's no room for its corresponding workflow pod.
So far we have worked around it by assigning requests to the controller pod and none to the workflow one. That way the workflow pod always has room, and we count on it cannibalizing the resources assigned to the controller pod, since the controller is very lightweight. This is not an ideal solution, but it's the best we can come up with without RWX.
What issues have you experienced with IO2? That was my next alternative to try, so we can use kube scheduler and not worry about controller and workflow pods having to land on the same node.
@sqr
So far we have worked around it by assigning requests on the controller pod, and none on the workflow one. That way the workflow pod always has room, and we count on it cannibalizing the resources assigned to the controller pod, since the controller is very lightweight. This is not an ideal solution but the best we can come up without RWX.
I don't think I get how that works. If you set requests for the controller pod, won't it actually reserve it for that pod and not give it to anything else? I would see it work for CPU but not sure it does for memory?
What issues have you experienced with IO2? That was my next alternative to try, so we can use kube scheduler and not worry about controller and workflow pods having to land on the same node.
Multi-attach works in block mode, and the volumeMounts that the hooks do for the workflow pods involve a lot of path mapping that would be hard to replicate with no guaranteed result; you also still have the AZ restriction, as EBS is zonal. EFS was just slow.
@velkovb
I don't think I get how that works. If you set requests for the controller pod, won't it actually reserve it for that pod and not give it to anything else? I would see it work for CPU but not sure it does for memory?
The requests guarantee that the specified amount of CPU is available at scheduling time, but if the workflow pod needs CPU time and the controller is idle, it will take it from it. This is not the case for memory, which is why I have only set requests for CPU.
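A minimal sketch of that workaround, reusing the runner container from the earlier snippets (values are illustrative, not a verified config): only a CPU request is set on the runner container, and no hook template or workflow-pod resources are configured, so the workflow pod always has room and borrows the CPU reserved by the idle runner.
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        resources:
          requests:
            cpu: "2"  # CPU reserved on the node via the runner container
            # intentionally no memory request here and no hook template,
            # so the workflow pod starts without its own requests/limits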
@velkovb We've migrated to a RWX setup with a NFS CSI storage class to avoid the multi-attach error of RWO - however we're experiencing slowness with workflow pods being provisioned (usually takes 3 minutes per github action job).
I suspect it has something to do with FS slowness (not sure if its provisioning, or just using it in general). Do you have any recommendations?
We've opened a ticket here https://github.com/actions/runner-container-hooks/issues/207
My findings were that the slowness was in the pre-setup function while it is copying the workspace - https://github.com/actions/runner-container-hooks/blob/main/packages/k8s/src/hooks/prepare-job.ts#L184
The first log message I see after the container starts is - https://github.com/actions/runner-container-hooks/blob/main/packages/k8s/src/hooks/prepare-job.ts#L45.
The slowness is not in PVC provisioning as that goes really fast. That workspace seems to be only ~250MB so not sure why it is so slow.
This is also bothering me: how is something so seemingly basic not a standard option out of the box? I'm considering trying the following approaches to ensure the workflow pod will fit (resource-wise) and be scheduled onto the same node:
- Use CPU requests on the controller pod combined with affinity rules, and memory requests on the workflow pod, to ensure it gets scheduled on the same node (a rough sketch follows this list).
- Maintain consistent resource ratios. For example, if your nodes are 8 cores x 32 GB of memory, ensure a 1:4 CPU-to-memory ratio. This means increasing CPU allocations proportionally to memory, even if the CPU isn’t fully needed. For instance, if a workload requires only 1 core and 16 GB of memory, specify 4 cores to match the ratio. While this approach can be wasteful, you could configure Karpenter with appropriate node sizes that align with these ratios.
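A hedged sketch of that first idea, combining the hook-extension ConfigMap pattern shown earlier (memory request on the workflow pod) with a CPU-only request on the runner container; the numbers are illustrative and the affinity rules are omitted, so this is not a verified configuration.
# hook-extension ConfigMap (workflow-pod side): memory request only
data:
  content: |
    spec:
      containers:
        - name: "$job"
          resources:
            requests:
              memory: "8Gi"  # illustrative value
# scale-set values.yaml (runner-pod side): CPU request only
template:
  spec:
    containers:
      - name: runner
        resources:
          requests:
            cpu: "2"  # illustrative value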
To address different workload needs, I’m planning to define multiple runner scale sets with varying sizes, allowing developers to select the one that fits their requirements:
- General Purpose (gp):
  - x-small-gp: 500m, 2 GB memory
  - small-gp: 1 core, 4 GB memory
  - medium-gp: 2 cores, 8 GB memory
- High Memory (hm):
  - x-small-hm: 500m, 4 GB memory
  - small-hm: 1 core, 8 GB memory
  - medium-hm: 2 cores, 16 GB memory
- High CPU (hc):
  - x-small-hc: 500m, 1 GB memory
  - small-hc: 1 core, 2 GB memory
  - medium-hc: 2 cores, 4 GB memory
Each class maps to specific instance types:
- gp: General purpose instances (e.g., m7i.xlarge)
- hm: Memory-optimized instances (e.g., r7i.xlarge)
- hc: Compute-optimized instances (e.g., c7i.xlarge)
Has anyone else approached this problem in a similar way? If so, I’d love to hear any pointers or lessons learned. Also, if anyone sees potential holes in my plan or areas for improvement, please let me know!
@jasonwbarnett We started with a similar, more granular approach to resource ratios, but noticed that it was not followed strictly and the Kubernetes nodes actually had a lot of free resources. Besides what you have in mind, we had a further breakdown for dind and arm runners, so we ended up with a lot of different scale sets (more than 30). In the end we went with more of a T-shirt sizing approach, with just small, medium and large (both x64 and arm64) and options with and without dind. No resource limits are set on the pods, so they can utilize whatever is free on the nodes. We didn't want to separate them onto different node pools, as that would decrease efficiency, and we are running one large spot node pool.
The idea with having CPU on the controller and memory on the workflow pod is a good workaround :)
@velkovb Thanks for sharing your experience! It's interesting to hear that a more granular approach led to inefficiencies due to underutilized resources. I can definitely see how managing over 30 scale sets could become unwieldy. Your T-shirt sizing approach with small, medium, and large options (and differentiating x64/arm64 and dind/non-dind) sounds like a practical way to simplify things while still offering flexibility. Did you map the scale sets to specific Karpenter node pools, or how did you handle that?
I hadn’t considered not setting resource limits on the pods to allow them to utilize free node resources—that’s an intriguing idea. I imagine it works well with your setup of a single large spot node pool, as you can maximize utilization without worrying too much about strict separation.
Thanks for the feedback on the controller CPU and workflow pod memory approach! I’ll experiment further with these ideas and keep the potential for over-complication in mind. If you don’t mind me asking, how do you handle scaling with the T-shirt sizes—do you find it works well with just the large node pool, or are there edge cases where it gets tricky?
@jasonwbarnett Just one large spot node pool, and we have an on-demand one for edge cases of really long-running jobs. We were monitoring our nodes and resource usage was rarely going above 30%, which made us try no resource limits. For workloads we always set memory request = memory limit, but here, due to the short lifetime of a job pod, we believe it won't be a problem. We run roughly 20k jobs a day and so far it seems to be working fine :) In our config we have some warm runners for the most-used types, and we use overprovisioning to keep 1-2 warm nodes (as that is usually the slowest thing).
Closing this issue since it is not related to ARC, and the comment at https://github.com/actions/actions-runner-controller/issues/3641#issuecomment-2574223158 describes the intended way to set requests/limits for workflow pods. Thank you for providing a solution!
@nikola-jokic I would argue that the issue should stay open. There is still no way to set workflow pod resources. This is just a hacky workaround that loads a configuration file and is in no way dynamic. The final solution should allow us to configure the workflow pod in the actual GitHub workflow (or get rid of this workflow-pod nonsense in general :) ).
Hey @velkovb,
I completely agree with you here, there is currently no way to specify the workflow pod requests/limits. However, ARC is not responsible for it; the container hook is. ARC doesn't have any special handling of a runner configured in containerMode kubernetes. It simply provides the out-of-the-box spec that can be used to configure the runner with the container hook.
Can you please help me understand what you mean by getting rid of the workflow pod? The workflow pod is the host container that the hook execs into and runs the commands. That was the initial purpose of Kubernetes mode: to avoid running the workflow container using dind. If you want to run dind, or configure it rootless, you certainly can do that. But the workflow pod is definitely a requirement for k8s mode, so I don't see how we can get rid of it.
To provide more context, to set requests/limits from a workflow, we would have to either extend the workflow syntax to support these cases, or use a subset of docker options and translate them to the workflow spec. The first approach requires touching multiple parts of the system, and should be planned for and prioritized. The second approach is kind of a hack. We only need a subset of options to specify requests and limits. The container hook extension was the quick approach (and definitely not a complete one, I admit that) to limit workflow pods on your cluster, to provide a way to configure a service account for them, etc.
To re-iterate, container mode issues should not be opened in this repo, but rather in the runner-container-hooks repo. Feature requests and feedback should always be submitted on the Community Support Forum. Please submit your feedback there and thank you for participating in these discussions!
I meant that the way Kubernetes mode works for GitHub Actions doesn't feel natural. I would expect that when I specify a container for my job, the job runs directly inside that container; I wouldn't expect a middle man (a controller pod that spins up a workflow pod...). Other CI systems support that, and it makes things a lot simpler.
P.S. Furthermore, comments like "this is not an ARC issue" increase the feeling that, even though GitHub adopted the project, it is not a first-class citizen. Issues are being closed, PRs are not merged; the general impression is not good.
But this is the issue board for ARC; it doesn't make sense to me to keep issues related to other projects here. The container hook is a separate project. The runner is also a separate project. The responsibility of ARC is to spin up runners and scale based on demand. Let's say there is a runner issue: it should be submitted to the runner repo. It doesn't matter that ARC is the one spinning up these runners. If this were a container hook issue, I would transfer it to the container hook repo. But this is an enhancement, so I pointed out where it should be submitted.
I fully back what @velkovb is saying.
It's extremely inconvenient to set up ARC for an organization where each repository has its own CI jobs with a custom podSpec and, most importantly, CPU/memory requirements for each CI job.
Not to mention that CPU/memory requirements can change within a pull request, e.g. when new heavy tests or a big build step are added. What are you going to do with ARC then?
- Ask the k8s DevOps team in the org to add another type of runner (small/medium/etc.)?
- ARC CI configuration breaks GitOps: it's detached from the actual repo that uses it, which violates so many DevOps principles.
Many other CI systems support per-job podSpec for their K8S controllers without issues:
- https://buildkite.com/docs/agent/v3/agent-stack-k8s/podspec
- https://github.com/EmbarkStudios/k8s-buildkite-plugin (I actually implemented this one in 300 lines of Bash with podSpecPatch per Job!)
- https://www.jenkins.io/doc/pipeline/steps/kubernetes/#podtemplate-define-a-podtemplate-to-use-in-the-kubernetes-plugin
- https://tekton.dev/vault/pipelines-main/compute-resources/?utm_source=chatgpt.com#configure-task-level-compute-resources
- https://argo-workflows.readthedocs.io/en/latest/fields/
Just to list a few.
Instead of maintaining "Pools" of runners, their controllers simply schedule K8S pods/jobs per job with binding parameters so that the CI agent only picks up the required job.
I think a per-job podSpec patch must be considered as part of the ARC architecture for a truly native K8s integration.