actions-runner-controller
Cannot set resources Requests and Limits for workflow pods
Checks
- [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- [X] I am using charts that are officially provided
Controller Version
0.9.2
Deployment Method
Helm
Checks
- [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
- [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
1. Deploy the gha-runner-scale-set-controller first with the command below.
helm install arc . -f values.yaml -n arc-systems
2. Deploy the gha-runner-scale-set with Kubernetes mode enabled.
helm install arc-runner-set . -f values-kubernetes.yaml -n arc-runners
Ideal scenario: the workflow pods that come up should have the requested resources and limits set.
Describe the bug
The pods whose names end with "workflow" should have the specified CPU and memory resource requests and limits when they are created.
## resources:
##   requests:
##     memory: "4Gi"
##     cpu: "2"
##   limits:
##     memory: "6Gi"
##     cpu: "4"
Describe the expected behavior
The workflow pod that is created during the pipeline execution should have specific CPU and memory limits and requests set. However, it is not starting with the specified resources and limits.
Additionally, an extra pod is being created when the pipeline runs, alongside the existing runner pods. We need to understand the purpose of the existing runner pod if a new pod is also being initiated. Added the detail of the extra pod in the screenshot below.
Additional Context
Adding the values.yaml file for the gha-runner-scale-set below.
## githubConfigUrl is the GitHub url for where you want to configure runners
## ex: https://github.com/myorg/myrepo or https://github.com/myorg
githubConfigUrl: "https://github.com/curefit"
## githubConfigSecret is the k8s secrets to use when auth with GitHub API.
## You can choose to use GitHub App or a PAT token
githubConfigSecret:
  ### GitHub Apps Configuration
  ## NOTE: IDs MUST be strings, use quotes
  #github_app_id: ""
  #github_app_installation_id: ""
  #github_app_private_key: |
  ### GitHub PAT Configuration
  github_token: ""
## If you have a pre-define Kubernetes secret in the same namespace the gha-runner-scale-set is going to deploy,
## you can also reference it via `githubConfigSecret: pre-defined-secret`.
## You need to make sure your predefined secret has all the required secret data set properly.
## For a pre-defined secret using GitHub PAT, the secret needs to be created like this:
## > kubectl create secret generic pre-defined-secret --namespace=my_namespace --from-literal=github_token='ghp_your_pat'
## For a pre-defined secret using GitHub App, the secret needs to be created like this:
## > kubectl create secret generic pre-defined-secret --namespace=my_namespace --from-literal=github_app_id=123456 --from-literal=github_app_installation_id=654321 --from-literal=github_app_private_key='-----BEGIN CERTIFICATE-----*******'
# githubConfigSecret: pre-defined-secret
## proxy can be used to define proxy settings that will be used by the
## controller, the listener and the runner of this scale set.
#
# proxy:
# http:
# url: http://proxy.com:1234
# credentialSecretRef: proxy-auth # a secret with `username` and `password` keys
# https:
# url: http://proxy.com:1234
# credentialSecretRef: proxy-auth # a secret with `username` and `password` keys
# noProxy:
# - example.com
# - example.org
# maxRunners is the max number of runners the autoscaling runner set will scale up to.
# maxRunners: 5
# minRunners is the min number of idle runners. The target number of runners created will be
# calculated as a sum of minRunners and the number of jobs assigned to the scale set.
minRunners: 3
runnerGroup: "arc-runner-kubernetes-ci-arm-large"
# ## name of the runner scale set to create. Defaults to the helm release name
runnerScaleSetName: "arc-runner-kubernetes-ci-arm-large"
## A self-signed CA certificate for communication with the GitHub server can be
## provided using a config map key selector. If `runnerMountPath` is set, for
## each runner pod ARC will:
## - create a `github-server-tls-cert` volume containing the certificate
## specified in `certificateFrom`
## - mount that volume on path `runnerMountPath`/{certificate name}
## - set NODE_EXTRA_CA_CERTS environment variable to that same path
## - set RUNNER_UPDATE_CA_CERTS environment variable to "1" (as of version
## 2.303.0 this will instruct the runner to reload certificates on the host)
##
## If any of the above had already been set by the user in the runner pod
## template, ARC will observe those and not overwrite them.
## Example configuration:
#
# githubServerTLS:
# certificateFrom:
# configMapKeyRef:
# name: config-map-name
# key: ca.crt
# runnerMountPath: /usr/local/share/ca-certificates/
## Container mode is an object that provides out-of-box configuration
## for dind and kubernetes mode. Template will be modified as documented under the
## template object.
##
## If any customization is required for dind or kubernetes mode, containerMode should remain
## empty, and configuration should be applied to the template.
containerMode:
  type: "kubernetes" ## type can be set to dind or kubernetes
  ## the following is required when containerMode.type=kubernetes
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    # For local testing, use https://github.com/openebs/dynamic-localpv-provisioner/blob/develop/docs/quickstart.md to provide dynamic provision volume with storageClassName: openebs-hostpath
    storageClassName: "gp3"
    resources:
      requests:
        storage: 5Gi
# kubernetesModeServiceAccount:
# annotations:
## listenerTemplate is the PodSpec for each listener Pod
## For reference: https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#PodSpec
# listenerTemplate:
# spec:
# containers:
# # Use this section to append additional configuration to the listener container.
# # If you change the name of the container, the configuration will not be applied to the listener,
# # and it will be treated as a side-car container.
# - name: listener
# securityContext:
# runAsUser: 1000
# # Use this section to add the configuration of a side-car container.
# # Comment it out or remove it if you don't need it.
# # Spec for this container will be applied as is without any modifications.
# - name: side-car
# image: example-sidecar
## template is the PodSpec for each runner Pod
## For reference: https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#PodSpec
template:
## template.spec will be modified if you change the container mode
## with containerMode.type=dind, we will populate the template.spec with following pod spec
## template:
## spec:
## initContainers:
## - name: init-dind-externals
## image: ghcr.io/actions/actions-runner:latest
## command: ["cp", "-r", "-v", "/home/runner/externals/.", "/home/runner/tmpDir/"]
## volumeMounts:
## - name: dind-externals
## mountPath: /home/runner/tmpDir
## containers:
## - name: runner
## image: ghcr.io/actions/actions-runner:latest
## command: ["/home/runner/run.sh"]
## env:
## - name: DOCKER_HOST
## value: unix:///var/run/docker.sock
## volumeMounts:
## - name: work
## mountPath: /home/runner/_work
## - name: dind-sock
## mountPath: /var/run
## - name: dind
## image: docker:dind
## args:
## - dockerd
## - --host=unix:///var/run/docker.sock
## - --group=$(DOCKER_GROUP_GID)
## env:
## - name: DOCKER_GROUP_GID
## value: "123"
## securityContext:
## privileged: true
## volumeMounts:
## - name: work
## mountPath: /home/runner/_work
## - name: dind-sock
## mountPath: /var/run
## - name: dind-externals
## mountPath: /home/runner/externals
## volumes:
## - name: work
## emptyDir: {}
## - name: dind-sock
## emptyDir: {}
## - name: dind-externals
## emptyDir: {}
######################################################################################################
## with containerMode.type=kubernetes, we will populate the template.spec with following pod spec
## template:
## spec:
## containers:
## - name: runner
## image: ghcr.io/actions/actions-runner:latest
## command: ["/home/runner/run.sh"]
## resources:
## requests:
## memory: "4Gi"
## cpu: "2"
## limits:
## memory: "6Gi"
## cpu: "4"
## env:
## - name: ACTIONS_RUNNER_CONTAINER_HOOKS
## value: /home/runner/k8s/index.js
## - name: ACTIONS_RUNNER_POD_NAME
## valueFrom:
## fieldRef:
## fieldPath: metadata.name
## - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
## value: "true"
## volumeMounts:
## - name: work
## mountPath: /home/runner/_work
## volumes:
## - name: work
## ephemeral:
## volumeClaimTemplate:
## spec:
## accessModes: [ "ReadWriteOnce" ]
## storageClassName: "local-path"
## resources:
## requests:
## storage: 1Gi
  spec:
    securityContext:
      fsGroup: 1001
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "false"
    nodeSelector:
      purpose: github-actions-arm-large
    tolerations:
      - key: purpose
        operator: Equal
        value: github-actions-arm-large
        effect: NoSchedule
## Optional controller service account that needs to have required Role and RoleBinding
## to operate this gha-runner-scale-set installation.
## The helm chart will try to find the controller deployment and its service account at installation time.
## In case the helm chart can't find the right service account, you can explicitly pass in the following value
## to help it finish RoleBinding with the right service account.
## Note: if your controller is installed to only watch a single namespace, you have to pass these values explicitly.
# controllerServiceAccount:
# namespace: arc-system
# name: test-arc-gha-runner-scale-set-controller
And I have specifically mentioned the resources in the kubernetes section:
## resources:
##   requests:
##     memory: "4Gi"
##     cpu: "2"
##   limits:
##     memory: "6Gi"
##     cpu: "4"
Controller Logs
https://gist.github.com/kanakaraju17/31a15aa0a1b5a04fb7eaab6996c02d40
[this is not related to the resource request constraint for the runner pods]
Runner Pod Logs
https://gist.github.com/kanakaraju17/c33c0012f80a48a1e4504bd241c278cc
You need to define those in your pod template, after declaring the pod template YAML in the scale-set runner values.yaml. (Terraform below, btw)
- name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
  value: /home/runner/pod-templates/default.yml
Hey @jonathan-fileread, is there a way to configure this in the default values.yaml file provided with the gha-runner-scale-set charts?
@kanakaraju17 Hey Kanaka, unfortunately not. You need to create a separate pod template in order to define the workflow pod, as the values.yaml only defines the runner pod settings.
@jonathan-fileread, any idea why the file is not getting mounted in the runner pods? I'm using the following configuration and encountering the error below:
## template is the PodSpec for each runner Pod
## For reference: https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#PodSpec
template:
  # with containerMode.type=kubernetes, we will populate the template.spec with following pod spec
  template:
    spec:
      securityContext:
        fsGroup: 123
      containers:
        - name: runner
          image: ghcr.io/actions/actions-runner:latest
          command: ["/home/runner/run.sh"]
          env:
            - name: ACTIONS_RUNNER_CONTAINER_HOOKS
              value: /home/runner/pod-templates/default.yml
            - name: ACTIONS_RUNNER_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
              value: "false"
          volumeMounts:
            - name: work
              mountPath: /home/runner/_work
            - name: pod-templates
              mountPath: /home/runner/pod-templates
              readOnly: true
      volumes:
        - name: work
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes: [ "ReadWriteOnce" ]
                storageClassName: "gp3"
                resources:
                  requests:
                    storage: 1Gi
        - name: pod-templates
          configMap:
            name: runner-pod-template
ConfigMap Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: runner-pod-template
data:
  default.yml: |
    apiVersion: v1
    kind: PodTemplate
    metadata:
      name: runner-pod-template
    spec:
      containers:
        - name: "$job"
          resources:
            limits:
              cpu: "3000m"
            requests:
              cpu: "3000m"
The pods fail and end up with the below error:
Error: Error: ENOENT: no such file or directory, open '/home/runner/pod-templates/default.yml'
Error: Process completed with exit code 1.
Have you tried recreating it in your environment? Have you come across this error before? It seems to be a mounting issue where the file is not found.
@kanakaraju17 You can follow the official guide which worked for me at least :)
https://docs.github.com/en/[email protected]/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/deploying-runner-scale-sets-with-actions-runner-controller#understanding-runner-container-hooks
In your case that would be something like:
ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: hook-extension
data:
  content: |
    spec:
      containers:
        - name: "$job"
          resources:
            limits:
              cpu: "3000m"
            requests:
              cpu: "3000m"
Usage:
template:
  spec:
    containers:
      - name: runner
        ...
        env:
          ...
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/pod-template/content
        volumeMounts:
          ...
          - name: pod-template
            mountPath: /home/runner/pod-template
            readOnly: true
    volumes:
      ...
      - name: pod-template
        configMap:
          name: hook-extension
Hey @georgblumenschein, deploying the gha-runner-scale-set with the env variables below added doesn't seem to take effect.
template:
  template:
    spec:
      containers:
        - name: runner
          image: ghcr.io/actions/actions-runner:latest
          command: ["/home/runner/run.sh"]
          env:
            - name: ACTIONS_RUNNER_CONTAINER_HOOKS
              value: /home/runner/k8s/index.js
            - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
              value: /home/runner/pod-template/content
            - name: ACTIONS_RUNNER_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
              value: "true"
Additional ENV Variable Added:
- name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
  value: /home/runner/pod-template/content
The workflow pods should include the ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE environment variable and the volume mount, but neither appears when describing the pods; the output is currently missing this variable.
Expected Result:
The ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE environment variable and the volume mounts in the workflow pods should be present.
Below is the values.yaml template used to append the environment variable:
template:
  template:
    spec:
      containers:
        - name: runner
          image: ghcr.io/actions/actions-runner:latest
          command: ["/home/runner/run.sh"]
          env:
            - name: ACTIONS_RUNNER_CONTAINER_HOOKS
              value: /home/runner/k8s/index.js
            - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
              value: /home/runner/pod-template/content
            - name: ACTIONS_RUNNER_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
              value: "true"
          volumeMounts:
            - name: work
              mountPath: /home/runner/_work
            - name: pod-template
              mountPath: /home/runner/pod-template
              readOnly: true
      volumes:
        - name: work
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes: [ "ReadWriteOnce" ]
                storageClassName: "local-path"
                resources:
                  requests:
                    storage: 1Gi
        - name: pod-template
          configMap:
            name: hook-extension
Problem: The pods should have the volumes mounted with the config map and the specified environment variables added. However, this is not happening as expected.
Current Output:
Describing the AutoscalingRunnerSet doesn't show the added ENV variables either.
Name:         arc-runner-kubernetes-ci-arm-large
Namespace:    arc-runners-kubernetes-arm
Labels:       actions.github.com/organization=curefit
              actions.github.com/scale-set-name=arc-runner-kubernetes-ci-arm-large
              actions.github.com/scale-set-namespace=arc-runners-kubernetes-arm
              app.kubernetes.io/component=autoscaling-runner-set
              app.kubernetes.io/instance=arc-runner-kubernetes-ci-arm-large
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=arc-runner-kubernetes-ci-arm-large
              app.kubernetes.io/part-of=gha-rs
              app.kubernetes.io/version=0.9.3
              helm.sh/chart=gha-rs-0.9.3
Annotations:  actions.github.com/cleanup-kubernetes-mode-role-binding-name: arc-runner-kubernetes-ci-arm-large-gha-rs-kube-mode
              actions.github.com/cleanup-kubernetes-mode-role-name: arc-runner-kubernetes-ci-arm-large-gha-rs-kube-mode
              actions.github.com/cleanup-kubernetes-mode-service-account-name: arc-runner-kubernetes-ci-arm-large-gha-rs-kube-mode
              actions.github.com/cleanup-manager-role-binding: arc-runner-kubernetes-ci-arm-large-gha-rs-manager
              actions.github.com/cleanup-manager-role-name: arc-runner-kubernetes-ci-arm-large-gha-rs-manager
              actions.github.com/runner-group-name: arc-runner-kubernetes-ci-arm-large
              actions.github.com/runner-scale-set-name: arc-runner-kubernetes-ci-arm-large
              actions.github.com/values-hash: 8b5caae634d958cc7d295b3166c151d036c7896d2b6165bf908a6a24aec5320
              meta.helm.sh/release-name: arc-runner-set-kubernetes-arm-large
              meta.helm.sh/release-namespace: arc-runners-kubernetes-arm
              runner-scale-set-id: 76
API Version:  actions.github.com/v1alpha1
Kind:         AutoscalingRunnerSet
Metadata:
  Creation Timestamp:  2024-07-16T09:49:56Z
  Finalizers:
    autoscalingrunnerset.actions.github.com/finalizer
  Generation:        1
  Resource Version:  577760766
  UID:               165f74f7-875c-4b8f-a214-96948ec38467
Spec:
  Github Config Secret:  github-token
  Github Config URL:     https://github.com/curefit
  Listener Template:
    Spec:
      Containers:
        Name:  listener
        Resources:
          Limits:
            Cpu:     500m
            Memory:  500Mi
          Requests:
            Cpu:     250m
            Memory:  250Mi
      Node Selector:
        Purpose:  github-actions
      Tolerations:
        Effect:    NoSchedule
        Key:       purpose
        Operator:  Equal
        Value:     github-actions
  Min Runners:            2
  Runner Group:           arc-runner-kubernetes-ci-arm-large
  Runner Scale Set Name:  arc-runner-kubernetes-ci-arm-large
  Template:
    Spec:
      Containers:
        Command:
          /home/runner/run.sh
        Env:
          Name:   ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
          Value:  false
          Name:   ACTIONS_RUNNER_CONTAINER_HOOKS
          Value:  /home/runner/k8s/index.js
          Name:   ACTIONS_RUNNER_POD_NAME
          Value From:
            Field Ref:
              Field Path:  metadata.name
        Image:  ghcr.io/actions/actions-runner:latest
        Name:   runner
        Volume Mounts:
          Mount Path:  /home/runner/_work
          Name:        work
      Node Selector:
        Purpose:  github-actions
      Restart Policy:  Never
      Security Context:
        Fs Group:            1001
      Service Account Name:  arc-runner-kubernetes-ci-arm-large-gha-rs-kube-mode
      Tolerations:
        Effect:    NoSchedule
        Key:       purpose
        Operator:  Equal
        Value:     github-actions
      Volumes:
        Ephemeral:
          Volume Claim Template:
            Spec:
              Access Modes:
                ReadWriteOnce
              Resources:
                Requests:
                  Storage:         5Gi
              Storage Class Name:  gp3
        Name:  work
Status:
  Current Runners:            2
  Pending Ephemeral Runners:  2
Events:  <none>
Below is the configmap file which is being used:
apiVersion: v1
kind: ConfigMap
metadata:
  name: hook-extension
  namespace: arc-runners-kubernetes-arm
data:
  content: |
    spec:
      containers:
        - name: "$job"
          resources:
            limits:
              cpu: "3000m"
            requests:
              cpu: "3000m"
Expected behavior: the ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE ENV variable and the volume mounts get added to the pods that come up.
Hey @kanakaraju17 ,
After 2 days of trial and error I managed to get a working scenario with resource limits applied. Funny thing is we were overcomplicating it using the "hook-extensions". All we need to do is add it in the template.spec.containers[0].resources.requests/limits section.
Below is a snippet of the values to pass into Helm (although I am using a HelmRelease with FluxCD, the principle still applies):
values:
  containerMode:
    type: "kubernetes"
    kubernetesModeWorkVolumeClaim:
      accessModes: ["ReadWriteOnce"]
      storageClassName: "standard"
      resources:
        requests:
          storage: 10Gi
  githubConfigSecret: gh-secret
  githubConfigUrl: "https://github.com/<Organisation>"
  runnerGroup: "k8s-nonprod"
  runnerScaleSetName: "self-hosted-k8s" # used as a runner label
  minRunners: 1
  maxRunners: 10
  template:
    spec:
      securityContext:
        fsGroup: 1001
      imagePullSecrets:
        - name: cr-secret
      containers:
        - name: runner
          image: ghcr.io/actions/actions-runner:latest
          command: ["/home/runner/run.sh"]
          resources:
            limits:
              cpu: "2000m"
              memory: "5Gi"
            requests:
              cpu: "200m"
              memory: "512Mi"
I have confirmed that this has been working for me with some CodeQL workflows failing due to "insufficient RAM" lol.
Hope it helps.
@marcomarques-bt, I assume the above configuration works only for the runner pods and not for the pods where the workflow actually runs, i.e. the workflow pods.
Refer to the image below: the configuration applies to the first pod but not to the second pod, where the actual job runs.
It seems that, similar to the issue mentioned earlier, toleration cannot be configured either.
:wave: Hey, thanks for opening this topic.
I have managed to get this going, but we have some large runners and we ran into an issue where, if there are no resources available on the node, the workflow pod fails to schedule...
Error: Error: pod failed to come online with error: Error: Pod lendable-large-x64-linux-dev-h8727-runner-thwgr-workflow is unhealthy with phase status Failed
and it needs to be scheduled on the same node as the runner because of the PVC. This whole thing doesn't make much sense. We want people to be able to specify, for example, a large runner in Kubernetes mode, and in the end we get an idle pod that just tries to spin up a new pod.
@kanakaraju17 thanks for opening this issue. Did you ever find a mechanism to enforce resource limits?
For @cboettig and those following this thread and the interaction between @kanakaraju17 and @georgblumenschein, I have made it work with the following configuration. I am sharing it as JSON since it makes it clearer that the ConfigMap is properly formatted:
ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: hook-extension
data:
  content: |
    {
      "spec": {
        "containers": [
          {
            "name": "$job",
            "resources": {
              "requests": {
                "cpu": "1000m",
                "memory": "1Gi"
              },
              "limits": {
                "cpu": "2000m",
                "memory": "2Gi"
              }
            }
          }
        ]
      }
    }
runner-scaleset values:
template:
  spec:
    containers:
      - name: runner
        env:
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/pod-template/content
        volumeMounts:
          - name: pod-template
            mountPath: /home/runner/pod-template
            readOnly: true
    volumes:
      - name: pod-template
        configMap:
          name: hook-extension
This will add the resource requests and limits only for the workflow pods, without wasting resources on runner pods.
That is all really good, but the moment you set resources on the workflow pod and there is no space on the node hosting the controller pod, you are out of luck... It won't wait for resources to become available, it will just fail. We are in the process of evaluating the option to use the kube scheduler, but that requires changing the PVC to RWX, which is expensive and has its limitations. We are in AWS and have tried EFS and IO2, but neither works well.
GitHub should really implement this properly as it is really handicapped at the moment.
@velkovb you are right, by setting requests on the workflow pod but not on the controller pod, we quickly ran into that issue: the controller pod always has room in the node, but the whole action fails if there's no room for its corresponding workflow pod.
So far we have worked around it by assigning requests to the controller pod and none to the workflow one. That way the workflow pod always has room, and we count on it cannibalizing the resources assigned to the controller pod, since the controller is very lightweight. This is not an ideal solution, but it's the best we can come up with without RWX.
What issues have you experienced with IO2? That was my next alternative to try, so we can use kube scheduler and not worry about controller and workflow pods having to land on the same node.
@sqr
So far we have worked around it by assigning requests on the controller pod, and none on the workflow one. That way the workflow pod always has room, and we count on it cannibalizing the resources assigned to the controller pod, since the controller is very lightweight. This is not an ideal solution but the best we can come up without RWX.
I don't think I get how that works. If you set requests for the controller pod, won't it actually reserve it for that pod and not give it to anything else? I would see it work for CPU but not sure it does for memory?
What issues have you experienced with IO2? That was my next alternative to try, so we can use kube scheduler and not worry about controller and workflow pods having to land on the same node.
Multi-attach works in block mode, and the volumeMounts that the hooks do for the workflow pods involve a lot of path mapping that would be hard to replicate with no guaranteed result; you also still have the AZ restriction, as EBS is zonal. EFS was just slow.
@velkovb
I don't think I get how that works. If you set requests for the controller pod, won't it actually reserve it for that pod and not give it to anything else? I would see it work for CPU but not sure it does for memory?
The requests guarantee that the specified amount of CPU is available at scheduling time, but if the workflow pod needs CPU time and the controller is idle, it will take it from it. This is not the case for memory, which is why I have only set requests for CPU.
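A minimal sketch of that workaround, reusing the runner container from the earlier snippets (values are illustrative, not a verified config): only a CPU request is set on the runner container, and no hook template or workflow-pod resources are configured, so the workflow pod always has room and borrows the CPU reserved by the idle runner.
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        resources:
          requests:
            cpu: "2"  # CPU reserved on the node via the runner container
            # intentionally no memory request here and no hook template,
            # so the workflow pod starts without its own requests/limits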
@velkovb We've migrated to a RWX setup with a NFS CSI storage class to avoid the multi-attach error of RWO - however we're experiencing slowness with workflow pods being provisioned (usually takes 3 minutes per github action job).
I suspect it has something to do with FS slowness (not sure if its provisioning, or just using it in general). Do you have any recommendations?
We've opened a ticket here https://github.com/actions/runner-container-hooks/issues/207
My findings were that the slowness was in the pre-setup function while it is copying the workspace - https://github.com/actions/runner-container-hooks/blob/main/packages/k8s/src/hooks/prepare-job.ts#L184
The first log message I see after the container starts is - https://github.com/actions/runner-container-hooks/blob/main/packages/k8s/src/hooks/prepare-job.ts#L45.
The slowness is not in PVC provisioning as that goes really fast. That workspace seems to be only ~250MB so not sure why it is so slow.
This is also bothering me: how is something so seemingly basic not a standard option out of the box? I'm considering trying the following approaches to ensure the workflow pod will fit (resource-wise) and be scheduled onto the same node:
- Use CPU requests on the controller pod combined with affinity rules, and memory requests on the workflow pod, to ensure it gets scheduled on the same node (a rough sketch follows this list).
- Maintain consistent resource ratios. For example, if your nodes are 8 cores x 32 GB of memory, ensure a 1:4 CPU-to-memory ratio. This means increasing CPU allocations proportionally to memory, even if the CPU isn’t fully needed. For instance, if a workload requires only 1 core and 16 GB of memory, specify 4 cores to match the ratio. While this approach can be wasteful, you could configure Karpenter with appropriate node sizes that align with these ratios.
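A hedged sketch of that first idea, combining the hook-extension ConfigMap pattern shown earlier (memory request on the workflow pod) with a CPU-only request on the runner container; the numbers are illustrative and the affinity rules are omitted, so this is not a verified configuration.
# hook-extension ConfigMap (workflow-pod side): memory request only
data:
  content: |
    spec:
      containers:
        - name: "$job"
          resources:
            requests:
              memory: "8Gi"  # illustrative value
# scale-set values.yaml (runner-pod side): CPU request only
template:
  spec:
    containers:
      - name: runner
        resources:
          requests:
            cpu: "2"  # illustrative value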
To address different workload needs, I’m planning to define multiple runner scale sets with varying sizes, allowing developers to select the one that fits their requirements:
- General Purpose (gp):
  - x-small-gp: 500m, 2 GB memory
  - small-gp: 1 core, 4 GB memory
  - medium-gp: 2 cores, 8 GB memory
- High Memory (hm):
  - x-small-hm: 500m, 4 GB memory
  - small-hm: 1 core, 8 GB memory
  - medium-hm: 2 cores, 16 GB memory
- High CPU (hc):
  - x-small-hc: 500m, 1 GB memory
  - small-hc: 1 core, 2 GB memory
  - medium-hc: 2 cores, 4 GB memory
Each class maps to specific instance types:
- gp: General purpose instances (e.g., m7i.xlarge)
- hm: Memory-optimized instances (e.g., r7i.xlarge)
- hc: Compute-optimized instances (e.g., c7i.xlarge)
Has anyone else approached this problem in a similar way? If so, I’d love to hear any pointers or lessons learned. Also, if anyone sees potential holes in my plan or areas for improvement, please let me know!
@jasonwbarnett We started with a similar, more granular approach to resource ratios, but noticed that it was not followed strictly and the Kubernetes nodes actually had a lot of free resources. Besides what you have in mind, we had a further breakdown for dind and arm runners, so we ended up with a lot of different scale sets (more than 30). In the end we went with more of a T-shirt sizing approach, with just small, medium and large (both x64 and arm64) and options with and without dind. No resource limits are set on the pods, so they can utilize whatever is free on the nodes. We didn't want to separate them onto different node pools, as that would decrease efficiency, and we are running one large spot node pool.
The idea with having CPU on the controller and memory on the workflow pod is a good workaround :)
@velkovb Thanks for sharing your experience! It's interesting to hear that a more granular approach led to inefficiencies due to underutilized resources. I can definitely see how managing over 30 scale sets could become unwieldy. Your T-shirt sizing approach with small, medium, and large options (and differentiating x64/arm64 and dind/non-dind) sounds like a practical way to simplify things while still offering flexibility. Did you map the scale sets to specific Karpenter node pools, or how did you handle that?
I hadn’t considered not setting resource limits on the pods to allow them to utilize free node resources—that’s an intriguing idea. I imagine it works well with your setup of a single large spot node pool, as you can maximize utilization without worrying too much about strict separation.
Thanks for the feedback on the controller CPU and workflow pod memory approach! I’ll experiment further with these ideas and keep the potential for over-complication in mind. If you don’t mind me asking, how do you handle scaling with the T-shirt sizes—do you find it works well with just the large node pool, or are there edge cases where it gets tricky?
@jasonwbarnett Just one large spot node pool, and we have an on-demand one for edge cases of really long-running jobs. We were monitoring our nodes and resource usage was rarely going above 30%, which made us try no resource limits. For workloads we always set memory request = memory limit, but here, due to the short lifetime of a job pod, we believe it won't be a problem. We run roughly 20k jobs a day and so far it seems to be working fine :) In our config we have some warm runners for the most-used types, and we use overprovisioning to keep 1-2 warm nodes (as that is usually the slowest thing).
Closing this issue since it is not related to ARC, and the comment at https://github.com/actions/actions-runner-controller/issues/3641#issuecomment-2574223158 describes the intended way to set requests/limits for workflow pods. Thank you for providing a solution!
@nikola-jokic I would argue that the issue should stay open. There is still no way to set workflow pod resources. This is just a hacky workaround that loads a configuration file and is in no way dynamic. The final solution should allow us to configure the workflow pod in the actual GitHub workflow (or get rid of this workflow-pod nonsense in general :) ).
Hey @velkovb,
I completely agree with you here, there is currently no way to specify the workflow pod requests/limits. However, ARC is not responsible for it; the container hook is. ARC doesn't have any special handling of a runner configured in containerMode kubernetes. It simply provides the out-of-the-box spec that can be used to configure the runner with the container hook.
Can you please help me understand what you mean by getting rid of the workflow pod? The workflow pod is the host container that the hook execs into and runs the commands. That was the initial purpose of Kubernetes mode: to avoid running the workflow container using dind. If you want to run dind, or configure it rootless, you certainly can do that. But the workflow pod is definitely a requirement for k8s mode, so I don't see how we can get rid of it.
To provide more context, to set requests/limits from a workflow, we would have to either extend the workflow syntax to support these cases, or use a subset of docker options and translate them to the workflow spec. The first approach requires touching multiple parts of the system, and should be planned for and prioritized. The second approach is kind of a hack. We only need a subset of options to specify requests and limits. The container hook extension was the quick approach (and definitely not a complete one, I admit that) to limit workflow pods on your cluster, to provide a way to configure a service account for them, etc.
To re-iterate, container mode issues should not be opened in this repo, but rather in the runner-container-hooks repo. Feature requests and feedback should always be submitted on the Community Support Forum. Please submit your feedback there and thank you for participating in these discussions!
I meant that the way Kubernetes mode works for GitHub Actions doesn't feel natural. I would expect that when I specify a container for my job, the job runs directly inside that container; I wouldn't expect a middle man (a controller pod that spins up a workflow pod...). Other CI systems support that, and it makes things a lot simpler.
P.S. Furthermore, comments like "this is not an ARC issue" increase the feeling that, even though GitHub adopted the project, it is not a first-class citizen. Issues are being closed, PRs are not merged; the general impression is not good.
But this is the issue board for ARC; it doesn't make sense to me to keep issues related to other projects here. The container hook is a separate project. The runner is also a separate project. The responsibility of ARC is to spin up runners and scale based on demand. Let's say there is a runner issue: it should be submitted to the runner repo. It doesn't matter that ARC is the one spinning up these runners. If this were a container hook issue, I would transfer it to the container hook repo. But this is an enhancement, so I pointed out where it should be submitted.
I fully back what @velkovb is saying.
It's extremely inconvenient to set up ARC for an organization where each repository has its own CI jobs with a custom podSpec and, most importantly, CPU/memory requirements for each CI job.
Not to mention that CPU/memory requirements can change within a pull request, e.g. when new heavy tests or a big build step are added. What are you going to do with ARC then?
- Ask the k8s DevOps team in the org to add another type of runner (small/medium/etc.)?
- ARC CI configuration breaks GitOps: it's detached from the actual repo that uses it, which violates so many DevOps principles.
Many other CI systems support per-job podSpec for their K8S controllers without issues:
- https://buildkite.com/docs/agent/v3/agent-stack-k8s/podspec
- https://github.com/EmbarkStudios/k8s-buildkite-plugin (I actually implemented this one in 300 lines of Bash with podSpecPatch per Job!)
- https://www.jenkins.io/doc/pipeline/steps/kubernetes/#podtemplate-define-a-podtemplate-to-use-in-the-kubernetes-plugin
- https://tekton.dev/vault/pipelines-main/compute-resources/?utm_source=chatgpt.com#configure-task-level-compute-resources
- https://argo-workflows.readthedocs.io/en/latest/fields/
Just to list a few.
Instead of maintaining "Pools" of runners, their controllers simply schedule K8S pods/jobs per job with binding parameters so that the CI agent only picks up the required job.
I think a per-job podSpec patch must be considered as part of the ARC architecture for a truly native K8s integration.