actions-runner-controller
Cannot scale from zero with TotalNumberOfQueuedAndInProgressWorkflowRuns metric
Checks
- [X] I've already read https://github.com/actions/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I'm sure my issue is not covered in the troubleshooting guide.
- [X] I'm not using a custom entrypoint in my runner image
Controller Version
0.27.5
Helm Chart Version
0.23.4
CertManager Version
1.12.1
Deployment Method
Helm
cert-manager installation
Helm install via the official cert-manager chart
Checks
- [X] This isn't a question or user support case (for Q&A and community support, go to Discussions; it might also be a good idea to contract with any of the contributors or maintainers if your business is critical and needs priority support)
- [X] I've read the release notes before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes
- [X] My actions-runner-controller version (v0.x.y) does support the feature
- [X] I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
- [X] I've migrated to the workflow job webhook event (if you're using webhook-driven scaling)
Resource Definitions
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  annotations:
  creationTimestamp: "2023-08-18T07:27:45Z"
  generation: 229
  name: acs-deployment
  namespace: default
  resourceVersion: "5726615"
  uid: 5bb3adab-71f2-4e38-980d-e606438f5822
spec:
  effectiveTime: null
  replicas: 1
  selector: null
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      dockerdContainerResources: {}
      dockerdWithinRunnerContainer: true
      image: summerwind/actions-runner-dind:ubuntu-22.04
      repository: Alfresco/acs-deployment
      resources:
        limits:
          cpu: 1750m
          memory: 7Gi
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  annotations:
  creationTimestamp: "2023-08-24T10:27:08Z"
  generation: 13
  name: acs-deployment-autoscaler
  namespace: default
  resourceVersion: "5727811"
  uid: bd8c52a3-7b47-4eff-b75d-0ea820615d60
spec:
  maxReplicas: 20
  metrics:
  - scaleDownAdjustment: 1
    scaleDownThreshold: "0.3"
    scaleUpAdjustment: 5
    scaleUpThreshold: "0.75"
    type: PercentageRunnersBusy
  - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
  minReplicas: 0
  scaleTargetRef:
    kind: RunnerDeployment
    name: acs-deployment
status:
  desiredReplicas: 11
  lastSuccessfulScaleOutTime: "2023-08-29T13:16:04Z"
To Reproduce
1. Wait until no jobs are running, so the runner deployment gets scaled to zero
2. Trigger new workflows (see the sketch below)
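For reference, a minimal workflow to trigger could look like this (a hypothetical sketch; it assumes the runners register with the default self-hosted label):

# Hypothetical minimal workflow to reproduce the issue; assumes the
# ARC runners register with the default "self-hosted" label.
name: arc-scale-from-zero-test
on: workflow_dispatch
jobs:
  test:
    runs-on: self-hosted
    steps:
      - run: echo "picked up by an ARC runner"

Dispatching this while the deployment sits at zero replicas should produce a queued run that the metric is expected to count.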
Describe the bug
When zero replicas are currently active, autoscaling is not triggered because TotalNumberOfQueuedAndInProgressWorkflowRuns always reports zero pending jobs, even though there are queued workflows waiting to be picked up by a runner.
When this happens, the controller clearly shows that no workflows are queued for it:
2023-08-29T13:11:32Z DEBUG horizontalrunnerautoscaler Suggested desired replicas of 0 by TotalNumberOfQueuedAndInProgressWorkflowRuns {"workflow_runs_completed": 0, "workflow_runs_in_progress": 0, "workflow_runs_queued": 0, "workflow_runs_unknown": 0, "namespace": "default", "kind": "runnerdeployment", "name": "acs-deployment", "horizontal_runner_autoscaler": "acs-deployment-autoscaler"}
Describe the expected behavior
Within a few minutes, new replicas should start up and begin executing the pending workflows.
Whole Controller Logs
https://gist.github.com/gionn/d6abb20e8ce463a2978bc6a549531400
Whole Runner Pod Logs
n/a
Additional Context
No response
From what I understand, this:
"workflow_runs_completed": 0, "workflow_runs_in_progress": 0, "workflow_runs_queued": 0, "workflow_runs_unknown": 0
probably means that ListRepositoryWorkflowRuns
here is not returning anything.
I see that those endpoints require repo
privileges; I am using a PAT for a user that has them.
I already tried both specifying repositoryNames
for the TotalNumberOfQueuedAndInProgressWorkflowRuns
metric and omitting it as above, given that I am using a RunnerDeployment bound to a specific repository.
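For reference, the explicit variant would look something like this (a sketch only; it uses the repositoryNames field of the HRA metric, with the repository mirroring the RunnerDeployment's repository field above):

metrics:
# repositoryNames pins the metric to specific repositories; with a
# repository-bound RunnerDeployment it should be inferred automatically.
- type: TotalNumberOfQueuedAndInProgressWorkflowRuns
  repositoryNames:
  - Alfresco/acs-deployment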
Any other ideas on how to debug this further?
I am afraid that without introducing some additional debug logging it's hard to understand what is going on here.
@gionn I just looked at the gist/log attachment, and from my point of view it appears as though workers are scaling up. We can see the ARC controller attempting to create the new pods, and we can see k8s attempting to schedule them on nodes. For instance, within the message Skipped reconcilation because owner is not synced yet
towards the bottom of that structured log entry we see this:
"status":{"phase":"Pending","conditions":[{"type":"PodScheduled","status":"False","lastProbeTime":null,"lastTransitionTime":"2023-08-29T13:16:05Z","reason":"Unschedulable","message":"0/3 nodes are available: 3 Insufficient cpu, 3 Insufficient memory. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.."}],"qosClass":"Guaranteed"}}]}
2023-08-29T13:17:02Z DEBUG horizontalrunnerautoscaler Suggested desired replicas of 11 by PercentageRunnersBusy {"replicas_desired_before": 11, "replicas_desired": 11, "num_runners": 11, "num_runners_registered": 6, "num_runners_busy": 6, "num_terminating_busy": 0, "namespace": "default", "kind": "runnerdeployment", "name": "acs-deployment", "horizontal_runner_autoscaler": "acs-deployment-autoscaler", "enterprise": "", "organization": "", "repository": "Alfresco/acs-deployment"}
We can also see this auto-scaling behavior happening in the HorizontalRunnerAutoscaler
definition you've pasted above in this area:
status:
  desiredReplicas: 11
  lastSuccessfulScaleOutTime: "2023-08-29T13:16:04Z"
If you are having trouble with a particular repo's Actions jobs not being executed, I would try making a cURL request to the GitHub API with the PAT you've generated, to validate that jobs are actually showing up as pending.
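Something along these lines (a sketch; the status=queued filter belongs to the GitHub REST API endpoint for listing workflow runs, and the repository is taken from the resource definitions above):

# Check for queued workflow runs; substitute your PAT.
curl -s -H "Authorization: Bearer <PAT>" \
  "https://api.github.com/repos/Alfresco/acs-deployment/actions/runs?status=queued"

A non-zero total_count in the response would indicate queued runs that the metric should be counting.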
Apologies if I've misunderstood what the issue is!
Yeah, when attaching the logs I was thinking it might cause some confusion, because I had switched from minReplicas: 0
to minReplicas: 1
, as in https://gist.github.com/gionn/d6abb20e8ce463a2978bc6a549531400#file-gistfile1-txt-L128
where the desired count gets raised to 1 because of that.
But if I am not wrong, metrics are reported properly only with PercentageRunnersBusy
, which works fine except when desiredReplicas
is zero (that's documented, and I understand why: there are no busy runners when there are zero runners 😬).
For this reason I added TotalNumberOfQueuedAndInProgressWorkflowRuns
as a secondary metric, but that metric always reports zero workflows even when there are jobs pending for a runner, so I get stuck with zero runners forever.
The only way to recover from this is to set minReplicas: 1
so that PercentageRunnersBusy
can do its job.
I tried playing with the workflow APIs and the results always seem consistent; please note that I am working on a public repo, which doesn't even require a PAT to call that API.
Maybe I can try adding/configuring some request logging in the GitHub API client; if you have some pointers, that would be highly appreciated because I am not at all fluent with Go.
I am facing the same issue using the latest version of actions-runner-controller (0.23.7). Changing to minReplicas: 0 does not allow the runners to scale using TotalNumberOfQueuedAndInProgressWorkflowRuns. In the log I also see the same: 2024-03-06T18:54:00Z DEBUG horizontalrunnerautoscaler Suggested desired replicas of 0 by TotalNumberOfQueuedAndInProgressWorkflowRuns {"workflow_runs_completed": 0, "workflow_runs_in_progress": 0, "workflow_runs_queued": 0, "workflow_runs_unknown": 0, "namespace": "devex-infra", "kind": "runnerdeployment", "name": "helm-repo-devex-infra-runners", "horizontal_runner_autoscaler": "helm-repo-devex-infra-runners"}
while I do have a queued workflow in the repository. Any suggestions would be appreciated.