
Cannot scale from zero with TotalNumberOfQueuedAndInProgressWorkflowRuns metric

Open gionn opened this issue 1 year ago • 5 comments

Checks

  • [X] I've already read https://github.com/actions/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I'm sure my issue is not covered in the troubleshooting guide.
  • [X] I'm not using a custom entrypoint in my runner image

Controller Version

0.27.5

Helm Chart Version

0.23.4

CertManager Version

1.12.1

Deployment Method

Helm

cert-manager installation

Helm install via official charts cert-manager

Checks

  • [X] This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of the contributors and maintainers if your business is so critical that you need priority support)
  • [X] I've read releasenotes before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
  • [X] My actions-runner-controller version (v0.x.y) does support the feature
  • [X] I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
  • [X] I've migrated to the workflow job webhook event (if you are using webhook-driven scaling)

Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  annotations:
  creationTimestamp: "2023-08-18T07:27:45Z"
  generation: 229
  name: acs-deployment
  namespace: default
  resourceVersion: "5726615"
  uid: 5bb3adab-71f2-4e38-980d-e606438f5822
spec:
  effectiveTime: null
  replicas: 1
  selector: null
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      dockerdContainerResources: {}
      dockerdWithinRunnerContainer: true
      image: summerwind/actions-runner-dind:ubuntu-22.04
      repository: Alfresco/acs-deployment
      resources:
        limits:
          cpu: 1750m
          memory: 7Gi
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  annotations:
  creationTimestamp: "2023-08-24T10:27:08Z"
  generation: 13
  name: acs-deployment-autoscaler
  namespace: default
  resourceVersion: "5727811"
  uid: bd8c52a3-7b47-4eff-b75d-0ea820615d60
spec:
  maxReplicas: 20
  metrics:
  - scaleDownAdjustment: 1
    scaleDownThreshold: "0.3"
    scaleUpAdjustment: 5
    scaleUpThreshold: "0.75"
    type: PercentageRunnersBusy
  - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
  minReplicas: 0
  scaleTargetRef:
    kind: RunnerDeployment
    name: acs-deployment
status:
  desiredReplicas: 11
  lastSuccessfulScaleOutTime: "2023-08-29T13:16:04Z"

To Reproduce

1. Wait until no jobs are running so that the runner deployment gets scaled to zero
2. Trigger new workflows

Describe the bug

When zero replicas are active, autoscaling is not triggered because TotalNumberOfQueuedAndInProgressWorkflowRuns always reports zero pending jobs, even though there are queued workflows waiting to be picked up by a runner.

When this happens, the controller log clearly shows that it sees no queued workflows:

2023-08-29T13:11:32Z	DEBUG	horizontalrunnerautoscaler	Suggested desired replicas of 0 by TotalNumberOfQueuedAndInProgressWorkflowRuns	{"workflow_runs_completed": 0, "workflow_runs_in_progress": 0, "workflow_runs_queued": 0, "workflow_runs_unknown": 0, "namespace": "default", "kind": "runnerdeployment", "name": "acs-deployment", "horizontal_runner_autoscaler": "acs-deployment-autoscaler"}

Describe the expected behavior

Within a few minutes, new replicas should start and begin executing the pending workflows.

Whole Controller Logs

https://gist.github.com/gionn/d6abb20e8ce463a2978bc6a549531400

Whole Runner Pod Logs

n/a

Additional Context

No response

gionn avatar Aug 29 '23 13:08 gionn

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

github-actions[bot] avatar Aug 29 '23 13:08 github-actions[bot]

From what I understand, this:

"workflow_runs_completed": 0, "workflow_runs_in_progress": 0, "workflow_runs_queued": 0, "workflow_runs_unknown": 0

probably means that ListRepositoryWorkflowRuns here is not returning anything.
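To illustrate what I mean, the four counters in that log line could come from a tally like the one below. This is a minimal Python sketch, not the controller's Go code: the names `tally_runs` and `suggested_replicas` are hypothetical, and the rule that the suggestion equals queued plus in-progress runs is my assumption based on the metric's name. The only grounded piece is the `status` field that each run carries in the GitHub workflow-runs API response.

```python
from collections import Counter

# Statuses that get their own counter; everything else is lumped together.
KNOWN_STATUSES = ("completed", "in_progress", "queued")


def tally_runs(runs):
    """Count workflow runs per status; unrecognised statuses become 'unknown'."""
    counts = Counter({"completed": 0, "in_progress": 0, "queued": 0, "unknown": 0})
    for run in runs:
        status = run.get("status")
        counts[status if status in KNOWN_STATUSES else "unknown"] += 1
    return dict(counts)


def suggested_replicas(counts):
    # Assumption: the metric asks for one runner per queued or in-progress run.
    return counts["queued"] + counts["in_progress"]
```

With an empty list every counter is zero and the suggestion is 0, which is exactly what the log shows, so the open question is why the list of runs comes back empty.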

I see that those endpoints require repo privileges; I am using a PAT for a user that has them.

I have tried the TotalNumberOfQueuedAndInProgressWorkflowRuns metric both with repositoryNames specified and without it (as above), given that I am using a RunnerDeployment bound to a specific repository.

Any other ideas for debugging this further?

I am afraid that without adding some extra debug logging it's hard to understand what is going on here.

gionn avatar Aug 30 '23 07:08 gionn

@gionn I just looked at the gist/log attachment, and from my point of view it appears that workers are scaling up. We can see the ARC controller attempting to create these new pods, and we see k8s attempting to schedule them on nodes. For instance, in the message "Skipped reconcilation because owner is not synced yet", towards the bottom of that structured log entry, we see this:

"status":{"phase":"Pending","conditions":[{"type":"PodScheduled","status":"False","lastProbeTime":null,"lastTransitionTime":"2023-08-29T13:16:05Z","reason":"Unschedulable","message":"0/3 nodes are available: 3 Insufficient cpu, 3 Insufficient memory. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.."}],"qosClass":"Guaranteed"}}]}
2023-08-29T13:17:02Z	DEBUG	horizontalrunnerautoscaler	Suggested desired replicas of 11 by PercentageRunnersBusy	{"replicas_desired_before": 11, "replicas_desired": 11, "num_runners": 11, "num_runners_registered": 6, "num_runners_busy": 6, "num_terminating_busy": 0, "namespace": "default", "kind": "runnerdeployment", "name": "acs-deployment", "horizontal_runner_autoscaler": "acs-deployment-autoscaler", "enterprise": "", "organization": "", "repository": "Alfresco/acs-deployment"}

We can also see this auto-scaling behavior happening in the HorizontalRunnerAutoscaler definition you've pasted above in this area:

status:
  desiredReplicas: 11
  lastSuccessfulScaleOutTime: "2023-08-29T13:16:04Z"

If you are having trouble with a particular repo's Actions jobs not being executed, I would try making a cURL request to the GitHub API with the PAT you've generated to check whether any jobs are actually showing up as pending.
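A rough Python equivalent of that check is sketched below, using only the standard library. It calls the GitHub REST endpoint for listing a repository's workflow runs and reads `total_count` from the response. The helper names `runs_url` and `count_runs` are made up for this example, and how you supply the token is up to you; treat it as a starting point rather than a verified diagnostic.

```python
import json
import urllib.parse
import urllib.request

API = "https://api.github.com"


def runs_url(repo, status):
    """Build the 'list workflow runs' URL for one status filter."""
    query = urllib.parse.urlencode({"status": status, "per_page": 1})
    return f"{API}/repos/{repo}/actions/runs?{query}"


def count_runs(repo, status, token=None):
    """Return how many runs GitHub reports for the given status (network call)."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        # Send the PAT only when one is provided; public repos work without it.
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(runs_url(repo, status), headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["total_count"]


# Example (performs real HTTP requests, so it is left commented out):
# for status in ("queued", "in_progress"):
#     print(status, count_runs("Alfresco/acs-deployment", status, token="<your PAT>"))
```

If this reports queued runs while the controller logs `"workflow_runs_queued": 0` at the same moment, the difference is more likely in how the controller queries or filters the runs than in the token's permissions.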

Apologies if I've misunderstood what the issue is!

kevholmes avatar Aug 30 '23 16:08 kevholmes

Yeah, when attaching the logs I thought they might cause some confusion, because I had switched from minReplicas: 0 to minReplicas: 1, as in https://gist.github.com/gionn/d6abb20e8ce463a2978bc6a549531400#file-gistfile1-txt-L128 where desired gets raised to 1 because of that.

But if I am not wrong, metrics are reported properly only with PercentageRunnersBusy, which works fine except when desiredReplicas is zero (that's documented and I understand why: there are no busy runners when there are zero runners 😬)
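To make the zero-runner case concrete, here is a simplified Python sketch of a busy-percentage rule using the thresholds from my HorizontalRunnerAutoscaler above. It is an illustration under assumptions, not the controller's actual code: the function name, the exact comparison operators, and the clamping order are all hypothetical.

```python
def percentage_runners_busy(current, busy,
                            up_threshold=0.75, down_threshold=0.3,
                            up_adjustment=5, down_adjustment=1,
                            min_replicas=0, max_replicas=20):
    """Suggest a replica count from the fraction of runners that are busy."""
    if current == 0:
        # With no runners there is no busy fraction to compare against the
        # thresholds, so nothing can trigger a scale-up beyond the minimum.
        return min_replicas
    fraction = busy / current
    if fraction >= up_threshold:
        desired = current + up_adjustment
    elif fraction < down_threshold:
        desired = current - down_adjustment
    else:
        desired = current
    # Keep the suggestion within the configured bounds.
    return max(min_replicas, min(max_replicas, desired))
```

With 11 runners of which 6 are busy, the fraction is about 0.55, which sits between the two thresholds, so the suggestion stays at 11 as in the log line quoted earlier. With zero runners the function can only return the minimum, which is why minReplicas: 0 leaves this metric stuck at zero.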

For this reason I see that I could add TotalNumberOfQueuedAndInProgressWorkflowRuns as a secondary metric, but that metric, when used, always reports zero workflows even when there are jobs pending for a runner, so I get stuck with zero runners forever.

The only way to recover from this is to set minReplicas: 1 so PercentageRunnersBusy can do its job.

I tried playing with the workflow APIs and the results always seem consistent; please note that I am working on a public repo, which doesn't even require a PAT to call that API.

Maybe I can try adding/configuring some request logging in the GitHub API client; if you have some pointers they would be highly appreciated, because I am not at all fluent in Go.

gionn avatar Aug 30 '23 16:08 gionn

I am facing the same issue using the latest version of actions-runner-controller (0.23.7). Changing to minReplicas: 0 does not allow the runners to scale using TotalNumberOfQueuedAndInProgressWorkflowRuns. In the log I see the same message:

2024-03-06T18:54:00Z DEBUG horizontalrunnerautoscaler Suggested desired replicas of 0 by TotalNumberOfQueuedAndInProgressWorkflowRuns {"workflow_runs_completed": 0, "workflow_runs_in_progress": 0, "workflow_runs_queued": 0, "workflow_runs_unknown": 0, "namespace": "devex-infra", "kind": "runnerdeployment", "name": "helm-repo-devex-infra-runners", "horizontal_runner_autoscaler": "helm-repo-devex-infra-runners"}

even though I do have a queued workflow in the repository. Any suggestions would be appreciated.

mor-benhamo avatar Mar 06 '24 18:03 mor-benhamo