
horizontalrunnerautoscaler Detected job with no labels, which is not supported by ARC. Skipping anyway

Open mattpopa opened this issue 2 years ago • 3 comments

Checks

  • [X] I've already read https://github.com/actions/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I'm sure my issue is not covered in the troubleshooting guide.
  • [X] I'm not using a custom entrypoint in my runner image

Controller Version

v0.27.4

Helm Chart Version

0.23.3

CertManager Version

No response

Deployment Method

Helm

cert-manager installation

Yes, cert-manager was installed using:

helm upgrade --install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--version v1.11.0 \
--set installCRDs=true --wait

Checks

  • [X] This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of contributors and maintainers if your business is so critical and therefore you need priority support)
  • [X] I've read releasenotes before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
  • [X] My actions-runner-controller version (v0.x.y) does support the feature
  • [X] I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
  • [X] I've migrated to the workflow job webhook event (if you're using webhook-driven scaling)

Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: self-hosted-large
  namespace: actions-runner-system
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      serviceAccountName: github-actions-sa
      securityContext:
        # For Ubuntu 20.04 runner
        fsGroup: 1000
      organization: my-org
      image: summerwind/actions-runner-dind:latest
      imagePullPolicy: IfNotPresent
      ephemeral: true
      dockerEnabled: false
      dockerdWithinRunnerContainer: true
      containers:
      - name: runner
        resources:
          requests:
            memory: "10Gi"
            cpu: "3000m"
          limits:
            memory: "10Gi"
            cpu: "3000m"
      labels:
        - large
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  namespace: actions-runner-system
  name: self-hosted-large
spec:
  scaleDownDelaySecondsAfterScaleOut: 10
  scaleTargetRef:
    kind: RunnerDeployment
    name: self-hosted-large
  minReplicas: 0
  maxReplicas: 6
  metrics:
    - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
      repositoryNames:
        - frontend

To Reproduce

this happens randomly, and the jobs have labels using this format:

runs-on: [self-hosted, large]

https://github.com/actions/actions-runner-controller/blob/032443fcfd4cf7b6e8bb09ed9dca639bcba9f8a4/controllers/actions.summerwind.net/autoscaling.go#L153
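For readers who don't want to open the linked file, here is a rough Python sketch of what the label check around that line appears to do when the `TotalNumberOfQueuedAndInProgressWorkflowRuns` metric is evaluated. It is a simplified paraphrase, not the actual ARC code (which is Go): the function name `count_matching_jobs` and the dict layout are made up for illustration, and the real controller does more work around this.

```python
from typing import Dict, Iterable, List


def count_matching_jobs(jobs: Iterable[dict], runner_labels: List[str]) -> Dict[str, int]:
    """Count queued / in-progress jobs that a runner set with `runner_labels` could pick up.

    Hypothetical helper: mirrors the *apparent* behavior of the linked ARC code,
    not its real structure or names.
    """
    runner = set(runner_labels)
    counts = {"queued": 0, "in_progress": 0, "skipped_no_labels": 0}
    for job in jobs:
        labels = job.get("labels") or []
        if not labels:
            # This is the case that produces the
            # "Detected job with no labels" log line: the job is not counted.
            counts["skipped_no_labels"] += 1
            continue
        # "self-hosted" is treated as implicit; every other requested label
        # has to be present on the runner, otherwise the job is ignored.
        if any(label != "self-hosted" and label not in runner for label in labels):
            continue
        status = job.get("status")
        if status in ("queued", "in_progress"):
            counts[status] += 1
    return counts


# Example: one job with the expected labels, one job reported with an empty label list.
example_jobs = [
    {"id": 1, "status": "queued", "labels": ["self-hosted", "large"]},
    {"id": 13654547143, "status": "queued", "labels": []},
]
print(count_matching_jobs(example_jobs, ["large"]))
```

If this reading is right, a job that comes back with an empty label list never contributes to the count, so with `minReplicas: 0` nothing would be scaled up for it.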



Describe the bug

Randomly, the `horizontalrunnerautoscaler` doesn't update the desired replicas, and the job waits indefinitely in GitHub:

Requested labels: self-hosted, large
Job defined at: my-org/frontend/.github/workflows/zcommon_web_e2e_tests.yml@refs/heads/master
Reusable workflow chain:
my-org/frontend/.github/workflows/web_scheduled_e2e.yml@refs/heads/master (a9790cfa59ca77ead2f8ec4987a9cac8e98cfcce)
-> my-org/frontend/.github/workflows/zcommon_web_e2e_tests.yml@refs/heads/master (a9790cfa59ca77ead2f8ec4987a9cac8e98cfcce)
Waiting for a runner to pick up this job...

and the job uses the following label format

runs-on: [self-hosted, large]


Should it make any difference to the `horizontalrunnerautoscaler` whether the labels are written with or without quotes?

runs-on: [self-hosted, large]


vs

runs-on: ["self-hosted", "large"]

?

any suggestion on how to further debug this?
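One way to narrow this down might be to ask the GitHub REST API what labels it reports for the jobs of the affected run (the "List jobs for a workflow run" endpoint), using the `run_id` from the controller log. Below is a minimal sketch, assuming a token with read access to the repository; the helper names `fetch_jobs` and `summarize_labels` are hypothetical, and only the first page of up to 100 jobs is fetched.

```python
import json
import urllib.request

API = "https://api.github.com"


def fetch_jobs(owner: str, repo: str, run_id: int, token: str) -> dict:
    """Fetch the first page of jobs for one workflow run from the GitHub REST API."""
    url = f"{API}/repos/{owner}/{repo}/actions/runs/{run_id}/jobs?per_page=100"
    req = urllib.request.Request(
        url,
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def summarize_labels(payload: dict) -> list:
    """Return (job id, status, labels) for every job in an API response payload."""
    return [
        (job["id"], job.get("status"), job.get("labels") or [])
        for job in payload.get("jobs", [])
    ]


# Usage (not executed here; needs network access and a real token):
#   payload = fetch_jobs("my-org", "frontend", 5044287443, token)
#   for job_id, status, labels in summarize_labels(payload):
#       print(job_id, status, labels)
```

If a job shows an empty `labels` list here as well, the empty list is coming from the API response rather than from the `runs-on` syntax in the workflow file.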



Describe the expected behavior

we shouldn't see this error in the ARC logs


Whole Controller Logs

```shell
2023-05-22T10:02:22Z	INFO	horizontalrunnerautoscaler	Detected job with no labels, which is not supported by ARC. Skipping anyway.	{"labels": [], "run_id": 5044287443, "job_id": 13654547143}
```

Whole Runner Pod Logs

```shell
there are no runner logs available
```

Additional Context

There is no runner in pending state, and there are available resources on the node(s).

mattpopa, May 22 '23 13:05

I have the same issue and I don't understand how to use TotalNumberOfQueuedAndInProgressWorkflowRuns

rtsisyk, Nov 16 '23 14:11