actions-runner-controller
horizontalrunnerautoscaler Detected job with no labels, which is not supported by ARC. Skipping anyway
Checks
- [X] I've already read https://github.com/actions/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I'm sure my issue is not covered in the troubleshooting guide.
- [X] I'm not using a custom entrypoint in my runner image
Controller Version
v0.27.4
Helm Chart Version
0.23.3
CertManager Version
No response
Deployment Method
Helm
cert-manager installation
Yes, cert-manager was installed using:
```shell
helm upgrade --install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.11.0 \
  --set installCRDs=true --wait
```
Checks
- [X] This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contact any of the contributors and maintainers if your business is critical and you therefore need priority support.)
- [X] I've read the release notes before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes
- [X] My actions-runner-controller version (v0.x.y) does support the feature
- [X] I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
- [X] I've migrated to the workflow job webhook event (if you're using webhook-driven scaling)
Resource Definitions
```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: self-hosted-large
  namespace: actions-runner-system
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      serviceAccountName: github-actions-sa
      securityContext:
        # For Ubuntu 20.04 runner
        fsGroup: 1000
      organization: my-org
      image: summerwind/actions-runner-dind:latest
      imagePullPolicy: IfNotPresent
      ephemeral: true
      dockerEnabled: false
      dockerdWithinRunnerContainer: true
      containers:
        - name: runner
          resources:
            requests:
              memory: "10Gi"
              cpu: "3000m"
            limits:
              memory: "10Gi"
              cpu: "3000m"
      labels:
        - large
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  namespace: actions-runner-system
  name: self-hosted-large
spec:
  scaleDownDelaySecondsAfterScaleOut: 10
  scaleTargetRef:
    kind: RunnerDeployment
    name: self-hosted-large
  minReplicas: 0
  maxReplicas: 6
  metrics:
    - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
      repositoryNames:
        - frontend
```
To Reproduce
This happens randomly, and the affected jobs do have labels, in this format:
`runs-on: [self-hosted, large]`
https://github.com/actions/actions-runner-controller/blob/032443fcfd4cf7b6e8bb09ed9dca639bcba9f8a4/controllers/actions.summerwind.net/autoscaling.go#L153
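The linked line is where ARC skips queued jobs whose label list is empty. A minimal sketch of that behavior (simplified and hypothetical, not the actual ARC source; the `job` struct and `shouldCount` helper are illustrative names):

```go
package main

import "fmt"

// job mirrors the fields ARC reads from a workflow job when computing
// desired replicas (simplified; the real controller uses go-github types).
type job struct {
	RunID  int64
	JobID  int64
	Labels []string
}

// shouldCount reports whether a queued job contributes to the
// desired-replica calculation. A job with an empty label list is
// skipped, which matches the "Detected job with no labels" log line.
func shouldCount(j job) bool {
	if len(j.Labels) == 0 {
		fmt.Printf("Detected job with no labels, skipping: run_id=%d job_id=%d\n", j.RunID, j.JobID)
		return false
	}
	return true
}

func main() {
	// The job from the log line below reports an empty labels array,
	// even though the workflow declares runs-on: [self-hosted, large].
	fmt.Println(shouldCount(job{RunID: 5044287443, JobID: 13654547143, Labels: nil}))
	fmt.Println(shouldCount(job{RunID: 1, JobID: 2, Labels: []string{"self-hosted", "large"}}))
}
```

So the question is why the GitHub API sometimes returns an empty `labels` array for a job that clearly declares labels in its `runs-on`.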
### Describe the bug
Randomly, the `horizontalrunnerautoscaler` doesn't update the desired replicas and the job waits indefinitely in github:
```
Requested labels: self-hosted, large
Job defined at: my-org/frontend/.github/workflows/zcommon_web_e2e_tests.yml@refs/heads/master
Reusable workflow chain:
my-org/frontend/.github/workflows/web_scheduled_e2e.yml@refs/heads/master (a9790cfa59ca77ead2f8ec4987a9cac8e98cfcce)
-> my-org/frontend/.github/workflows/zcommon_web_e2e_tests.yml@refs/heads/master (a9790cfa59ca77ead2f8ec4987a9cac8e98cfcce)
Waiting for a runner to pick up this job...
```
The job uses the following label format:
`runs-on: [self-hosted, large]`
Should quoting the labels make any difference to the `horizontalrunnerautoscaler`?
runs-on: [self-hosted, large]
vs
runs-on: ["self-hosted", "large"]
?
Any suggestions on how to debug this further?
### Describe the expected behavior
We shouldn't see this error in the ARC logs; the autoscaler should count the labeled job and scale up so the job is picked up.
### Whole Controller Logs
```shell
2023-05-22T10:02:22Z INFO horizontalrunnerautoscaler Detected job with no labels, which is not supported by ARC. Skipping anyway. {"labels": [], "run_id": 5044287443, "job_id": 13654547143}
```
### Whole Runner Pod Logs
```shell
there are no runner logs available
```
Additional Context
There is no runner in a pending state, and there are available resources on the node(s).
I have the same issue, and I don't understand how to use `TotalNumberOfQueuedAndInProgressWorkflowRuns`.
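For background, the `TotalNumberOfQueuedAndInProgressWorkflowRuns` metric conceptually sums queued and in-progress workflow runs across the repositories listed in `repositoryNames`, then clamps the result between `minReplicas` and `maxReplicas`. A minimal sketch under that assumption (illustrative names, not the actual ARC types):

```go
package main

import "fmt"

// run mirrors the one field the autoscaler needs from each workflow run
// (simplified; the real controller reads runs via the GitHub API).
type run struct {
	Status string // "queued", "in_progress", "completed", ...
}

// desiredReplicas sketches the metric: count queued + in-progress runs
// across all listed repositories, then clamp to [min, max].
func desiredReplicas(runsByRepo map[string][]run, min, max int) int {
	total := 0
	for _, runs := range runsByRepo {
		for _, r := range runs {
			if r.Status == "queued" || r.Status == "in_progress" {
				total++
			}
		}
	}
	if total < min {
		return min
	}
	if total > max {
		return max
	}
	return total
}

func main() {
	// With the HRA above (minReplicas: 0, maxReplicas: 6) and three runs
	// in "frontend", only the queued and in-progress ones are counted.
	runs := map[string][]run{
		"frontend": {{Status: "queued"}, {Status: "in_progress"}, {Status: "completed"}},
	}
	fmt.Println(desiredReplicas(runs, 0, 6))
}
```

Note this metric scales on whole workflow runs, not individual jobs, which is why the webhook-driven `workflow_job` approach is generally recommended for per-job label matching.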