actions-runner-controller
Runner pods keep looping after scaling by webhook: Listening for Jobs -> Access denied error -> Terminates
Controller Version
0.25.2
Helm Chart Version
0.20.2
CertManager Version
No response
Deployment Method
Helm
cert-manager installation
yes
Checks
- [X] This isn't a question or user support case (for Q&A and community support, go to Discussions; it might also be a good idea to contract with any of the contributors and maintainers if your business is critical enough to need priority support).
- [X] I've read the release notes before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes
- [X] My actions-runner-controller version (v0.x.y) does support the feature
- [X] I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
Resource Definitions
```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: org-runner-2x
  namespace: gh-actions
spec:
  replicas: 2
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      env:
        - name: DISABLE_RUNNER_UPDATE
          value: "true"
      terminationGracePeriodSeconds: 3600
      ephemeral: false
      organization: x
      dockerEnabled: true
      dockerMTU: 1460
      labels:
        - "label-2x"
        - "label-default"
      resources:
        requests:
          cpu: 3
          memory: 8000Mi
        limits:
          cpu: 6
          memory: 8000Mi
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: organization-runner-2x
  namespace: gh-actions
spec:
  maxReplicas: 15
  minReplicas: 2
  scaleDownDelaySecondsAfterScaleOut: 900
  scaleTargetRef:
    name: org-runner-2x
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      duration: "10m"
```
To Reproduce
1. Let the controller scale out some runners by webhook (the HRA's max replicas will increase).
2. The runners will keep restarting until the HRA's max replicas decrease back to the normal value.
Describe the bug
Runners keep restarting after a webhook-based scale-out happens, until the HRA decides to set the desired replica count back to the normal value.
Usually, after the first "Listening for Jobs" long poll, an error is shown: An error occurred: Access denied. System:ServiceIdentity;DDDDDDDD-DDDD-DDDD-DDDD-DDDDDDDDDDDD needs View permissions to perform the action.
After the error, the runner starts to shut down.
If the HRA is deleted and the runner count is set to the same max-replicas value, the runners stop restarting.
Am I missing something? Thanks in advance!
Describe the expected behavior
As far as I understand, the runners should not keep restarting unless a job has finished, or until the "scaleDownDelaySecondsAfterScaleOut" period has passed and the HRA decides to decrease the replica count.
Controller Logs
https://gist.github.com/fernandonogueira/a57ef7731925808b825cac671f017499
Runner Pod Logs
https://gist.github.com/fernandonogueira/891ea9bfbae68b33c07ead0af3bbaa7c
https://gist.github.com/fernandonogueira/e64debbbbc0f2fa18d812bc399fdddaf
Additional Context
There are 2 other organization RunnerDeployments with their HRA configurations. Right now I'm not able to reproduce the same behavior with those 2 other kinds of runners.
@fernandonogueira Hey!
> An error occurred: Access denied. System:ServiceIdentity;DDDDDDDD-DDDD-DDDD-DDDD-DDDDDDDDDDDD needs View permissions to perform the action.
This error comes from https://github.com/actions/runner, which is not part of ARC.
As it did say `√ Connected to GitHub`, I'm sure that ARC correctly registered the runner and passed the runner registration token to it, hence actions/runner was able to register itself successfully. Everything after that happens in GitHub Actions and actions/runner, which ARC has no control over.
You'd need to ask GitHub instead!
Hey, @mumoshu !
Thanks for the quick reply. Ok, then. :(
Are the GitHub App permissions described in the documentation up to date? This error made me wonder whether I missed something.
But the other runner deployments are working fine. Thanks! \0
Edit: Also, just by removing the HRA, the runners stopped shutting down. They have been running for 2 days now without problems.
@fernandonogueira Hey! Thanks for confirming. To be extra sure: HRA doesn't chime in on the runner registration and configuration process at all, so the existence or availability of an HRA should never affect the functionality of the runners. If it has somehow been fixed on the GitHub side, you might now be able to get the HRA working without changing your config?
Hey!!! :))
Update: I found this today in the readme:
> **Important!!!** If you opt to configure autoscaling, ensure you remove the `replicas:` attribute in the `RunnerDeployment` / `RunnerSet` kinds that are configured for autoscaling.
That was the root cause of this. :) Thank you, guys! 🙇🏼
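For anyone landing here with the same symptom, here is a minimal sketch of what the corrected manifest might look like, assuming the fix is simply dropping `replicas:` from the RunnerDeployment above and letting the HRA own the replica count (all field values are taken from the manifests earlier in this issue):

```yaml
# Sketch only: the RunnerDeployment from this issue with spec.replicas removed,
# so the HorizontalRunnerAutoscaler (minReplicas/maxReplicas) solely controls scaling.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: org-runner-2x
  namespace: gh-actions
spec:
  # replicas: 2   <- removed per the README note above; the HRA takes over
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      organization: x
      ephemeral: false
      dockerEnabled: true
      dockerMTU: 1460
      labels:
        - "label-2x"
        - "label-default"
```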
Very hard to debug but this was my issue too. Many thanks.
Worth noting that even though it is no longer in a crash loop, ArgoCD still reports the runner pod as perpetually "progressing".