
Runner pods keep looping after scaling by webhook: Listening to jobs -> Access denied error -> Terminates

Open fernandonogueira opened this issue 2 years ago • 3 comments

Controller Version

0.25.2

Helm Chart Version

0.20.2

CertManager Version

No response

Deployment Method

Helm

cert-manager installation

yes

Checks

  • [X] This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of the contributors and maintainers if your business is critical enough that you need priority support.)
  • [X] I've read the release notes before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
  • [X] My actions-runner-controller version (v0.x.y) does support the feature
  • [X] I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue

Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: org-runner-2x
  namespace: gh-actions
spec:
  replicas: 2
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      env:
        - name: DISABLE_RUNNER_UPDATE
          value: "true"
      terminationGracePeriodSeconds: 3600
      ephemeral: false
      organization: x
      dockerEnabled: true
      dockerMTU: 1460
      labels:
        - "label-2x"
        - "label-default"
      resources:
        requests:
          cpu: 3
          memory: 8000Mi
        limits:
          cpu: 6
          memory: 8000Mi
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: organization-runner-2x
  namespace: gh-actions
spec:
  maxReplicas: 15
  minReplicas: 2
  scaleDownDelaySecondsAfterScaleOut: 900
  scaleTargetRef:
    name: org-runner-2x
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      duration: "10m"

To Reproduce

1. Let the controller scale out some runners via webhook (the HRA's replica count will increase); a hypothetical workflow that triggers this is sketched below.
2. The runners keep restarting until the HRA's replica count decreases back to its normal value.
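
For context on step 1: with the workflowJob scaleUpTrigger in the definitions above, a scale-out is typically triggered when a workflow_job webhook arrives whose runs-on labels match the RunnerDeployment's labels. A minimal, hypothetical workflow that would do so (the workflow and job names are illustrative, not from this report):

# Hypothetical workflow; the runs-on labels must match the RunnerDeployment labels ("label-2x").
name: build
on: [push]
jobs:
  build:
    runs-on: [self-hosted, label-2x]
    steps:
      - uses: actions/checkout@v3
      - run: echo "Running on an ARC-managed runner"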

Describe the bug

Runners keep restarting after a webhook-based scale-out until the HRA sets the desired replica count back to its normal value.

Usually, after the first "Listening for Jobs" long poll, an error is shown: An error occurred: Access denied. System:ServiceIdentity;DDDDDDDD-DDDD-DDDD-DDDD-DDDDDDDDDDDD needs View permissions to perform the action. After the error, the runner starts to shut down.

If the HRA is deleted and the runner count is set to the same value as maxReplicas, the runners stop restarting.

Am I missing something? Thanks in advance!

Describe the expected behavior

As I understand it, the runners should not keep restarting unless a job has finished, or the scaleDownDelaySecondsAfterScaleOut period has passed and the HRA decides to decrease the replica count.

Controller Logs

https://gist.github.com/fernandonogueira/a57ef7731925808b825cac671f017499

Runner Pod Logs

https://gist.github.com/fernandonogueira/891ea9bfbae68b33c07ead0af3bbaa7c
https://gist.github.com/fernandonogueira/e64debbbbc0f2fa18d812bc399fdddaf

Additional Context

There are 2 other organization RunnerDeployments with their own HRA configurations. Right now I'm not able to reproduce the same behavior with those other runners.

fernandonogueira avatar Jul 19 '22 22:07 fernandonogueira

@fernandonogueira Hey!

An error occurred: Access denied. System:ServiceIdentity;DDDDDDDD-DDDD-DDDD-DDDD-DDDDDDDDDDDD needs View permissions to perform the action.

This error comes from https://github.com/actions/runner, which is not part of ARC. As it did say √ Connected to GitHub, I'm sure that ARC correctly registered the runner and passed the runner registration token to the runner, hence actions/runner was able to register itself successfully. Everything after that happens in GitHub Actions and actions/runner, which ARC has no control over. You'd need to ask GitHub instead!

mumoshu avatar Jul 20 '22 05:07 mumoshu

Hey, @mumoshu !

Thanks for the quick reply. Ok, then. :(

Are the required GitHub App permissions described in the documentation up to date? This error made me wonder if I missed something.

But the other runner deployments are working fine. Thanks!

Edit: Also, just by removing the HRA, the runners stopped shutting down. They have been running for 2 days now without problems.

fernandonogueira avatar Jul 21 '22 20:07 fernandonogueira

@fernandonogueira Hey! Thanks for confirming. To be clear, the HRA doesn't take part in the runner registration and configuration process at all, so the existence and availability of an HRA should never affect the functionality of the runners. If it has somehow been fixed on GitHub's side, you might now be able to get the HRA working without changing your config?

mumoshu avatar Jul 25 '22 07:07 mumoshu

Hey!!! :))

Update: I found this today in the readme:

Important!!! If you opt to configure autoscaling, ensure you remove the replicas: attribute in the RunnerDeployment / RunnerSet kinds that are configured for autoscaling

That was the root cause of this. :) Thank you, guys! 🙇🏼

fernandonogueira avatar Aug 18 '22 04:08 fernandonogueira
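
Per the README note above, the fix is to drop the replicas: field from the RunnerDeployment and let the HorizontalRunnerAutoscaler own the replica count. A minimal sketch of the corrected manifests, reusing the values from the Resource Definitions above:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: org-runner-2x
  namespace: gh-actions
spec:
  # No replicas: here -- the HorizontalRunnerAutoscaler below manages scaling.
  template:
    spec:
      organization: x
      ephemeral: false
      dockerEnabled: true
      labels:
        - "label-2x"
        - "label-default"
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: organization-runner-2x
  namespace: gh-actions
spec:
  minReplicas: 2
  maxReplicas: 15
  scaleDownDelaySecondsAfterScaleOut: 900
  scaleTargetRef:
    name: org-runner-2x
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      duration: "10m"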

> Hey!!! :))
>
> Update: I found this today in the readme:
>
> Important!!! If you opt to configure autoscaling, ensure you remove the replicas: attribute in the RunnerDeployment / RunnerSet kinds that are configured for autoscaling
>
> That was the root cause of this. :) Thank you, guys! 🙇🏼

Very hard to debug but this was my issue too. Many thanks.

Worth noting that even though it is no longer in a crash loop, ArgoCD still reports the runner pod as perpetually "progressing".

JossWhittle avatar Apr 27 '23 12:04 JossWhittle