actions-runner-controller

HRA ignores repository webhook scale-up events (but acknowledges scale-down)

Open • kc-sn opened this issue 3 years ago • 6 comments

Describe the bug

It appears that, under certain conditions, an HRA configured against a repository-scoped RunnerDeployment is not matched on scale-up events, although it is matched on scale-down events.

Checks

  • [x] My actions-runner-controller version (v0.x.y) does support the feature
  • [ ] I'm using an unreleased version of the controller I built from HEAD of the default branch

To Reproduce

Steps to reproduce the behavior:

  1. Create a repo-scoped PAT for an account that is a member of the organization and has admin rights on the target repository (note: the user does not have admin access to the organization).
  2. Configure a webhook on the repository, set up actions-runner-controller, and install the PAT.
  3. Create a RunnerDeployment and an HRA pointing at the repository:
```yaml
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: bar-runner
spec:
  template:
    spec:
      repository: foo/bar
      ## using eks / oidc
      # image: image:tag
      # serviceAccountName: svc-account-name
      # securityContext:
      #   fsGroup: 1000
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: bar-hra
spec:
  scaleTargetRef:
    name: bar-runner
  scaleUpTriggers:
  - githubEvent: {}
  minReplicas: 0
  maxReplicas: 10
```
  4. Trigger a workflow. The webhook server finds the HRA by key but fails to act on it:
```text
Found 1 HRAs by key     {"key": "foo/bar"}
Found 0 HRAs by key     {"key": "foo"}
no repository runner or organizational runner found     {"event": "workflow_job", "hookID": "332463031", "delivery": "37a4e5c0-591d-11ec-8c42-e1ea428feb2f", "workflowJob.status": "queued", "workflowJob.labels": ["self-hosted", "Linux", "x64"], "repository.name": "bar", "repository.owner.login": "foo", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "repository": "foo/bar", "organization": "foo"}
Scale target not found. If this is unexpected, ensure that there is exactly one repository-wide or organizational runner deployment that matches this webhook event {"event": "workflow_job", "hookID": "332463031", "delivery": "37a4e5c0-591d-11ec-8c42-e1ea428feb2f", "workflowJob.status": "queued", "workflowJob.labels": ["self-hosted", "Linux", "x64"], "repository.name": "bar", "repository.owner.login": "foo", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued"}
```
  5. Cancel the workflow. This time the webhook server both finds the HRA and acts on it:
```text
Found 1 HRAs by key     {"key": "foo/bar"}
job scale up target is repository-wide runners  {"event": "workflow_job", "hookID": "332463031", "delivery": "b68560f0-5921-11ec-8d7a-9f951ff061e6", "workflowJob.status": "completed", "workflowJob.labels": [], "repository.name": "bar", "repository.owner.login": "foo", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "completed", "repository": "bar"}
Patching hra for capacityReservations update    {"before": null, "after": null}
scaled bar-hra by -1
```

Expected behavior

I expect the HRA to scale up the RD.

Screenshots

n/a

Environment (please complete the following information):

  • Helm Release - v0.15.1
  • EKS - v1.21

Additional context

This is weird.

kc-sn, Dec 09 '21 19:12

This appears to be related to https://github.com/actions-runner-controller/actions-runner-controller/issues/951

kc-sn, Dec 09 '21 19:12

@kcrawley-supernatural Hey! Thanks for reporting. As you've pointed out, yes, this seems related to #951.

Would you mind sharing your workflow YAML? I'm especially interested in what you wrote under the on field of the job that produced the workflow_job events you saw.

I'm mainly concerned about two things now:

  • "workflowJob.labels": ["self-hosted", "Linux", "x64"] on the action=queued workflow_job event: actions-runner-controller assumes it contains only ["self-hosted"] if you omitted runner.spec.labels (or runnerdeployment.spec.template.spec.labels)
  • "workflowJob.labels": [] on the action=canceled workflow_job event: the same concern as above
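Both observations can be illustrated with a simplified model of how the webhook server picks a scale target. This is a hypothetical Python sketch, not actions-runner-controller's actual Go code: the HRA class, the function names, and the rule that "self-hosted" is implicit while every other job label must be declared on the runners are assumptions drawn from the explanation above. It reproduces both outcomes in the report: the queued event finds the HRA by key and then rejects it on labels, while the completed event, whose label list is empty, matches.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class HRA:
    # Hypothetical stand-in: an HRA name plus the labels declared on its
    # target RunnerDeployment (spec.template.spec.labels).
    name: str
    runner_labels: List[str] = field(default_factory=list)


def labels_match(job_labels: List[str], runner_labels: List[str]) -> bool:
    # Assumption: "self-hosted" is treated as implicit, since every
    # ARC-managed runner is self-hosted. Every other label requested by the
    # job must be declared on the runners (exact, case-sensitive comparison).
    required = [label for label in job_labels if label != "self-hosted"]
    return all(label in runner_labels for label in required)


def find_scale_target(
    hras_by_key: Dict[str, List[HRA]],
    owner: str,
    repo: str,
    job_labels: List[str],
) -> Optional[HRA]:
    # Try repository-wide runners ("owner/repo") before organizational
    # runners ("owner"), mirroring the "Found N HRAs by key" log lines.
    for key in (f"{owner}/{repo}", owner):
        for hra in hras_by_key.get(key, []):
            if labels_match(job_labels, hra.runner_labels):
                return hra
    # No candidate passed the label filter: "Scale target not found".
    return None


hras = {"foo/bar": [HRA("bar-hra")]}  # RunnerDeployment with no labels

# queued event from the logs: the HRA is found by key but filtered out
print(find_scale_target(hras, "foo", "bar", ["self-hosted", "Linux", "x64"]))  # None
# completed event from the logs: empty labels, so the same HRA matches
print(find_scale_target(hras, "foo", "bar", []).name)  # bar-hra
```

In this model, declaring Linux and x64 on the RunnerDeployment is enough to make the queued event match, which is consistent with the fix reported later in the thread.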

mumoshu, Dec 12 '21 04:12

@kcrawley-supernatural Would you mind sharing your workflow definition YAML? I'm mainly interested in what you have under runs-on.

mumoshu, Dec 19 '21 01:12

@mumoshu sorry for ignoring this for so long.

It appears that the workflow definition's runs-on must match the labels your runners are configured with, e.g.:

workflow.yaml

```yaml
jobs:
  task:
    runs-on: [self-hosted, Linux, x64]
```

runnerdeployment.yaml

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: task-runner
spec:
  template:
    spec:
      labels:
      - self-hosted
      - Linux
      - x64
```
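To catch this kind of mismatch before deploying, a conservative pre-flight check can apply the rule from this comment literally: every label a job lists under runs-on should also be declared in the RunnerDeployment's spec.template.spec.labels. The sketch below is hypothetical (the function names are made up, and it takes already-parsed manifests as Python dicts so that it needs no YAML library). It is deliberately stricter than the controller may require, since it also flags a missing self-hosted label.

```python
from typing import Any, Dict, List


def declared_labels(runner_deployment: Dict[str, Any]) -> List[str]:
    # Labels live under spec.template.spec.labels in a RunnerDeployment manifest.
    return (
        runner_deployment.get("spec", {})
        .get("template", {})
        .get("spec", {})
        .get("labels", [])
    )


def missing_labels_by_job(
    workflow: Dict[str, Any], runner_deployment: Dict[str, Any]
) -> Dict[str, List[str]]:
    # Report, per job, every runs-on label the RunnerDeployment does not declare.
    declared = set(declared_labels(runner_deployment))
    report: Dict[str, List[str]] = {}
    for job_name, job in workflow.get("jobs", {}).items():
        runs_on = job.get("runs-on", [])
        # Only the single-string and list forms of runs-on are handled here.
        wanted = [runs_on] if isinstance(runs_on, str) else list(runs_on)
        missing = [label for label in wanted if label not in declared]
        if missing:
            report[job_name] = missing
    return report


# Hand-written dicts mirroring workflow.yaml and runnerdeployment.yaml above.
workflow = {"jobs": {"task": {"runs-on": ["self-hosted", "Linux", "x64"]}}}
runner_deployment = {
    "kind": "RunnerDeployment",
    "metadata": {"name": "task-runner"},
    "spec": {"template": {"spec": {"labels": ["self-hosted", "Linux", "x64"]}}},
}
print(missing_labels_by_job(workflow, runner_deployment))  # {}

# The original bar-runner deployment declared no labels at all.
bar_runner = {"spec": {"template": {"spec": {"repository": "foo/bar"}}}}
print(missing_labels_by_job(workflow, bar_runner))
# {'task': ['self-hosted', 'Linux', 'x64']}
```

An empty report means every requested label is declared; a non-empty one lists exactly the labels to add to the RunnerDeployment.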

kc-sn, Dec 19 '21 14:12

Thank you for that; it resolved my issue.

I had also forgotten to add the webhook at the organization level (https://github.com/organizations/MY_ORG/settings/hooks); for some reason I thought the webhook should be configured in the "Installed App" that was created. After setting up the webhook at the organization level, everything worked like a charm.
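If you prefer to script the organization-level webhook rather than use the settings page, GitHub's REST API has a "create an organization webhook" endpoint (POST /orgs/{org}/hooks). The following Python sketch only builds that request with the standard library and does not send it. The payload URL, token and secret are placeholders, subscribing only to workflow_job events is an assumption about what your webhook server needs, and the token must be allowed to manage organization hooks (for a classic PAT, the admin:org_hook scope).

```python
import json
import urllib.request


def build_org_webhook_request(
    org: str, token: str, payload_url: str, secret: str
) -> urllib.request.Request:
    # Body for POST /orgs/{org}/hooks; "web" is the required hook name.
    body = {
        "name": "web",
        "active": True,
        "events": ["workflow_job"],  # assumption: only workflow_job is needed
        "config": {
            "url": payload_url,  # placeholder: your webhook server's public URL
            "content_type": "json",
            "secret": secret,
        },
    }
    return urllib.request.Request(
        url=f"https://api.github.com/orgs/{org}/hooks",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


req = build_org_webhook_request(
    "MY_ORG", "<token>", "https://webhook.example.com/", "<secret>"
)
print(req.get_method(), req.full_url)  # POST https://api.github.com/orgs/MY_ORG/hooks
# To actually create the hook, send it with urllib.request.urlopen(req).
```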

meirgbinahai, Jan 18 '22 23:01

This might be clear to others, but the key point is that if you use the automatically generated labels [self-hosted, Linux, X64] in the runs-on array, you must also include them in the RunnerDeployment spec; otherwise the webhook server won't be able to find the HorizontalRunnerAutoscaler to scale.

jgreat, Apr 06 '22 16:04