
RunnerSet Cannot Deploy to Different Namespaces in the same Cluster

Open rxa313 opened this issue 4 years ago • 24 comments

Unlike RunnerDeployments, multiple RunnerSets cannot be deployed to the same cluster, even when they are in different namespaces.

I was previously able to deploy multiple RunnerDeployments to the same cluster, each with its own controller, and had no issues with shared values (different Helm release names - see 782). Now that I'm deploying RunnerSets, I'm seeing an issue where the API call appears to be shared from a separate namespace, even with watchNamespace set to the controller's namespace.

Currently running Helm chart version 0.13.2 and actions-runner-controller version 0.20.2.

To reproduce: create a namespace, deploy a controller with a unique release name, and deploy a RunnerSet to that controller. Then create another namespace and deploy a controller with a unique release name AND either a unique GitHub App installation ID or (in my case) a PAT for GitHub Enterprise Cloud/GitHub Enterprise Server. In my case I have two RunnerSets for GHES for two separate organizations using the same PAT; when I try to deploy to our organization in GHEC and use a new PAT for that instance, I see the error:

create Pod xx-ghec-runnerset-0 in StatefulSet xx-ghec-runnerset failed error: admission webhook "mutate-runner-pod.webhook.actions.summerwind.dev" denied the request: failed to create registration token: POST https://github.xxx.com/api/v3/orgs/xxx/actions/runners/registration-token: 404 Not Found []

The API call above should point to github.com, but instead it points to my GHES URL, even though I did not specify any GitHub Enterprise Server URL in my values.yaml:

env:
  GITHUB_ENTERPRISE_URL: ""

I expect to be able to deploy RunnerSets to different namespaces within the same cluster, the same way we can with RunnerDeployments.

I'm moving to RunnerSets for now because they accept dnsPolicy: Default, which solves timeout errors we were seeing when deploying to our Kubernetes cluster. I would like to go back to RunnerDeployments if the dnsPolicy behavior becomes available there, since we want to use autoscaling.

If this functionality will not be supported for RunnerSets and you recommend using only RunnerDeployments, please let me know. Thanks!

rxa313 avatar Oct 07 '21 18:10 rxa313

@rxa313 Thanks for reporting! I had never thought about this, but my theory is that it happens because the admission webhook servers from each unique installation of actions-runner-controller interfere with each other.

We use the admission webhook server to inject GitHub tokens into runner pods managed by the StatefulSet that a RunnerSet creates. RunnerDeployment doesn't need an admission webhook server to inject tokens into runner pods because we have full control over how we create the pods in that case. That's the difference.

We probably need to somehow tweak our admission webhook server code:

https://github.com/actions-runner-controller/actions-runner-controller/blob/8657a34f3275f72f67b4c21ce6482e444d087c55/main.go#L255-L259

Or tweak our webhook config so that each admission webhook is involved only when the target pod is in the desired namespace.

https://github.com/actions-runner-controller/actions-runner-controller/blob/master/charts/actions-runner-controller/templates/webhook_configs.yaml

mumoshu avatar Oct 07 '21 23:10 mumoshu

@rxa313 I have some good news. There's a namespaceSelector field in the admission webhook configuration that lets you restrict a webhook to be involved only in certain namespaces.

https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#matching-requests-namespaceselector

Below is the config for the webhook that intercepts pods managed by RunnerSets:

https://github.com/actions-runner-controller/actions-runner-controller/blob/5805e39e1fb9d5b433352bfe7c792782663b7500/charts/actions-runner-controller/templates/webhook_configs.yaml#L71-L92

Given all that, let's say you have two releases, one and two, where one has watchNamespace=foo and two has watchNamespace=bar.

You label the namespace foo with actions-runner-controller/id: "one" and the namespace bar with actions-runner-controller/id: "two":

apiVersion: v1
kind: Namespace
metadata:
  name: foo
  labels:
    actions-runner-controller/id: "one"
---
apiVersion: v1
kind: Namespace
metadata:
  name: bar
  labels:
    actions-runner-controller/id: "two"

In one's webhook config you set namespaceSelector like:

  name: mutate-runner-pod.webhook.actions.summerwind.dev
  namespaceSelector:
    matchExpressions:
    - key: actions-runner-controller/id
      operator: In
      values: ["one"]
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    resources:
    - pods

In two's webhook config you set it like:

  name: mutate-runner-pod.webhook.actions.summerwind.dev
  namespaceSelector:
    matchExpressions:
    - key: actions-runner-controller/id
      operator: In
      values: ["two"]
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    resources:
    - pods

This way, one's webhook mutates RunnerSet pods in foo only, while two's webhook mutates pods in bar only.
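
If the namespaces already exist, you could also just label them with kubectl; a quick sketch using the example names above:

kubectl label namespace foo actions-runner-controller/id=one
kubectl label namespace bar actions-runner-controller/id=two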

mumoshu avatar Oct 10 '21 00:10 mumoshu

@mumoshu

Thanks for the information - this is very interesting.

I want to confirm that this worked for me - it's a great feature for those of us who have both GHES and GHEC. I now have a namespace with a RunnerSet for GHEC and another namespace with RunnerSets for my GHES organizations and no conflicts so far.

I'm also just curious: is autoscaling not a feature of RunnerSet, or is it available as well? I tried implementing it but didn't see any autoscaling with the RunnerSet.

rxa313 avatar Oct 11 '21 12:10 rxa313

@rxa313 Thanks for your confirmation! Glad to hear it worked for you ☺️ And thanks for sharing your experience; it will help everyone who tries the same setup in the future.

I'm also just curious: is autoscaling not a feature of RunnerSet, or is it available as well? I tried implementing it but didn't see any autoscaling with the RunnerSet.

How did you configure scale up triggers?

I may have missed documenting it, but RunnerSet supports autoscaling via the new workflow_job webhook events only. Please see https://github.com/actions-runner-controller/actions-runner-controller#example-1-scale-on-each-workflow_job-event for how to set up workflow_job based autoscaling.

Do you need support for events other than workflow_job for RunnerSet? If so, may I ask why?

Also, which runner kinds are you using: repository, organizational, or enterprise runners?

I recently realized that we had no implementation for enterprise runner autoscaling at all. It has since been implemented in #906 and verified by @roshvin at https://github.com/actions-runner-controller/actions-runner-controller/issues/892#issuecomment-950136634.

So if you're trying to autoscale enterprise runners, it would be super helpful if you could give it a try by making a custom build of actions-runner-controller and deploying it onto your cluster. The process described in #908 would be helpful for that.

mumoshu avatar Oct 24 '21 04:10 mumoshu

Hi @mumoshu

Thanks for the info. We're not configuring autoscaling at the enterprise level, only at our organization level at this time.

When setting up the workflow_job webhook events and the webhook configuration, is there some payload_url that is supposed to be reachable and set up in my GitHub instance?

I configured the GitHub webhook server install in my values.yaml file, but only changed enabled: true:

githubWebhookServer:
  enabled: true
  replicaCount: 1
  syncPeriod: 1m
  secret:
    create: true
    name: "github-webhook-server"
....

Is there further configuration needed to set up the webhook? In my RunnerSet YAML file I configured this as well:

---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
   name: dev-runner-auto-scaler-deployment
spec:
  minReplicas: 1
  maxReplicas: 20
  scaleTargetRef:
    name: dev-corp-org-runner-deployment
  scaleUpTriggers:
  - githubEvent: {}
    duration: "3m"

I applied the changes and can see the workload actions-runner-controller-test-github-webhook-server is deployed in Rancher, but when I run a workflow with, say, 4 jobs in the queue, nothing scales.

Appreciate your help.

Update 2: I removed all RunnerSets and HRA configurations and deployed a RunnerDeployment to my namespace. I notice the same thing happening here as well: nothing is autoscaling even after the webhook server is installed. I'm sure something is not being connected properly.

Also, my controller logs don't look like the logs I've seen in this issue. I don't see any "event:" or "workflow_job" entries in my manager logs.

rxa313 avatar Oct 25 '21 17:10 rxa313

@rxa313 Hey! I guess you missed exposing the webhook autoscaler's HTTP server via a load balancer or an ingress, and/or registering the webhook server's URL in the GitHub UI.

mumoshu avatar Oct 25 '21 23:10 mumoshu

@mumoshu You're right, that is what I missed.

It might be worth documenting that in the values.yaml file you must set enabled: true and, under service, type: NodePort and nodePort: xxx:

githubWebhookServer:
  enabled: true
  replicaCount: 1
  syncPeriod: 1m
  secret:
    create: true
    name: "github-webhook-server"
    ### GitHub Webhook Configuration
    #github_webhook_secret_token: ""
  imagePullSecrets: []
  nameOverride: ""
  fullnameOverride: ""
  serviceAccount:
    # Specifies whether a service account should be created
    create: true
    # Annotations to add to the service account
    annotations: {}
    # The name of the service account to use.
    # If not set and create is true, a name is generated using the fullname template
    name: ""
  podAnnotations: {}
  podLabels: {}
  podSecurityContext: {}
  # fsGroup: 2000
  securityContext: {}
  resources: {}
  nodeSelector: {}
  tolerations: []
  affinity: {}
  priorityClassName: ""
  service:
    type: NodePort
    ports:
      - port: 80
        targetPort: http
        protocol: TCP
        name: http
        nodePort: xxx
  ingress:
    enabled: false
    annotations:
      {}
      # kubernetes.io/ingress.class: nginx
      # kubernetes.io/tls-acme: "true"
    hosts:
      - host: chart-example.local
        paths: []
    tls: []
    #  - secretName: chart-example-tls
    #    hosts:
    #      - chart-example.local

Now that this is connected and I can send a ping to the server with a proper status, I'm getting a 200 response, which is good, but unfortunately the body says: no horizontalrunnerautoscaler to scale for this github event

I'm seeing debug output in my webhook server similar to This Comment:

2021-10-26T15:06:04.823Z	DEBUG	controllers.Runner	Found 0 HRAs by key	{"key": "xxx"}
2021-10-26T15:06:04.823Z	DEBUG	controllers.Runner	Found 0 HRAs by key	{"key": "xxx"}
2021-10-26T15:06:04.823Z	DEBUG	controllers.Runner	no repository runner or organizational runner found	{"event": "workflow_job", "hookID": "xxx", "delivery": "xxxx", "workflowJob.status": "completed", "workflowJob.labels": [], "repository.name": "xxx", "repository.owner.login": "xxx", "repository.owner.type": "Organization", "action": "completed", "repository": "xxx/xxx", "organization": "xx-CORP_IT"}

I notice this log states:

Scale target not found. If this is unexpected, ensure that there is exactly one repository-wide or organizational runner deployment that matches this webhook event

I was under the impression that RunnerSet is supported here, not just RunnerDeployment?

My runnerset.yaml file:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: dev-example-runnerset
spec:
  # MANDATORY because it is based on StatefulSet: Results in a below error when omitted:
  #   missing required field "selector" in dev.summerwind.actions.v1alpha1.RunnerSet.spec
  selector:
    matchLabels:
      app: dev-example-runnerset

  # MANDATORY because it is based on StatefulSet: Results in a below error when omitted:
  # missing required field "serviceName" in dev.summerwind.actions.v1alpha1.RunnerSet.spec]
  serviceName: dev-example-runnerset

  #replicas:

  # From my limited testing, `ephemeral: true` is more reliable.
  # Sometimes, updating already deployed runners from `ephemeral: false` to `ephemeral: true` seems to
  # result in queued jobs hanging forever.
  ephemeral: true

  organization: XX-CORP-IT
  #
  # Custom runner image
  #
  image: xxx/summerwind/xx-runner_test:5
  #
  # dockerd within runner container
  #
  ## Replace `mumoshu/actions-runner-dind:dev` with your dind image
  #dockerdWithinRunnerContainer: true
  #
  # Set the MTU used by dockerd-managed network interfaces (including docker-build-ubuntu)
  #
  dockerMTU: 1450
  
  #Runner group
  # labels:
  # - "mylabel 1"
  # - "mylabel 2"
  labels:
  - rancher-runner-set
  
  #group: RunnerSetGroupTest
  #
  # Non-standard working directory
  #
  # workDir: "/"
  template:
    metadata:
      labels:
        app: dev-example-runnerset
    spec:
      dnsPolicy: Default
      containers:
      - name: runner
        imagePullPolicy: IfNotPresent
        env:
        - name: NODE_EXTRA_CA_CERTS
          value: /usr/local/share/ca-certificates/xxx.crt
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
   name: dev-example-auto-scaler-runnerset
spec:
  minReplicas: 1
  maxReplicas: 8
  scaleTargetRef:
    name: dev-example-runnerset
  scaleUpTriggers:
  - githubEvent: {}
    duration: "3m"

HRA description:

❯ kubectl describe HorizontalRunnerAutoscaler dev-example-auto-scaler-runnerset -n actions-runner-system
Name:         dev-example-auto-scaler-runnerset
Namespace:    actions-runner-system
Labels:       <none>
Annotations:  <none>
API Version:  actions.summerwind.dev/v1alpha1
Kind:         HorizontalRunnerAutoscaler
Metadata:
  Creation Timestamp:  2021-10-26T12:25:03Z
  Generation:          4
  Managed Fields:
    API Version:  actions.summerwind.dev/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:maxReplicas:
        f:minReplicas:
        f:scaleTargetRef:
          .:
          f:name:
        f:scaleUpTriggers:
    Manager:         kubectl-client-side-apply
    Operation:       Update
    Time:            2021-10-26T14:56:39Z
  Resource Version:  681894630
  Self Link:         /apis/actions.summerwind.dev/v1alpha1/namespaces/actions-runner-system/horizontalrunnerautoscalers/dev-example-auto-scaler-runnerset
  UID:               xxx
Spec:
  Max Replicas:  8
  Min Replicas:  1
  Scale Target Ref:
    Name:  dev-example-runnerset
  Scale Up Triggers:
    Duration:  3m
    Github Event:
Events:  <none>

UPDATE @mumoshu ~~I've deployed a RunnerDeployment and removed the HRA for the RunnerSet and I can see that the RunnerDeployment scales based on the workflow_job webhook just fine. I triggered 4 jobs and 4 additional runners automatically spun up. I believe this confirms that RunnerSet might not support HRA based on github webhook, unless you can see that there is something missing from my above configurations.~~

rxa313 avatar Oct 26 '21 15:10 rxa313

UPDATE 2 @mumoshu

Once again, apologies for the multiple comments on this thread. I've done some more digging and, my bad, HRA does indeed support RunnerSet. I found this while digging through some issues/merges and came upon this: merge

I was missing a critical step in the HRA configuration: you must set kind: RunnerSet in scaleTargetRef <- this is critical:

---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
   name: dev-example-auto-scaler-runnerset
spec:
  minReplicas: 1
  maxReplicas: 8
  scaleTargetRef:
    kind: RunnerSet
    name: dev-example-runnerset
  scaleUpTriggers:
  - githubEvent: {}
    duration: "3m"

One question for you: how can we scale down more efficiently? I have duration: 3m set, but the runners still exist after 3m have passed.

rxa313 avatar Oct 26 '21 17:10 rxa313

@rxa313 Thanks for testing and sharing your experience!! Regarding your last question, duration: 3m is a no-op here, because in the case of workflow_job based autoscaling, scale-down happens without the delay controlled by that setting.

Perhaps you'd need to configure HRA.spec.scaleDownDelaySecondsAfterScaleUp to a very small value of your choice.

The default value for it is 10 minutes (https://github.com/actions-runner-controller/actions-runner-controller/blob/98da4c2adb64e253197129e701a61d7faa3427d6/controllers/horizontalrunnerautoscaler_controller.go#L496-L500), which basically means that the scale-down on a workflow_job event of completed is deferred until 10 minutes after the corresponding scale-up triggered earlier by a workflow_job event of queued.

Make it as small as possible so that scale-down happens as early as possible. But I haven't tried lowering it too much myself, as it exists to prevent scale up/down flapping. I'd appreciate your test report!

mumoshu avatar Oct 27 '21 00:10 mumoshu

@mumoshu

---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
   name: dev-example-auto-scaler-runnerset
spec:
  scaleDownDelaySecondsAfterScaleUp: 1m
  minReplicas: 1
  maxReplicas: 5
  scaleTargetRef:
    kind: RunnerSet
    name: dev-example-runnerset
  scaleUpTriggers:
  - githubEvent: {}
    duration: 3m
error validating data: ValidationError(HorizontalRunnerAutoscaler.spec): unknown field "scaleDownDelaySecondsAfterScaleUp" in dev.summerwind.actions.v1alpha1.HorizontalRunnerAutoscaler.spec; if you choose to ignore these errors, turn validation off with --validate=false

What am I doing wrong here?

rxa313 avatar Oct 27 '21 13:10 rxa313

https://github.com/actions-runner-controller/actions-runner-controller/blob/b805cfada7863a23946b5ae362d4c071dbaa7a43/config/crd/bases/actions.summerwind.dev_horizontalrunnerautoscalers.yaml#L99-L101 Try setting scaleDownDelaySecondsAfterScaleOut instead.

toast-gear avatar Oct 27 '21 14:10 toast-gear

@toast-gear ,

Thanks so much, that was the issue. The former does not work in my instance, but scaleDownDelaySecondsAfterScaleOut: 60 worked.
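
For reference, this is roughly what the working HRA looks like now, reusing the example names from my earlier manifests:

---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: dev-example-auto-scaler-runnerset
spec:
  scaleDownDelaySecondsAfterScaleOut: 60
  minReplicas: 1
  maxReplicas: 5
  scaleTargetRef:
    kind: RunnerSet
    name: dev-example-runnerset
  scaleUpTriggers:
  - githubEvent: {}
    duration: 3m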

@mumoshu

In my testing, setting it to 60 or 120 seconds scaled down a lot more smoothly than the default 10m, especially while testing how it works. Scaling down as early as a minute is a good amount of time for my organization; I'll do more testing once it's deployed to production.

Is there some type of delay when scaling up from 0? I'm sure you've seen issues where scaling up from 0 causes a no runners found with label error. How can we trigger a run without erroring before the runner spins up?

rxa313 avatar Oct 27 '21 14:10 rxa313

@rxa313 Hey! Thanks for testing it out. Apparently I made a typo when implementing scaleDownDelaySecondsAfterScaleOut 😅

Re: no runners found with label errors, I believe this happens only on GHES (GitHub Enterprise Server), not GitHub Cloud.

For GitHub Enterprise Server, you'll probably receive a fix in GitHub Enterprise Server 3.3 (I think that's the next version). Until then, you'd need to use RunnerDeployment to avoid the no runners found with label error on GHES.

Summary: Scale-from-zero works with...

  • GHES 3.2 + RunnerDeployment (thanks to our registration-only runners)
  • GH + RunnerDeployment (GH supports scale-from-zero out of the box since 20 Sep: https://github.blog/changelog/2021-09-20-github-actions-ephemeral-self-hosted-runners-new-webhooks-for-auto-scaling/)
  • GH + RunnerSet (same as above)

Scale-from-zero doesn't work with...

  • GHES 3.2 + RunnerSet (GHES 3.2 doesn't support scale-from-zero out of the box and RunnerSet doesn't support registration-only runners)

mumoshu avatar Oct 27 '21 23:10 mumoshu

@rxa313 you'll know whether 3.3 has the fix because the routing logic documentation for Enterprise Server 3.3 will match the logic in the Enterprise Cloud version of the article (version selector in the top right corner of the page). You can see Enterprise Server 3.2 still has the old routing logic, so registration-only runners are needed to scale from zero with <= 3.2.

toast-gear avatar Oct 28 '21 10:10 toast-gear

@mumoshu @toast-gear,

Thanks guys for the insight and information on all of this. I'm going to configure this on my production Rancher & GitHub environments. Please let me know if you need me to test anything else regarding this issue.

rxa313 avatar Oct 28 '21 22:10 rxa313

@mumoshu (@toast-gear as well, for documentation's sake),

I've just deployed my RunnerSet to my production Rancher environment. I've enabled the github-webhook server and the HRA for GHES so far. I have some findings to report.

Re: no runners found with label errors, I believe this happens only on GHES (GitHub Enterprise Server), not GitHub Cloud. For GitHub Enterprise Server, you'll probably receive a fix in GitHub Enterprise Server 3.3 (I think that's the next version). Until then, you'd need to use RunnerDeployment to avoid the no runners found with label error on GHES.

We've run a GHES server upgrade and are now on version 3.2.1. It's obviously not the same as GHEC, but there are some new features.

I've noticed that autoscaling does work on this version of GHES; however, there's a weird notification in the debug output of the github-webhook-server logs:

2021-10-29T18:21:05.615Z	INFO	controllers.Runner	Scale target not found. If this is unexpected, ensure that there is exactly one repository-wide or organizational runner deployment that matches this webhook event	{"event": "workflow_job", "hookID": "xxx", "delivery": xxx", "workflowJob.status": "in_progress", "workflowJob.labels": [], "repository.name": "xx-rancher-actions-runner", "repository.owner.login": "XX-CORP-IT", "repository.owner.type": "Organization", "action": "started"}
2021-10-29T18:21:06.483Z	INFO	controllers.Runner	Scale target not found. If this is unexpected, ensure that there is exactly one repository-wide or organizational runner deployment that matches this webhook event	{"event": "workflow_job", "hookID": "xxx", "delivery": "xxx", "workflowJob.status": "in_progress", "workflowJob.labels": [], "repository.name": "xx-rancher-actions-runner", "repository.owner.login": "XX-CORP-IT", "repository.owner.type": "Organization", "action": "started"}

While this would appear to be an issue because of the Scale target not found notification, it actually does scale up and down correctly, as the INFO entries right after the above log show:

2021-10-29T18:21:18.628Z	DEBUG	controllers.Runner	Found 0 HRAs by key	{"key": "XX-CORP-IT/xxxx"}
2021-10-29T18:21:18.628Z	DEBUG	controllers.Runner	Found 1 HRAs by key	{"key": "XX-CORP-IT"}
2021-10-29T18:21:18.628Z	INFO	controllers.Runner	job scale up target is organizational runners	{"event": "workflow_job", "hookID": "xxxx", "delivery": "xxxxx", "workflowJob.status": "completed", "workflowJob.labels": [], "repository.name": "xxxxx", "repository.owner.login": "XX-CORP-IT", "repository.owner.type": "Organization", "action": "completed", "organization": "XX-CORP-IT"}
2021-10-29T18:21:18.628Z	INFO	controllers.Runner	Patching hra for capacityReservations update	{"before": [{"expirationTime":"2021-10-29T18:23:33Z","replicas":1},{"expirationTime":"2021-10-29T18:23:33Z","replicas":1},{"expirationTime":"2021-10-29T18:23:34Z","replicas":1}], "after": [{"expirationTime":"2021-10-29T18:23:33Z","replicas":1},{"expirationTime":"2021-10-29T18:23:34Z","replicas":1}]}

And this continues for however many runners it needs to spin up; it also scales back down:

2021-10-29T18:21:37.495Z	INFO	controllers.Runner	Patching hra for capacityReservations update	{"before": null, "after": null}
2021-10-29T18:21:37.504Z	INFO	controllers.Runner	scaled xx-corp-it-runner-autoscaler by -1

However, if I take a look at the Recent Deliveries on my GHES Webhook, I can see the last 4 or so say:

scaled xx-corp-it-runner-autoscaler by -1

THEN, the sixth delivery says something like:

no horizontalrunnerautoscaler to scale for this github event

I'm not sure what event this is referring to, but maybe I need to add another hook to catch whatever event it is.


A special note:

Within my Rancher cluster, I'm going to spin up RunnerSets with HRAs for GHEC, and for two organizations on GHES.

I believe I should (and am going to) spin these up in separate namespaces. We already know that labeling the namespaces is the route to take, but when it comes to GHES and my two organizations' RunnerSets, I believe that to avoid potential conflicts and muddying up the logs, users should have only one RunnerSet and one HRA per namespace in Rancher, i.e. a namespace for org 1 with its own RunnerSet and HRA, and a namespace for org 2 with its own config. Can you provide any insight into that?

Thanks!

rxa313 avatar Oct 29 '21 18:10 rxa313

@rxa313 Hey! Thanks a lot for the detailed report.

I've noticed that autoscaling does work on this version of GHES

Oh, awesome! Thanks for letting us know.

I've read the release note of that GHES version but couldn't find any specific change regarding ephemeral runners and new workflow_job webhook events.

Perhaps I'd better ask GitHub folks for confirmation.

Scale target not found notification -- it actually does scale up and down accordingly as the INFO right after the above log is:

Ah, good catch. I've checked our implementation and this should be ok although it looks scary.

The important point here is that each workflow_job event has an action field with one of these values: queued, in_progress, or completed.

We currently handle only queued and completed, where the former triggers a scale-up and the latter triggers a scale-down.

in_progress, on the other hand, is unhandled and results in Scale target not found.

https://github.com/actions-runner-controller/actions-runner-controller/blob/f7e14e06e87961a67f46e22cc33ba2c707b6e682/controllers/horizontal_runner_autoscaler_webhook.go#L201-L223

We'd better fix this so that there are no redundant (and scary) error messages.

mumoshu avatar Oct 31 '21 00:10 mumoshu

@rxa313 I've reread the special note part of your last comment and am still unsure what your goal is.

Are you basically asking me to verify whether your idea below seems valid to me, or something else?

users should have only one RunnerSet and one HRA per namespace in Rancher, i.e. a namespace for org 1 with its own RunnerSet and HRA, and a namespace for org 2 with its own config. Can you provide any insight into that?

mumoshu avatar Oct 31 '21 07:10 mumoshu

@rxa313 I believe https://github.com/actions-runner-controller/actions-runner-controller/pull/927 removes the redundant Scale target not found error messages. Would you mind giving it a try?

mumoshu avatar Oct 31 '21 09:10 mumoshu

@mumoshu

Are you basically asking me to verify whether your idea below seems valid to me, or something else?

Yeah, can you verify whether one HRA/webhook can manage two separate RunnerSets? I don't have my RunnerSets operating at the enterprise level; I have them at the org level.

What I did was create separate namespaces and controllers for each organization and install a RunnerSet/HRA in each respective namespace.

I wasn't sure if an HRA can support two RunnerSets in the same namespace.

As for the fix in #927: I ran a few jobs, checked my logs, and can still see that in_progress workflow jobs are throwing Scale target not found:

Scale target not found. If this is unexpected, ensure that there is exactly one repository-wide or organizational runner deployment that matches this webhook event	{"event": "workflow_job", "hookID": "x", "delivery": "x", "workflowJob.status": "in_progress", "workflowJob.labels": [], "repository.name": "xxx", "repository.owner.login": "XX-CORP-IT", "repository.owner.type": "Organization", "action": "started"}

I have a question about the autoscaling functionality, its limitations, and your recommendations.

Currently, my company's internet security blocks GHEC from reaching into our internal network (it's not just GitHub; they don't allow external internet traffic into our internal network), but our internal network can reach out to GitHub, etc.

Is there a way to reach out to GitHub to get the information needed to determine whether runners need to be spun up?

Basically, instead of GHEC's webhook reaching into our internal Rancher environment to send information, is the reverse possible, where Rancher reaches out to GHEC and checks for workflow_job/other webhook events?

  • If this is not possible, do you have a recommendation for autoscaling using RunnerSet that will work for my use case? The only solution I can think of (while maybe a bit less efficient than webhook autoscaling) is PercentageRunnersBusy:
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-runner-deployment-autoscaler
spec:
  # Runners in the targeted RunnerDeployment won't be scaled down for 5 minutes instead of the default 10 minutes now
  scaleDownDelaySecondsAfterScaleOut: 300
  scaleTargetRef:
    name: example-runner-deployment
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: '0.75'
    scaleDownThreshold: '0.25'
    scaleUpFactor: '2'
    scaleDownFactor: '0.5'

TotalNumberOfQueuedAndInProgressWorkflowRuns requires you to provide a list of repositories, which won't work for me.

rxa313 avatar Nov 02 '21 12:11 rxa313

Basically, instead of GHEC's webhook reaching into our internal Rancher environment to send information, is the reverse possible, where Rancher reaches out to GHEC and checks for workflow_job/other webhook events?

@rxa313 I believe a short-term solution would be the GitHub webhook delivery forwarder I built in https://github.com/actions-runner-controller/actions-runner-controller/pull/682.

We don't yet have a container image build or Helm chart support for that component, so you'd need to build your own container image for the forwarder and a K8s Deployment to deploy it onto your cluster.

mumoshu avatar Nov 03 '21 01:11 mumoshu

@rxa313 Also, I bet PercentageRunnersBusy and TotalNumberOfQueuedAndInProgressWorkflowRuns won't be your solution.

Instead, you'd better file a feature request with GitHub so that they can add something like a "List Workflow Jobs" API whose response includes all the jobs that are queued but missing runners and waiting for new runners to become available. The response should also include the runner labels for the missing runners of the queued jobs.

If we had such an API, it would be very easy to enhance HRA to provide easy-to-configure autoscale functionality like you have today with workflow_job webhook events, but without the webhook.

mumoshu avatar Nov 03 '21 01:11 mumoshu

one HRA/webhook managing two separate RunnerSets

No, that's not possible! You need to configure one HRA per RunnerSet, because the controller maps one "queued" webhook event to an HRA, and then maps the HRA to a RunnerSet or a RunnerDeployment.
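
So with two RunnerSets (for example, one per organization), you would end up with two HRAs, roughly like the sketch below. The resource names here are hypothetical placeholders, and the trigger config just follows the examples earlier in this thread:

---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: org1-runnerset-autoscaler
spec:
  minReplicas: 1
  maxReplicas: 5
  scaleTargetRef:
    kind: RunnerSet
    name: org1-runnerset
  scaleUpTriggers:
  - githubEvent: {}
    duration: 3m
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: org2-runnerset-autoscaler
spec:
  minReplicas: 1
  maxReplicas: 5
  scaleTargetRef:
    kind: RunnerSet
    name: org2-runnerset
  scaleUpTriggers:
  - githubEvent: {}
    duration: 3m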

mumoshu avatar Nov 03 '21 01:11 mumoshu

@mumoshu

We don't yet have a container image build or Helm chart support for that component, so you'd need to build your own container image for the forwarder and a K8s Deployment to deploy it onto your cluster.

Are there plans to release this feature in the near future? I'd love to test it out without having to build my own images for my cluster :)

Thanks for all the info.

rxa313 avatar Nov 03 '21 13:11 rxa313