actions-runner-controller

Question: Is it possible to use fargate with the runners?

Open frbk opened this issue 3 years ago • 31 comments

I have been trying to get runners deployed on Fargate and wasn't able to find any info. So far I have encountered a couple of issues:

  • The registration-only pod does not inherit labels from spec:template, which leaves it stuck in limbo. I was able to apply those labels using Argo CD.
  • When both the runner and registration-only pods come up, they seem to crash with this error:
Error: Client creation failed. authentication failed: using private key of size 0 (...): could not parse private key: Invalid Key: Key must be PEM encoded PKCS1 or PKCS8 private key

Here is an example of my config for fargate:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: 4-10-fargate
  namespace: github
spec:
  template:
    metadata:
      labels:
        fargate: "true"
        eks.amazonaws.com/fargate-profile: "github"
    spec:
      repository: <some/repo>
      labels:
        - 4-10-fargate
      resources:
        requests:
          cpu: "4.0"
          memory: "10Gi"
          ephemeral-storage: "5Gi"
      dockerEnabled: false
      image: summerwind/actions-runner-controller

---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: 4-10-fargate
  namespace: github
spec:
  scaleTargetRef:
    name: 4-10-fargate
  minReplicas: 0
  maxReplicas: 64
  metrics:
  - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
    repositoryNames:
    - <some/repo>

Please let me know if you have any suggestions.

frbk avatar Jun 16 '21 04:06 frbk

@frbk Hey!

registration-only pod does not inherit labels from spec:template which causes it to be stuck in the limbo. I was able to apply those labels using argocd.

This is working as intended, but it might be affecting your use case, because Fargate requires your pods to carry certain labels so that it can discover which pods should be deployed onto it. Perhaps we need to fix how actions-runner-controller creates a registration-only runner pod, so that it doesn't rely on empty labels. Or perhaps you can wait for GitHub to add some API and system changes so that we can scale from/to zero without having a registration-only runner. https://github.com/actions-runner-controller/actions-runner-controller/issues/470#issuecomment-841428853
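To illustrate why the empty labels matter: a Fargate profile selects pods by namespace plus a set of labels, and a pod is picked up only when it matches at least one selector. Below is a minimal Python sketch of that matching rule. The selector values and the function name `matches_profile` are assumptions for illustration; the actual profile isn't shown in this thread.

```python
from typing import Mapping, Sequence


def matches_profile(
    namespace: str,
    labels: Mapping[str, str],
    selectors: Sequence[Mapping[str, object]],
) -> bool:
    """Return True if the pod matches at least one selector.

    A selector matches when the namespace is equal and every label
    listed in the selector is present on the pod with the same value.
    """
    for selector in selectors:
        if selector["namespace"] != namespace:
            continue
        wanted = selector.get("labels", {})
        if all(labels.get(key) == value for key, value in wanted.items()):
            return True
    return False


# Hypothetical profile selectors (assumption: not taken from the thread).
profile_selectors = [{"namespace": "github", "labels": {"fargate": "true"}}]

# Labels from the RunnerDeployment template in the original post.
runner_pod_labels = {
    "fargate": "true",
    "eks.amazonaws.com/fargate-profile": "github",
}
# The registration-only pod is created with its labels cleared.
registration_only_pod_labels = {}

print(matches_profile("github", runner_pod_labels, profile_selectors))
print(matches_profile("github", registration_only_pod_labels, profile_selectors))
```

With its labels cleared, the registration-only pod matches no selector, so it is never scheduled onto Fargate.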

When both runner and registration-only pods come up they seem to crash with this error:

The error says that you're trying to deploy it as a GitHub app and the private key you've provided was invalid. Check the content of the K8s secret that contains the private key.
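A quick way to sanity-check the secret value, sketched in Python. It assumes you have already fetched the base64-encoded value of the private-key entry from the Secret (the helper name `check_private_key` is made up here), and it only checks the shape of the key, not whether it actually parses.

```python
import base64

# PEM headers for the two formats named in the error message.
PEM_HEADERS = (
    b"-----BEGIN RSA PRIVATE KEY-----",  # PKCS#1
    b"-----BEGIN PRIVATE KEY-----",      # PKCS#8
)


def check_private_key(secret_value_b64: str) -> str:
    """Classify a base64-encoded Secret value as 'empty', 'not PEM' or 'ok'."""
    key = base64.b64decode(secret_value_b64).strip()
    if not key:
        # Corresponds to "using private key of size 0" in the error.
        return "empty"
    if not key.startswith(PEM_HEADERS):
        return "not PEM"
    return "ok"


# Made-up placeholder content; not a real key.
sample = base64.b64encode(
    b"-----BEGIN RSA PRIVATE KEY-----\n...\n-----END RSA PRIVATE KEY-----\n"
).decode()
print(check_private_key(""))
print(check_private_key(sample))
```

An "empty" result would point at a missing or unmounted secret rather than a malformed key.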

mumoshu avatar Jun 16 '21 05:06 mumoshu

And most importantly, does Fargate support deploying privileged containers today? In a standard setup, your runner pods and containers need to be privileged to work, especially for docker-in-docker. I thought there was some way to run dind without privileges, but you would need to set privileged: false on your runner spec and figure out the other settings to make it work on Fargate, I think.

mumoshu avatar Jun 16 '21 05:06 mumoshu

Hey @mumoshu. Thanks for the reply. Is privileged: false part of the helm chart? Also, I am reusing the same token that I use without Fargate. I deployed two types of runners, a Fargate one and a normal one that just uses machines, and the normal one worked fine with that token, but I will investigate. Sadly, Fargate doesn't work with privileged containers. For my use case I don't need them, because I am trying to run a bunch of rspec tests in the runner with some services and was planning on adding those services as sidecars.

frbk avatar Jun 16 '21 05:06 frbk

I kinda assumed that I could replicate what GitLab CI is doing.

frbk avatar Jun 16 '21 05:06 frbk

For my use case I dont need it because I am trying to run a bunch of rspec tests in the runner with some services and was planning on adding those services as sidecars.

@frbk Ah, gotcha! Then it should theoretically work if you set dockerEnabled: false https://github.com/actions-runner-controller/actions-runner-controller/blob/dc5f90025cdf5382d8d1b347483dacf0f3d3757b/api/v1alpha1/runner_types.go#L100-L101

But the issue with the empty private key would still be a blocker. BTW, to be extra clear: which pod showed the Error: Client creation failed. authentication failed: log? actions-runner-controller, or a runner pod?

mumoshu avatar Jun 16 '21 07:06 mumoshu

privileged: false part of the helm chart?

Nope. It's computed depending on the runner spec provided by you. https://github.com/actions-runner-controller/actions-runner-controller/blob/dc5f90025cdf5382d8d1b347483dacf0f3d3757b/controllers/runner_controller.go#L705

mumoshu avatar Jun 16 '21 07:06 mumoshu

I get this error on the runner pod. actions-runner-controller is good. It seems that the secret is not being mounted when I use fargate. I am going to try mounting it in RunnerDeployment and see if that works.

frbk avatar Jun 16 '21 14:06 frbk

I have done a bit more investigating and these are the findings. It looks like the runner pod is not mounting secrets when running on Fargate. I was able to solve this by mounting these secrets in the RunnerDeployment, and it looks like this now:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: 4-10-fargate
  namespace: github
spec:
  template:
    metadata:
      labels:
        fargate: "true"
        eks.amazonaws.com/fargate-profile: "github"
    spec:
      serviceAccountName: "actions-runner-controller"
      repository: <some/repo>
      labels:
        - 4-10-fargate
      resources:
        requests:
          cpu: "4.0"
          memory: "10Gi"
          ephemeral-storage: "5Gi"
      dockerEnabled: false
      image: summerwind/actions-runner-controller
      env:
        - name: GITHUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: controller-manager
              key: github_token
              optional: true
        - name: GITHUB_APP_ID
          valueFrom:
            secretKeyRef:
              name: controller-manager
              key: github_app_id
              optional: true
        - name: GITHUB_APP_INSTALLATION_ID
          valueFrom:
            secretKeyRef:
              name: controller-manager
              key: github_app_installation_id
              optional: true
        - name: GITHUB_APP_PRIVATE_KEY
          value: /etc/actions-runner-controller/github_app_private_key
      volumeMounts:
        - name: controller-manager
          mountPath: "/etc/actions-runner-controller"
          readOnly: true
        - mountPath: /tmp/k8s-webhook-server/serving-certs
          name: cert
          readOnly: true
      volumes:
        - name: controller-manager
          secret:
            secretName: controller-manager
        - name: cert
          secret:
            defaultMode: 420
            secretName: webhook-server-cert

---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: 4-10-fargate
  namespace: github
spec:
  scaleTargetRef:
    name: 4-10-fargate
  minReplicas: 0
  maxReplicas: 64
  metrics:
  - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
    repositoryNames:
    - <some/repo>

However, this doesn't seem to work because the runner gets stuck on authentication. It also looks like the runner gets converted into a manager. Here is an example of the log:

2021-06-16T15:34:37.876Z	INFO	controller-runtime.metrics	metrics server is starting to listen	{"addr": ":8080"}
2021-06-16T15:34:37.877Z	INFO	actions-runner-controller	Initializing actions-runner-controller	{"github-api-cahce-duration": "9m50s", "sync-period": "10m0s", "runner-image": "summerwind/actions-runner:latest", "docker-image": "docker:dind", "common-runnner-labels": null, "watch-namespace": ""}
2021-06-16T15:34:37.877Z	INFO	controller-runtime.builder	Registering a mutating webhook	{"GVK": "actions.summerwind.dev/v1alpha1, Kind=Runner", "path": "/mutate-actions-summerwind-dev-v1alpha1-runner"}
2021-06-16T15:34:37.877Z	INFO	controller-runtime.webhook	registering webhook	{"path": "/mutate-actions-summerwind-dev-v1alpha1-runner"}
2021-06-16T15:34:37.877Z	INFO	controller-runtime.builder	Registering a validating webhook	{"GVK": "actions.summerwind.dev/v1alpha1, Kind=Runner", "path": "/validate-actions-summerwind-dev-v1alpha1-runner"}
2021-06-16T15:34:37.877Z	INFO	controller-runtime.webhook	registering webhook	{"path": "/validate-actions-summerwind-dev-v1alpha1-runner"}
2021-06-16T15:34:37.877Z	INFO	controller-runtime.builder	Registering a mutating webhook	{"GVK": "actions.summerwind.dev/v1alpha1, Kind=RunnerDeployment", "path": "/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment"}
2021-06-16T15:34:37.877Z	INFO	controller-runtime.webhook	registering webhook	{"path": "/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment"}
2021-06-16T15:34:37.877Z	INFO	controller-runtime.builder	Registering a validating webhook	{"GVK": "actions.summerwind.dev/v1alpha1, Kind=RunnerDeployment", "path": "/validate-actions-summerwind-dev-v1alpha1-runnerdeployment"}
2021-06-16T15:34:37.877Z	INFO	controller-runtime.webhook	registering webhook	{"path": "/validate-actions-summerwind-dev-v1alpha1-runnerdeployment"}
2021-06-16T15:34:37.877Z	INFO	controller-runtime.builder	Registering a mutating webhook	{"GVK": "actions.summerwind.dev/v1alpha1, Kind=RunnerReplicaSet", "path": "/mutate-actions-summerwind-dev-v1alpha1-runnerreplicaset"}
2021-06-16T15:34:37.877Z	INFO	controller-runtime.webhook	registering webhook	{"path": "/mutate-actions-summerwind-dev-v1alpha1-runnerreplicaset"}
2021-06-16T15:34:37.877Z	INFO	controller-runtime.builder	Registering a validating webhook	{"GVK": "actions.summerwind.dev/v1alpha1, Kind=RunnerReplicaSet", "path": "/validate-actions-summerwind-dev-v1alpha1-runnerreplicaset"}
2021-06-16T15:34:37.877Z	INFO	controller-runtime.webhook	registering webhook	{"path": "/validate-actions-summerwind-dev-v1alpha1-runnerreplicaset"}
2021-06-16T15:34:37.877Z	INFO	actions-runner-controller	starting manager
2021-06-16T15:34:37.877Z	INFO	controller-runtime.manager	starting metrics server	{"path": "/metrics"}
2021-06-16T15:34:37.977Z	INFO	controller-runtime.webhook.webhooks	starting webhook server
2021-06-16T15:34:37.977Z	INFO	controller-runtime.controller	Starting EventSource	{"controller": "runnerreplicaset-controller", "source": "kind source: /, Kind="}
2021-06-16T15:34:37.978Z	INFO	controller-runtime.controller	Starting EventSource	{"controller": "runnerreplicaset-controller", "source": "kind source: /, Kind="}
2021-06-16T15:34:37.978Z	INFO	controller-runtime.controller	Starting EventSource	{"controller": "horizontalrunnerautoscaler-controller", "source": "kind source: /, Kind="}
2021-06-16T15:34:37.978Z	INFO	controller-runtime.controller	Starting EventSource	{"controller": "runner-controller", "source": "kind source: /, Kind="}
2021-06-16T15:34:37.979Z	INFO	controller-runtime.controller	Starting EventSource	{"controller": "runnerdeployment-controller", "source": "kind source: /, Kind="}
2021-06-16T15:34:37.978Z	INFO	controller-runtime.certwatcher	Updated current TLS certificate
2021-06-16T15:34:37.979Z	INFO	controller-runtime.webhook	serving webhook server	{"host": "", "port": 9443}
2021-06-16T15:34:37.979Z	INFO	controller-runtime.certwatcher	Starting certificate watcher
2021-06-16T15:34:38.078Z	INFO	controller-runtime.controller	Starting Controller	{"controller": "runnerreplicaset-controller"}
2021-06-16T15:34:38.078Z	INFO	controller-runtime.controller	Starting Controller	{"controller": "horizontalrunnerautoscaler-controller"}
2021-06-16T15:34:38.079Z	INFO	controller-runtime.controller	Starting EventSource	{"controller": "runner-controller", "source": "kind source: /, Kind="}
2021-06-16T15:34:38.079Z	INFO	controller-runtime.controller	Starting EventSource	{"controller": "runnerdeployment-controller", "source": "kind source: /, Kind="}
2021-06-16T15:34:38.079Z	INFO	controller-runtime.controller	Starting Controller	{"controller": "runnerdeployment-controller"}
2021-06-16T15:34:38.179Z	INFO	controller-runtime.controller	Starting workers	{"controller": "horizontalrunnerautoscaler-controller", "worker count": 1}
2021-06-16T15:34:38.179Z	INFO	controller-runtime.controller	Starting workers	{"controller": "runnerreplicaset-controller", "worker count": 1}
2021-06-16T15:34:38.179Z	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Calculated desired replicas of 1	{"horizontalrunnerautoscaler": "github/4-10-fargate", "suggested": 1, "reserved": 0, "min": 1, "cached": 1, "max": 64}
2021-06-16T15:34:38.179Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "horizontalrunnerautoscaler-controller", "request": "github/4-10-fargate"}
2021-06-16T15:34:38.179Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "runnerreplicaset-controller", "request": "github/4-10-fargate-mfrgk"}
2021-06-16T15:34:38.279Z	INFO	controller-runtime.controller	Starting Controller	{"controller": "runner-controller"}
2021-06-16T15:34:38.279Z	INFO	controller-runtime.controller	Starting workers	{"controller": "runnerdeployment-controller", "worker count": 1}
2021-06-16T15:34:38.280Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "runnerdeployment-controller", "request": "github/4-10-fargate"}
2021-06-16T15:34:38.379Z	INFO	controller-runtime.controller	Starting workers	{"controller": "runner-controller", "worker count": 1}
2021-06-16T15:34:38.380Z	INFO	actions-runner-controller.runner	Skipped registration check because it's deferred until 2021-06-16 15:35:29 +0000 UTC. Retrying in 49.619892818s at latest	{"runner": "github/4-10-fargate-mfrgk-9sprx", "lastRegistrationCheckTime": "2021-06-16 15:34:29 +0000 UTC", "registrationCheckInterval": "1m0s"}
2021-06-16T15:35:28.125Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "runnerreplicaset-controller", "request": "github/4-10-fargate-mfrgk"}
2021-06-16T15:35:28.276Z	DEBUG	actions-runner-controller.runner	Runner pod exists but we failed to check if runner is busy. Apparently it still needs more time.	{"runner": "github/4-10-fargate-mfrgk-9sprx", "runnerName": "4-10-fargate-mfrgk-9sprx"}
2021-06-16T15:35:28.276Z	DEBUG	actions-runner-controller.runner	Rechecking the runner registration in 1m10.468889844s	{"runner": "github/4-10-fargate-mfrgk-9sprx"}
2021-06-16T15:35:28.288Z	INFO	actions-runner-controller.runner	Skipped registration check because it's deferred until 2021-06-16 15:36:28 +0000 UTC. Retrying in 58.711814172s at latest	{"runner": "github/4-10-fargate-mfrgk-9sprx", "lastRegistrationCheckTime": "2021-06-16 15:35:28 +0000 UTC", "registrationCheckInterval": "1m0s"}
2021-06-16T15:36:27.136Z	DEBUG	actions-runner-controller.runner	Runner pod exists but we failed to check if runner is busy. Apparently it still needs more time.	{"runner": "github/4-10-fargate-mfrgk-9sprx", "runnerName": "4-10-fargate-mfrgk-9sprx"}
2021-06-16T15:36:27.136Z	DEBUG	actions-runner-controller.runner	Rechecking the runner registration in 1m10.283034151s	{"runner": "github/4-10-fargate-mfrgk-9sprx"}
2021-06-16T15:36:27.139Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "runnerreplicaset-controller", "request": "github/4-10-fargate-mfrgk"}
2021-06-16T15:36:27.149Z	INFO	actions-runner-controller.runner	Skipped registration check because it's deferred until 2021-06-16 15:37:27 +0000 UTC. Retrying in 58.850736308s at latest	{"runner": "github/4-10-fargate-mfrgk-9sprx", "lastRegistrationCheckTime": "2021-06-16 15:36:27 +0000 UTC", "registrationCheckInterval": "1m0s"}

frbk avatar Jun 16 '21 15:06 frbk

@frbk Thanks. At a glance, the image: summerwind/actions-runner-controller you've written in the RunnerDeployment spec is indeed wrong, as you are basically saying "use this controller image to run this runner", which results in what you see. Or are you saying that Fargate is somehow setting image: summerwind/actions-runner-controller?

mumoshu avatar Jun 16 '21 23:06 mumoshu

FYI, you can use summerwind/actions-runner images https://hub.docker.com/r/summerwind/actions-runner/tags?page=1&ordering=last_updated

mumoshu avatar Jun 16 '21 23:06 mumoshu

OMG! Thanks for pointing out that I was using the wrong image. I am going to update it and redeploy. Will update you shortly!

frbk avatar Jun 16 '21 23:06 frbk

@frbk Thanks for confirming! To be extra sure, let me point out that you should omit env entries like GITHUB_TOKEN. The necessary envs are configured by the controller, so you shouldn't need to set them yourself. Please share your latest RunnerDeployment YAML and I can verify whether it's good or bad!
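For what it's worth, the two mistakes discussed so far (pointing the runner at the controller image and passing controller credentials as env) are easy to lint for. A rough Python sketch follows; `lint_runner_spec` is a hypothetical helper, and the env names are just the ones from the RunnerDeployment posted above, not an official list.

```python
from typing import Any, Dict, List

CONTROLLER_IMAGE = "summerwind/actions-runner-controller"

# Env names taken from the RunnerDeployment earlier in this thread
# (assumption: treated here as controller-only credentials).
CONTROLLER_ENV_NAMES = {
    "GITHUB_TOKEN",
    "GITHUB_APP_ID",
    "GITHUB_APP_INSTALLATION_ID",
    "GITHUB_APP_PRIVATE_KEY",
}


def lint_runner_spec(spec: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in a runner template spec."""
    problems: List[str] = []
    # Compare the repository part only, ignoring an optional ":tag" suffix.
    image_repo = spec.get("image", "").split(":")[0]
    if image_repo == CONTROLLER_IMAGE:
        problems.append(f"image {image_repo!r} is the controller image, not a runner image")
    for env in spec.get("env", []):
        if env.get("name") in CONTROLLER_ENV_NAMES:
            problems.append(f"env {env['name']} should be left to the controller")
    return problems


bad_spec = {
    "image": "summerwind/actions-runner-controller",
    "env": [{"name": "GITHUB_TOKEN"}],
}
good_spec = {"image": "summerwind/actions-runner", "dockerEnabled": False}

print(lint_runner_spec(bad_spec))
print(lint_runner_spec(good_spec))
```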

mumoshu avatar Jun 17 '21 00:06 mumoshu

@mumoshu Here is an updated config which seems to work on Fargate:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: 4-10-fargate
  namespace: github
spec:
  template:
    metadata:
      labels:
        fargate: "true"
        eks.amazonaws.com/fargate-profile: "github"
    spec:
      repository: <some/repo>
      labels:
        - 4-10-fargate
      resources:
        requests:
          cpu: "4.0"
          memory: "10Gi"
          ephemeral-storage: "5Gi"
      dockerEnabled: false
      image: summerwind/actions-runner
      sidecarContainers:
        - name: mysql
          image: mysql:latest
          env:
            - name: MYSQL_USER
              value: root
            - name: MYSQL_ALLOW_EMPTY_PASSWORD
              value: "true"
        - name: elasticsearch
          image: elasticsearch:latest
        - name: redis
          image: redis:latest
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: 4-10-fargate
  namespace: github
spec:
  scaleTargetRef:
    name: 4-10-fargate
  minReplicas: 0
  maxReplicas: 64
  metrics:
  - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
    repositoryNames:
    - <some/repo>

I only had to manually update config for registration-only pod to include labels as I mentioned before.

frbk avatar Jun 17 '21 05:06 frbk

@frbk Awesome! Thanks a lot for sharing your experience!

I only had to manually update config for registration-only pod to include labels as I mentioned before.

I was thinking about this a bit. This can possibly be automated by just removing this line from the actions-runner-controller code:

https://github.com/actions-runner-controller/actions-runner-controller/blob/f2e2060ff8cbba6ab18e898e240ddf4afd65eb27/controllers/runnerreplicaset_controller.go#L162

It would be great if you could try removing the code, building and pushing a custom image by running DOCKER_USER=$YOUR_DOCKERHUB_ACCOUNT_NAME make docker-build docker-push, and redeploying your controller to see if it resolves your issue 🙏

FYI, you can find definitions for docker-build and docker-push targets at https://github.com/actions-runner-controller/actions-runner-controller/blob/f2e2060ff8cbba6ab18e898e240ddf4afd65eb27/Makefile#L120-L122 and https://github.com/actions-runner-controller/actions-runner-controller/blob/f2e2060ff8cbba6ab18e898e240ddf4afd65eb27/Makefile#L137-L139.

mumoshu avatar Jun 17 '21 06:06 mumoshu

Thanks for the info! Will give this a shot.

frbk avatar Jun 17 '21 13:06 frbk

@mumoshu Tried your suggestions and removing runnerForScaleFromToZero.ObjectMeta.Labels = nil seemed to work! :tada:

frbk avatar Jun 18 '21 01:06 frbk

@frbk Awesome! Just to be sure, did scaling both to and from zero work, and do the replica numbers shown in kubectl get runnerdeployment seem correct?

mumoshu avatar Jun 18 '21 01:06 mumoshu

Looks like it @mumoshu . Example of nothing running on the ci:

NAME                 DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
4-10-fargate         0         0         0            0           7m27s

Executed one job on the ci:

NAME                 DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
4-10-fargate         1         1         1            0           18m

For testing purposes I set maxReplicas to 1.

frbk avatar Jun 18 '21 02:06 frbk

Just finished running a pipeline with 18 jobs in it and it was able to scale up and down with no issues :tada:
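For anyone reading along, the scaling behaviour seen here is consistent with a simple rule: take the number of queued plus in-progress runs and clamp it between minReplicas and maxReplicas. The Python below is a sketch under that assumption, not the controller's actual code; `desired_replicas` is a hypothetical name.

```python
def desired_replicas(queued: int, in_progress: int, min_replicas: int, max_replicas: int) -> int:
    """Clamp (queued + in_progress) into [min_replicas, max_replicas]."""
    suggested = queued + in_progress
    return max(min_replicas, min(suggested, max_replicas))


# Nothing running: scales down to minReplicas (0 in the config above).
print(desired_replicas(queued=0, in_progress=0, min_replicas=0, max_replicas=64))
# One job with maxReplicas temporarily set to 1 for testing.
print(desired_replicas(queued=1, in_progress=0, min_replicas=0, max_replicas=1))
# More work than maxReplicas allows: capped at the maximum.
print(desired_replicas(queued=100, in_progress=0, min_replicas=0, max_replicas=64))
```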

frbk avatar Jun 18 '21 02:06 frbk

@frbk Thanks a lot for confirming! Let me add this to our documentation with a big "thanks to @frbk" note, and also apply the patch https://github.com/actions-runner-controller/actions-runner-controller/issues/631#issuecomment-862959111 to our main branch so that you no longer need to use the fork just for the one-line change.

As this is an open-source and open-development project, I would also appreciate it very much if you could submit a pull request for either (or even both) of these changes yourself!

mumoshu avatar Jun 21 '21 03:06 mumoshu

Going to open a PR related to everything we talked about in this issue. Was gathering some info for documentation.

frbk avatar Jun 22 '21 12:06 frbk

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 22 '21 13:07 stale[bot]

@mumoshu Tried your suggestions and removing runnerForScaleFromToZero.ObjectMeta.Labels = nil seemed to work!

Probably this code change on runnerForScaleFromToZero isn't needed anymore. We no longer create registration-only runners for scale-from-to-zero in recent versions of ARC.

mumoshu avatar Apr 14 '22 00:04 mumoshu

@frbk Hey! How have your Fargate-based runners been working since then?

mumoshu avatar Apr 14 '22 00:04 mumoshu

Hey @mumoshu. I have since moved to another company; however, the runners were working fine as long as you didn't need docker-in-docker. I will try to provide a bit more info later this week. Just need to go over my old notes. Also, I see you changed the implementation for scaling from zero. I will try this over the weekend and will let you know.

frbk avatar Apr 14 '22 01:04 frbk

Give me a couple more days. I had to set up a test EKS cluster and it took a bit longer than I was expecting. Will update after I try out the latest version of the controller.

frbk avatar Apr 18 '22 13:04 frbk

Didn't forget about this. Schedule is a bit all over the place at the moment. 😭

frbk avatar Apr 26 '22 02:04 frbk

@frbk Thanks! I'm looking forward to your report ☺️

mumoshu avatar Apr 26 '22 02:04 mumoshu

Hi @frbk @mumoshu, I'm looking to implement the runner on Fargate as well. Is there anything I should be aware of? Is the runnerForScaleFromToZero.ObjectMeta.Labels = nil change still needed?

NoamGoren avatar Jul 03 '22 15:07 NoamGoren

@NoamGoren Honestly, I have never tried it myself so I'm afraid I have nothing to share with you! What I can say, FWIW, is that ARC does not rely on registration-only runners anymore. So there may be a chance that it would work without any modifications now.

mumoshu avatar Jul 03 '22 21:07 mumoshu