actions-runner-controller

Controller is "leaking" offline runners (i.e. is not always calling Github's API to remove them)

Open bmbferreira opened this issue 2 years ago • 15 comments

Controller Version

0.25.2

Helm Chart Version

0.20.2

CertManager Version

1.9.0

Deployment Method

Helm

cert-manager installation

  • yes

Checks

  • [X] This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of the contributors and maintainers if your business is critical and you therefore need priority support.)
  • [X] I've read the release notes before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes
  • [X] My actions-runner-controller version (v0.x.y) does support the feature
  • [X] I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue

Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  creationTimestamp: "2022-08-05T08:48:21Z"
  generation: 5145
  name: autochecks
  namespace: github-runner
  resourceVersion: "22349026"
  uid: d5660e91-55bf-44a1-9a9d-de7ea59dcde3
spec:
  effectiveTime: "2022-08-16T17:05:44Z"
  replicas: 5
  selector: null
  template:
    metadata: {}
    spec:
      dockerEnabled: true
      dockerVolumeMounts:
      - mountPath: /home/runner/_work/
        name: shared
        readOnly: false
      dockerdContainerResources:
        limits:
          cpu: 1
          ephemeral-storage: 40Gi
          memory: 4Gi
        requests:
          cpu: 500m
          ephemeral-storage: 4Gi
          memory: 1Gi
      env:
      - name: RUNNER_ALLOW_RUNASROOT
        value: "1"
      image: 123123123123.dkr.ecr.us-east-1.amazonaws.com/github-runner:node-12
      imagePullSecrets:
      - name: kubernetes
      initContainers:
      - command:
        - sh
        - -c
        - cp /tmp/dockercfg/* /home/runner/.docker/
        image: public.ecr.aws/docker/library/busybox:stable-musl
        name: copy-dockerconfig
        volumeMounts:
        - mountPath: /home/runner/.docker/
          name: dockercfg
        - mountPath: /tmp/dockercfg
          name: dockercfg-secret
      labels:
      - automated-checks
      - automated-checks-ephemeral
      nodeSelector:
        eks.amazonaws.com/nodegroup: github-runners-tools-main
      organization: MyOrg
      resources:
        limits:
          cpu: 1
          ephemeral-storage: 40Gi
          memory: 1Gi
        requests:
          cpu: 1
          ephemeral-storage: 4Gi
          memory: 1Gi
      securityContext:
        fsGroup: 1000
      serviceAccountName: github-runner
      tolerations:
      - effect: NoExecute
        key: dedicated
        operator: Equal
        value: github-runners
      volumeMounts:
      - mountPath: /home/runner/.docker/
        name: dockercfg
      - mountPath: /home/runner/_work/
        name: shared
        readOnly: false
      volumes:
      - emptyDir: {}
        name: dockercfg
      - emptyDir: {}
        name: shared
      - name: dockercfg-secret
        secret:
          items:
          - key: .dockerconfigjson
            path: config.json
          secretName: dockercfg
status:
  availableReplicas: 5
  desiredReplicas: 5
  readyReplicas: 5
  replicas: 5
  updatedReplicas: 5
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  creationTimestamp: "2022-08-05T08:48:21Z"
  generation: 7479
  name: autochecks
  namespace: github-runner
  resourceVersion: "22350550"
  uid: 02cf7e48-e161-4e4b-8630-40a28e559ccb
spec:
  capacityReservations:
  - effectiveTime: "2022-08-16T17:06:41Z"
    expirationTime: "2022-08-16T17:11:41Z"
    replicas: 1
  - effectiveTime: "2022-08-16T17:06:46Z"
    expirationTime: "2022-08-16T17:11:46Z"
    replicas: 1
  - effectiveTime: "2022-08-16T17:06:47Z"
    expirationTime: "2022-08-16T17:11:47Z"
    replicas: 1
  - effectiveTime: "2022-08-16T17:06:47Z"
    expirationTime: "2022-08-16T17:11:47Z"
    replicas: 1
  - effectiveTime: "2022-08-16T17:06:49Z"
    expirationTime: "2022-08-16T17:11:49Z"
    replicas: 1
  maxReplicas: 20
  minReplicas: 1
  scaleDownDelaySecondsAfterScaleOut: 30
  scaleTargetRef:
    name: autochecks
  scaleUpTriggers:
  - duration: 5m0s
    githubEvent:
      workflowJob: {}
  scheduledOverrides:
  - endTime: "2022-05-16T05:00:00Z"
    minReplicas: 0
    recurrenceRule:
      frequency: Weekly
    startTime: "2022-05-13T22:00:00Z"
  - endTime: "2022-05-17T05:00:00Z"
    minReplicas: 0
    recurrenceRule:
      frequency: Daily
    startTime: "2022-05-16T22:00:00Z"
status:
  desiredReplicas: 6
  lastSuccessfulScaleOutTime: "2022-08-16T17:06:50Z"
  scheduledOverridesSummary: min=0 time=2022-08-16 22:00:00 +0000 UTC

To Reproduce


1. Shut down nodes constantly, or kill the pods directly with `kubectl delete`. Or just wait a few weeks/months with the cluster autoscaler constantly scaling in and out.
2. Watch the list returned by the `ListRunners` call grow to hundreds of thousands of entries.
3. The controller goes nuts and no longer works properly. Examples: lots of runner pods stuck in Pending state, runner pods not getting deleted, other RunnerDeployments starving without any replicas, and so on.

Describe the bug

The ARC is "leaking" offline runners on github in specific situations, for example when the cluster autoscaler scales out nodes and there's idle runners running there, they don't "auto-remove" themselves from github and IMO that would be responsibility of the controller. This issue was probably introduced on 0.22 since on the release notes it says that there were removed unnecessary "Remove Runner" API Calls. It is visible on https://github.com/organizations/MyOrg/settings/actions/runners that the list increases every day a little bit until it starts affecting the controller that needs to execute the ListRunners function on the autoscaling algorithm (see here). This becomes a problem when we get into a situation where there is thousands of Offline runners accumulated and the controller needs to iterate over all the runners (see here). In my case, I noticed that the controller was not working correctly and there were more than 300 pages of offline runners listed for my organization. After I deleted all the offline runners, the controller started to work again. Now I'm running a script at the end of the day to remove all the offline runners that were "leaked" by the controller during the day to avoid this from happening again.

Describe the expected behavior

I would expect the controller to clean up all offline runners via the GitHub API in every situation. If that doesn't happen, they start to accumulate until the controller is affected when executing the ListRunners function.

Controller Logs

I don't have controller logs from when this was happening, and since I deleted the offline runners it is now working properly. I also noticed that the `ListRunners` function (https://github.com/actions-runner-controller/actions-runner-controller/blob/538e2783d7fde279b84f1ff9351bb1486823660b/github/github.go#L219-L244) doesn't log anything, so logs wouldn't be helpful for debugging this issue anyway.

Runner Pod Logs

https://gist.github.com/bmbferreira/0ddb1efe1fe8e0dcfe4615aec2c6150a

Additional Context

No response

bmbferreira avatar Aug 16 '22 17:08 bmbferreira

@bmbferreira Hey! This does look like a bug, but we do need controller logs to deduce what's causing it. ARC has logic that ensures the runner is removed from GitHub whenever a K8s Runner custom resource is deleted; we use a K8s finalizer to guarantee that.

When a node shutdown happens, the pod might eventually fail because the node it was scheduled on and its containers are gone, but the K8s Runner custom resource won't automatically disappear. It's ARC's responsibility to "clean up" a Runner whose pod disappeared, and the cleanup process does include a step that unregisters/removes the runner object from GitHub so that no offline runners remain.
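
To illustrate the pattern, here is a rough controller-runtime-style sketch (simplified, not ARC's actual code; the finalizer name and the `unregisterFromGitHub` helper are placeholders). The point is that the object stays around, and reconciliation keeps retrying, until the GitHub-side cleanup has succeeded and the finalizer is removed:

```go
// Simplified sketch of the finalizer-based cleanup pattern (not ARC's actual
// code). ARC applies this to its Runner CRD and runner pods; a plain Pod is
// used here only to keep the example self-contained.
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

const runnerFinalizer = "example.com/runner-cleanup" // placeholder name

type RunnerPodReconciler struct {
	client.Client
}

func (r *RunnerPodReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var pod corev1.Pod
	if err := r.Get(ctx, req.NamespacedName, &pod); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	if pod.DeletionTimestamp.IsZero() {
		// Object is live: make sure our finalizer is present so that a later
		// deletion waits for our cleanup.
		if !controllerutil.ContainsFinalizer(&pod, runnerFinalizer) {
			controllerutil.AddFinalizer(&pod, runnerFinalizer)
			return ctrl.Result{}, r.Update(ctx, &pod)
		}
		return ctrl.Result{}, nil
	}

	// Object is being deleted: unregister the runner from GitHub first,
	// then release the object by removing the finalizer.
	if controllerutil.ContainsFinalizer(&pod, runnerFinalizer) {
		if err := unregisterFromGitHub(ctx, pod.Name); err != nil { // placeholder helper
			return ctrl.Result{}, err // requeued and retried until it succeeds
		}
		controllerutil.RemoveFinalizer(&pod, runnerFinalizer)
		if err := r.Update(ctx, &pod); err != nil {
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{}, nil
}

// unregisterFromGitHub stands in for the GitHub "delete self-hosted runner" API call.
func unregisterFromGitHub(ctx context.Context, name string) error { return nil }
```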

Is it that ARC somehow removed the Runner resource before unregistering the runner from GitHub, and/or that it somehow gave up unregistering the runner? In the latter case, I think the controller log should contain some errors due to GitHub API failures. If you have other data points/information to share, that would be super helpful!

Thanks in advance for your support.

mumoshu avatar Aug 17 '22 01:08 mumoshu

I have the same problem. Here are my settings and logs from the controller for one example.

Controller Version: latest (summerwind/actions-runner-controller@sha256:92faf7e9f7f09a6240cdb5eb82eaf448852bdddf2fb77d0a5669fd8e5062b97b)
Helm Chart Version: 0.20.2
CertManager Version: 1.0.4
Deployment Method: Helm (fluxcd)

ops-action-runner-controller-actions-runner-controller-78467m5n manager {"level":"error","ts":"2022-08-17T11:45:23Z","logger":"actions-runner-controller.runnerpod","msg":"Failed to update runner for finalizer removal","runnerpod":"ops/action-runner-deployment-cjlr8-crs7l","error":"pods \"action-runner-deployment-cjlr8-crs7l\" not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}

ops-action-runner-controller-actions-runner-controller-78467m5n manager {"level":"error","ts":"2022-08-17T11:45:23Z","msg":"Reconciler error","controller":"runnerpod-controller","controllerGroup":"","controllerKind":"Pod","pod":{"name":"action-runner-deployment-cjlr8-crs7l","namespace":"ops"},"namespace":"ops","name":"action-runner-deployment-cjlr8-crs7l","reconcileID":"56d292ce-0ec6-4be1-8221-bbd1a383bcff","error":"pods \"action-runner-deployment-cjlr8-crs7l\" not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}

[Screenshot 2022-08-17 13:46:53]

jbuettnerbild avatar Aug 17 '22 11:08 jbuettnerbild

@jbuettnerbild Thanks for sharing!

Hmm, did your K8s control plane automatically remove the pod that was running on the disappeared node, without waiting for the pod's finalizer to be removed?

I thought the K8s contract is that nothing is normally deleted until all finalizers have been removed from the resource by their respective owners. I don't understand how that can happen yet 🤔

Do you, by any chance, have any other jobs or K8s controllers that "force-delete" pods that were running on a disappeared node? A force-delete here means one that doesn't wait for the finalizer to be removed by the finalizer's owner. kubectl delete --force does wait for the finalizer, so it might not be the issue. But if you periodically run kubectl patch $POD to remove finalizers and then kubectl delete $POD for pods on disappeared nodes, it can cause issues like that.

mumoshu avatar Aug 17 '22 12:08 mumoshu

I think I found the reason for this example. We use Spot Instances and the AWS Node Termination Handler. The node where the runner was running was terminated by AWS (instance-terminated-no-capacity). My guess is that the Node Termination Handler couldn't evacuate the runner in time.

I still need to find a solution for that, but I also found another example where the node is still in use and the runner goes offline and is not removed from GitHub. Here are the logs from the controller:

{"level":"error","ts":"2022-08-17T11:40:56Z","logger":"actions-runner-controller.runnerpod","msg":"Failed to update runner for finalizer removal","runnerpod":"ops/action-runner-deployment-cjlr8-h9bpn","error":"pods \"action-runner-deployment-cjlr8-h9bpn\" not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}

and shortly before, from kubelet:

11:40:56.209357 4739 kubelet_pods.go:1243] Killing unwanted pod "action-runner-deployment-cjlr8-h9bpn"
11:40:56.064276 4739 kubelet.go:1955] SyncLoop (PLEG): "action-runner-deployment-cjlr8-h9bpn_ops(c5866e9b-2471-4628-8846-0dfb10e3ee0e)", event: &pleg.PodLifecycleEvent{ID:"c5866e9b-2471-4628-8846-0dfb10e3ee0e", Type:"ContainerDied", Data:"23ce8510f8d8b7a116d91b25f57cd9a1977efe1a24cf9ef481fef17c80ace77c"}
11:39:54.749364 4739 kubelet.go:1955] SyncLoop (PLEG): "action-runner-deployment-cjlr8-h9bpn_ops(c5866e9b-2471-4628-8846-0dfb10e3ee0e)", event: &pleg.PodLifecycleEvent{ID:"c5866e9b-2471-4628-8846-0dfb10e3ee0e", Type:"ContainerStarted", Data:"23ce8510f8d8b7a116d91b25f57cd9a1977efe1a24cf9ef481fef17c80ace77c"}

jbuettnerbild avatar Aug 17 '22 13:08 jbuettnerbild

Thanks!

"msg":"Failed to update runner for finalizer removal","runnerpod":"ops/action-runner-deployment-cjlr8-h9bpn","error":"pods \"action-runner-deployment-cjlr8-h9bpn\" not found"

This is the most interesting part. Who might have actually removed the pod without waiting for the finalizer? 🤔 ARC wants the pod to remain until ARC itself removes the finalizer from it.

mumoshu avatar Aug 17 '22 13:08 mumoshu

Hello again! So, ~24 hours have passed, developers pushed code and deployed, the cluster had a lot of activity, and I can see more offline runners leaked on https://github.com/organizations/MyOrg/settings/actions/runners. For example, this runner, bear-4vfzj-tskmh, shows in the list as offline; if I search for it in the controller logs, this is what I get:

  | August 17th 2022, 10:37:38.734 | 2022-08-17T10:37:38Z	INFO	actions-runner-controller.runnerpod	Runner pod is marked as already unregistered.	{"runnerpod": "github-runner/bear-4vfzj-tskmh"}

  | August 17th 2022, 10:37:38.715 | 2022-08-17T10:37:38Z	INFO	actions-runner-controller.runner	Removed finalizer	{"runner": "github-runner/bear-4vfzj-tskmh"}

  | August 17th 2022, 10:37:18.022 | 2022-08-17T10:37:18Z	INFO	actions-runner-controller.runnerpod	Runner pod is marked as already unregistered.	{"runnerpod": "github-runner/bear-4vfzj-tskmh"}

  | August 17th 2022, 10:37:13.752 | 2022-08-17T10:37:13Z	INFO	actions-runner-controller.runnerpod	Runner pod is marked as already unregistered.	{"runnerpod": "github-runner/bear-4vfzj-tskmh"}

  | August 17th 2022, 10:37:13.705 | 2022-08-17T10:37:13Z	INFO	actions-runner-controller.runnerpod	Runner has just been unregistered.	{"runnerpod": "github-runner/bear-4vfzj-tskmh"}

  | August 17th 2022, 10:37:13.284 | 2022-08-17T10:37:13Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "github-runner/bear-4vfzj-tskmh", "podCreationTimestamp": "2022-08-17 10:37:07 +0000 UTC"}

  | August 17th 2022, 10:37:07.943 | 2022-08-17T10:37:07Z	DEBUG	events	Normal	{"object": {"kind":"Runner","namespace":"github-runner","name":"bear-4vfzj-tskmh","uid":"5bfcca07-e8ea-457d-ad4e-7d96b7df36fb","apiVersion":"actions.summerwind.dev/v1alpha1","resourceVersion":"22987546"}, "reason": "PodCreated", "message": "Created pod 'bear-4vfzj-tskmh'"}

  | August 17th 2022, 10:37:07.943 | 2022-08-17T10:37:07Z	INFO	actions-runner-controller.runner	Created runner pod	{"runner": "github-runner/bear-4vfzj-tskmh", "repository": ""}

  | August 17th 2022, 10:37:07.920 | 2022-08-17T10:37:07Z	DEBUG	actions-runner-controller.runnerreplicaset	Skipped reconcilation because owner is not synced yet	{"runnerreplicaset": "github-runner/bear-4vfzj", "owner": "github-runner/bear-4vfzj-tskmh", "pods": null}

  | August 17th 2022, 10:37:07.918 | 2022-08-17T10:37:07Z	DEBUG	events	Normal	{"object": {"kind":"Runner","namespace":"github-runner","name":"bear-4vfzj-tskmh","uid":"5bfcca07-e8ea-457d-ad4e-7d96b7df36fb","apiVersion":"actions.summerwind.dev/v1alpha1","resourceVersion":"22987543"}, "reason": "RegistrationTokenUpdated", "message": "Successfully update registration token"}

  | August 17th 2022, 10:37:07.918 | 2022-08-17T10:37:07Z	INFO	actions-runner-controller.runner	Updated registration token	{"runner": "bear-4vfzj-tskmh", "repository": ""}

  | August 17th 2022, 10:37:07.902 | 2022-08-17T10:37:07Z	DEBUG	actions-runner-controller.runnerreplicaset	Skipped reconcilation because owner is not synced yet	{"runnerreplicaset": "github-runner/bear-4vfzj", "owner": "github-runner/bear-4vfzj-tskmh", "pods": null}

  | August 17th 2022, 10:37:07.892 | 2022-08-17T10:37:07Z	DEBUG	actions-runner-controller.runnerreplicaset	Skipped reconcilation because owner is not synced yet	{"runnerreplicaset": "github-runner/bear-4vfzj", "owner": "github-runner/bear-4vfzj-tskmh", "pods": null}

  | August 17th 2022, 10:37:07.876 | 2022-08-17T10:37:07Z	DEBUG	actions-runner-controller.runnerreplicaset	Skipped reconcilation because owner is not synced yet	{"runnerreplicaset": "github-runner/bear-4vfzj", "owner": "github-runner/bear-4vfzj-tskmh", "pods": null}

I can also confirm that this pod is not stuck at terminating and that the Runner custom resource no longer exists in the namespace, but the runner still shows as offline in the GitHub UI/API. In total, during the last 24 hours, there were 16 "leaked" runners. After a few weeks/months at this rate, the ListRunners function will start to struggle if I don't run a script to remove all of them at the end of the day. 😓

bmbferreira avatar Aug 17 '22 22:08 bmbferreira

@bmbferreira take a look at https://github.blog/changelog/2022-08-03-github-actions-remove-offline-self-hosted-runners/

jbuettnerbild avatar Aug 18 '22 07:08 jbuettnerbild

~~Nice hack, but still 14 days is a bit too much 🥲~~ I misread the article, it's 1 day for ephemeral runners.

Diontsoumas avatar Aug 18 '22 07:08 Diontsoumas

Oh wow, thanks for sharing that @jbuettnerbild! I had no idea, and it seems really recent! I think the thousands of offline runners I had a few days ago, which had accumulated over months, were not covered by this recent change 😅 Well, if ephemeral runners only stay there for one day as they say, that saves me from running a cleanup script every day 🤷‍♂️

bmbferreira avatar Aug 18 '22 08:08 bmbferreira

Thanks for sharing! In a standard scenario, though, ARC should not miss unregistering runners, hence there should be no dangling offline runners (although it might take a few minutes for a runner to disappear from GitHub). If it doesn't work, something in your cluster is probably preventing ARC from doing proper clean-up.

mumoshu avatar Aug 18 '22 08:08 mumoshu

Hi, one update from my side. After updating the runner to the newest version (2.287.1 -> 2.296.0), the problem with the offline runners is solved. Now when a runner is terminated, the corresponding runner in GitHub disappears.

In the logs I see:

action-runner-deployment-bkx2b-l7x88 runner An error occurred: Access denied. System:ServiceIdentity;DDDDDDDD-DDDD-DDDD-DDDD-DDDDDDDDDDDD needs View permissions to perform the action.
action-runner-deployment-bkx2b-l7x88 runner Runner listener exit with retryable error, re-launch runner in 5 seconds.
action-runner-deployment-bkx2b-l7x88 runner Restarting runner...
action-runner-deployment-bkx2b-l7x88 runner
action-runner-deployment-bkx2b-l7x88 runner √ Connected to GitHub
action-runner-deployment-bkx2b-l7x88 runner
action-runner-deployment-bkx2b-l7x88 runner Failed to create a session. The runner registration has been deleted from the server, please re-configure.
action-runner-deployment-bkx2b-l7x88 runner Runner listener exit with terminated error, stop the service, no retry needed.
action-runner-deployment-bkx2b-l7x88 runner Exiting runner...
- action-runner-deployment-bkx2b-l7x88 › runner

And one more comment about the change from GitHub that automatically removes old offline runners: I can't confirm that. We have a bunch of old runners marked as offline that are several days old. Maybe they get removed after 30 days.

jbuettnerbild avatar Aug 25 '22 13:08 jbuettnerbild

@jbuettnerbild Hey! Thanks for the update. If the 24h automated ephemeral runner cleanup doesn't work, I think it's due to something going wrong on the GitHub side. Could you open a support ticket to GitHub so that they can check their backend to see what's happening?

mumoshu avatar Aug 26 '22 00:08 mumoshu

Facing the same issue. I also tried updating the runner agent to the above version (2.296.0), but it didn't fix it. Running against GitHub Enterprise Server v3.5.4.

enricojonas avatar Aug 31 '22 14:08 enricojonas

Currently it is "working" for us. I say "working" because runners are still "leaked", but they are de-listed from GitHub after ~24 hours. Either way, for those having this issue, I leave these two suggestions:

  1. have some kind of monitoring that periodically checks, via the GitHub API, the number of offline runners still listed (if the list is too big, that will be an issue because the controller will be blocked when calling the list API); see the count sketch after this list
  2. have a cron job that removes the offline runners periodically, to guarantee that the list doesn't grow too much (along the lines of the cleanup sketch earlier in this issue).
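
For suggestion 1, a minimal counting sketch (illustrative only, not a drop-in tool): it assumes `GITHUB_TOKEN` and `GITHUB_ORG` are set, and the threshold of 100 is arbitrary; a non-zero exit can be wired into whatever alerting your cron runner supports.

```go
// Illustrative monitoring check: counts offline self-hosted runners for an
// organization and exits non-zero above a threshold.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
)

func main() {
	org := os.Getenv("GITHUB_ORG")
	token := os.Getenv("GITHUB_TOKEN")
	offline := 0

	for page := 1; ; page++ {
		url := fmt.Sprintf("https://api.github.com/orgs/%s/actions/runners?per_page=100&page=%d", org, page)
		req, err := http.NewRequest("GET", url, nil)
		if err != nil {
			log.Fatal(err)
		}
		req.Header.Set("Authorization", "Bearer "+token)
		req.Header.Set("Accept", "application/vnd.github+json")
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			log.Fatal(err)
		}
		var body struct {
			Runners []struct {
				Status string `json:"status"`
			} `json:"runners"`
		}
		if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
			log.Fatal(err)
		}
		resp.Body.Close()
		if len(body.Runners) == 0 {
			break
		}
		for _, r := range body.Runners {
			if r.Status == "offline" {
				offline++
			}
		}
	}

	fmt.Printf("offline runners: %d\n", offline)
	if offline > 100 {
		os.Exit(1) // alert: too many offline runners accumulating
	}
}
```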

bmbferreira avatar Aug 31 '22 14:08 bmbferreira

BTW, this is 100% the function ARC uses to unregister runners: https://github.com/actions-runner-controller/actions-runner-controller/blob/12c4d9625022859556b8288c9ee3929ebe720d54/controllers/runner_graceful_stop.go#L72

ARC does have a deletion timeout of 1 minute, which can result in leaked runners when it fails to unregister a runner within that minute. But this can only happen when you manually delete a Runner resource and/or a runner Pod resource. https://github.com/actions-runner-controller/actions-runner-controller/blob/5651ba6eaddf46a97af68a34d8b2731cf13b6ba9/controllers/runner_pod_controller.go#L150-L177

If this is happening due to an unseen bug in ARC, can you spot it? I've had no luck so far...

mumoshu avatar Sep 23 '22 00:09 mumoshu

We are experiencing the same issue, but the controller logs look different in our case:

2022-11-29T11:13:07Z    DEBUG   actions-runner-controller.runner    Runner appears to have been registered and running. {"runner": "developers-runner/developers-runner-fzfqf-9sv7l", "podCreationTimestamp": "2022-11-29 11:12:54 +0000 UTC"}
2022-11-29T11:13:07Z    DEBUG   controller-runtime.webhook.webhooks received request    {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runner", "UID": "51b932ff-3152-4e1b-853a-84ed80b17d50", "kind": "actions.summerwind.dev/v1alpha1, Kind=Runner", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runners"}}
2022-11-29T11:13:07Z    DEBUG   controller-runtime.webhook.webhooks wrote response  {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runner", "code": 200, "reason": "", "UID": "51b932ff-3152-4e1b-853a-84ed80b17d50", "allowed": true}
2022-11-29T11:13:07Z    ERROR   actions-runner-controller.runnerreplicaset  Failed to patch owner to have actions-runner/unregistration-request-timestamp annotation    {"runnerreplicaset": "developers-runner/developers-runner-fzfqf", "lastSyncTime": "2022-11-29T11:12:53Z", "effectiveTime": "<nil>", "templateHashDesired": "5b578679f8", "replicasDesired": 10, "replicasPending": 1, "replicasRunning": 10, "replicasMaybeRunning": 11, "templateHashObserved": ["5b578679f8"], "owner": "developers-runner/developers-runner-fzfqf-9sv7l", "error": "Operation cannot be fulfilled on runners.actions.summerwind.dev \"developers-runner-fzfqf-9sv7l\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/actions-runner-controller/actions-runner-controller/controllers.(*RunnerReplicaSetReconciler).Reconcile
    /workspace/controllers/runnerreplicaset_controller.go:131
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234
2022-11-29T11:13:07Z    ERROR   Reconciler error    {"controller": "runnerreplicaset-controller", "controllerGroup": "actions.summerwind.dev", "controllerKind": "RunnerReplicaSet", "runnerReplicaSet": {"name":"developers-runner-fzfqf","namespace":"developers-runner"}, "namespace": "developers-runner", "name": "developers-runner-fzfqf", "reconcileID": "ef7ce398-edad-4d15-a648-9058ec541afe", "error": "Operation cannot be fulfilled on runners.actions.summerwind.dev \"developers-runner-fzfqf-9sv7l\": the object has been modified; please apply your changes to the latest version and try again"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234
2022-11-29T11:13:07Z    DEBUG   controller-runtime.webhook.webhooks received request    {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runner", "UID": "b2b9c604-9651-4b04-a170-9cd20934c98d", "kind": "actions.summerwind.dev/v1alpha1, Kind=Runner", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runners"}}
2022-11-29T11:13:07Z    DEBUG   controller-runtime.webhook.webhooks wrote response  {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runner", "code": 200, "reason": "", "UID": "b2b9c604-9651-4b04-a170-9cd20934c98d", "allowed": true}
2022-11-29T11:13:07Z    DEBUG   controller-runtime.webhook.webhooks received request    {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runner", "UID": "5647d714-c656-4300-b8c9-8628f15e672b", "kind": "actions.summerwind.dev/v1alpha1, Kind=Runner", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runners"}}
2022-11-29T11:13:07Z    INFO    runner-resource validate resource to be updated {"name": "developers-runner-fzfqf-27gjv"}

Is anyone else seeing these logs? We're using chart version 0.20.2. We will try upgrading to the latest version today to see if that makes any difference

gss-jmorris avatar Nov 30 '22 10:11 gss-jmorris

> We are experiencing the same issue, but the controller logs look different in our case: [...] Is anyone else seeing these logs? We're using chart version 0.20.2. We will try upgrading to the latest version today to see if that makes any difference

We are also facing this error, using the latest ARC version, 0.26.0.

cmergenthaler avatar Dec 06 '22 07:12 cmergenthaler

I also have this error with version 0.20.2, but the leaked runners are gone. Looks like it is working as it should.

jbuettnerbild avatar Dec 06 '22 08:12 jbuettnerbild