
ARC Finalizers Cause a deadlock when uninstalling AutoscalingRunnerSet in ArgoCD

Open rteeling-evernorth opened this issue 10 months ago • 16 comments

Checks

  • [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
  • [X] I am using charts that are officially provided

Controller Version

0.9.0

Deployment Method

ArgoCD

Checks

  • [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Install the ARC AutoscalingRunnerSet as an ArgoCD app
2. Uninstall ArgoCD app

Describe the bug

The ArgoCD app cannot be deleted because Argo tries to delete resources that are normally cleaned up by the ARC controller, and their deletion is blocked by the actions.github.com/cleanup-protection finalizer. The affected resources include:

  • AutoscalingRunnerSet
  • kube mode service account
  • kube mode role
  • rs manager role
  • kube mode role binding
  • rs manager role binding
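
For reference, each of the deadlocked resources carries the finalizer in its metadata, roughly like this (a sketch; the resource name is illustrative of a default install):

```yaml
# Illustrative: an AutoscalingRunnerSet whose deletion Argo cannot complete,
# because only the ARC controller is expected to remove this finalizer.
apiVersion: actions.github.com/v1alpha1
kind: AutoscalingRunnerSet
metadata:
  name: arc-runner-set   # hypothetical release name
  finalizers:
    - actions.github.com/cleanup-protection
```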

Describe the expected behavior

The chart should uninstall cleanly when I delete the ArgoCD app.

Additional Context

I suspect this may be resolvable with Helm/Argo annotations on the affected resources; I will test with a fork of the Helm chart.

Controller Logs

My employer's open source contribution policy forbids me from creating public Gists. I will provide redacted logs upon request via zip/tarball.

Runner Pod Logs

My employer's open source contribution policy forbids me from creating public Gists. I will provide redacted logs upon request via zip/tarball.

rteeling-evernorth avatar Apr 15 '24 19:04 rteeling-evernorth

I was able to fix the issue using the ArgoCD annotations argocd.argoproj.io/sync-options: Delete=false and argocd.argoproj.io/sync-wave: "1". I'll open a PR.
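
For anyone applying this by hand before a PR lands, the annotation block added to each affected resource looks roughly like this (a sketch; the sync-wave value matches what I used):

```yaml
metadata:
  annotations:
    # Tell Argo not to delete this resource itself; the ARC controller
    # removes the cleanup-protection finalizer and performs the deletion.
    argocd.argoproj.io/sync-options: Delete=false
    argocd.argoproj.io/sync-wave: "1"
```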

rteeling-evernorth avatar Apr 16 '24 15:04 rteeling-evernorth

@nikola-jokic could I trouble you for your two cents on this?

rteeling-evernorth avatar Apr 19 '24 16:04 rteeling-evernorth

@rteeling-evernorth Thanks for opening an issue for this - I've run into it on nearly every upgrade and end up having to turn off syncing, manually delete the finalizers, and then turn syncing back on, after which things sort themselves out. I've also noticed that even when a new controller comes up (say 0.9.1 when upgrading from 0.9.0), it doesn't seem to properly handle the finalizers on scale-set resources that are still on the old version.

I do use sync-waves on the top level resources to ensure the gha-runner-scale-set-controller syncs first and the gha-runner-scale-set-figure-linux-standard syncs second:

metadata:
  name: gha-runner-scale-set-controller
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1" # before scale-set

---

metadata:
  name: gha-runner-scale-set-figure-linux-standard
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "2" # after controller

Looking at your PR, you set argocd.argoproj.io/sync-wave: "1" - would that conflict in my case with my controller having the same sync wave? Would it make sense to use a negative sync-wave for the AutoscalingRunnerSet to ensure it always runs first?

ahatzz11 avatar Apr 23 '24 22:04 ahatzz11

@ahatzz11 I had the same issue with having to manually resolve the finalizer deadlock. I've installed the controller as an entirely separate ArgoCD app, since the cardinality of scale set controller to scale sets is 1:N. I assume GitHub distributes the controller and runner set charts separately for this reason. You should split them into separate ArgoCD apps for ease of maintenance, if nothing else.

Argo sync waves are only relevant within the context of a single Argo app, so as long as the two charts are managed by separate apps, you won't see any conflict. Once that's done, you just uninstall the scale set app(s) first, then the controller, and you won't hit any deadlocks.
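
A minimal sketch of that split, with the controller and the scale set as two separate Argo Application manifests (the names, namespaces, and chart source details below are placeholders - adjust them to your setup):

```yaml
# Sketch: controller and scale set managed as two independent Argo apps,
# so their sync waves and deletion order never interact.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: arc-controller          # placeholder name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: ghcr.io/actions/actions-runner-controller-charts  # placeholder
    chart: gha-runner-scale-set-controller
    targetRevision: 0.9.0
  destination:
    server: https://kubernetes.default.svc
    namespace: arc-systems
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: arc-runner-set          # placeholder name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: ghcr.io/actions/actions-runner-controller-charts  # placeholder
    chart: gha-runner-scale-set
    targetRevision: 0.9.0
  destination:
    server: https://kubernetes.default.svc
    namespace: arc-runners
```

To uninstall without a deadlock, delete the arc-runner-set app first and let the controller process the finalizers, then delete arc-controller.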

Hope that helps!

rteeling-evernorth avatar Apr 24 '24 14:04 rteeling-evernorth

@rteeling-evernorth Cool thanks for the explanation - I do have each chart managed as separate apps so we should be good to go there!

ahatzz11 avatar Apr 24 '24 20:04 ahatzz11

@nikola-jokic

What are your thoughts on this issue and corresponding PR?

rteeling-evernorth avatar Apr 30 '24 14:04 rteeling-evernorth

@nikola-jokic @Link- Is it possible to get this issue/PR looked at for the next release? It would be awesome to have upgrades massively simplified for argocd users.

ahatzz11 avatar May 06 '24 12:05 ahatzz11

Thanks for adding the appropriate labels. It's on our radar now.

Link- avatar May 06 '24 12:05 Link-

@ahatzz11 did my change work for you? I'd prefer not to cause an "it works on my machine" sort of problem.

rteeling-evernorth avatar May 15 '24 20:05 rteeling-evernorth

@Link- @nikola-jokic Just curious - what's going on with this? The associated PR has been sitting in the hopper for a while now; is there anything I can do, within reason, to help get this released?

rteeling-evernorth avatar Jun 04 '24 14:06 rteeling-evernorth

@nikola-jokic @Link- @rentziass

This issue has been sitting stale for a few months now. Would it be possible to get the associated PR on the next release?

rteeling-evernorth avatar Jun 18 '24 14:06 rteeling-evernorth

Hey everyone,

After discussing this issue with the team, we decided not to apply the annotations at this time. We are aware that many of you use ArgoCD to manage ARC and scale sets, but we don't currently have the capacity to support it. We recommend maintaining your own chart if it requires such modifications, until we officially support this. Thank you @rteeling-evernorth for raising the PR to fix it, and for describing the problem so thoroughly! It will help others apply the changes to their own charts.

nikola-jokic avatar Jun 21 '24 09:06 nikola-jokic

@nikola-jokic, @rteeling-evernorth ,

Since there is no intention to provide capacity for ArgoCD support, could a possible alternative be supporting annotations on all Kubernetes objects managed by the chart?

In the end, if we were able to set annotations everywhere, this could be handled in the values, without requiring a fork of the chart.

AlexandreODelisle avatar Jun 26 '24 12:06 AlexandreODelisle

This is the usual solution - giving the chart user the choice of annotations. In my opinion, it's the best solution for this kind of problem. :+1:

SimonWoidig avatar Jun 26 '24 12:06 SimonWoidig

The problem here is that the ArgoCD annotations are not applied to everything - because this is ultimately an order-of-operations issue, the ability to selectively apply the annotations to specific resources is needed. I agree that a custom annotation value would be useful, but it would not solve this issue.

Additionally, given that the Argo annotations are effectively inert when the resource is not actively managed by Argo, I don't see why this would have any adverse effect when installing via kubectl/helm/flux/anything else. If we really want to split hairs, I can update the PR to put the Argo annotations behind a feature flag if that would satisfy the maintainers. Thoughts, @nikola-jokic?

rteeling-evernorth avatar Jul 09 '24 15:07 rteeling-evernorth

I use Helm + TF to deploy two separate apps into Argo. If I want to destroy the runner set, I get TF timeouts due to finalizers, so I have to manually run the following commands to be able to delete the arc-runners namespace.

I did add argocd.argoproj.io/sync-options: Delete=false, which is able to "remove" the app from Argo, but my TF seems to hang on deleting the namespace - which is created using the CreateNamespace=true sync option.

kubectl patch autoscalingrunnerset arc-runner-set --namespace=arc-runner-set --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'

kubectl patch rolebinding arc-runner-set-gha-rs-manager --namespace=arc-runner-set --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'

kubectl patch role arc-runner-set-gha-rs-manager --namespace=arc-runner-set --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'

kubectl patch serviceaccount arc-runner-set-gha-rs-no-permission --namespace=arc-runner-set --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'

I could have TF (not Argo) create the NS, wrap these four patches in a TF wrapper, and then delete the NS as a workaround during a destroy. However, I am keen to see if there is another way.
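
The four patch commands above could be wrapped in a single loop (a sketch; the resource names match the ones above and may differ per install - by default it only prints the commands rather than running them):

```shell
#!/usr/bin/env bash
# Sketch: the four finalizer-removal patches as one loop.
# Resource names assume the release name "arc-runner-set" and may differ.
# Prints the commands by default; set APPLY=1 to actually run them.
set -euo pipefail
NS=arc-runner-set
PATCH='[{"op": "remove", "path": "/metadata/finalizers"}]'
if [ "${APPLY:-0}" = "1" ]; then RUN=""; else RUN="echo"; fi
for target in \
  autoscalingrunnerset/arc-runner-set \
  rolebinding/arc-runner-set-gha-rs-manager \
  role/arc-runner-set-gha-rs-manager \
  serviceaccount/arc-runner-set-gha-rs-no-permission; do
  # split "kind/name" into the two kubectl arguments
  $RUN kubectl patch "${target%%/*}" "${target##*/}" \
    --namespace="$NS" --type=json -p="$PATCH"
done
```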

lifeofmoo avatar Jul 25 '24 14:07 lifeofmoo