actions-runner-controller
ARC Finalizers Cause a deadlock when uninstalling AutoscalingRunnerSet in ArgoCD
Checks
- [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- [X] I am using charts that are officially provided
Controller Version
0.9.0
Deployment Method
ArgoCD
Checks
- [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
- [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
1. Install the ARC AutoscalingRunnerSet as an ArgoCD app
2. Uninstall ArgoCD app
Describe the bug
The ArgoCD app cannot be deleted because Argo tries to delete resources that are normally deleted by the ARC controller, and normal deletion is blocked by the actions.github.com/cleanup-protection finalizer (see the sketch after the list below). The affected resources include:
- AutoscalingRunnerSet
- kube mode service account
- kube mode role
- rs manager role
- kube mode role binding
- rs manager role binding
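For illustration, a stuck resource looks roughly like this (a sketch only, assuming a typical arc-runner-set release name and namespace - the deletionTimestamp is set by the Argo-initiated delete, but the object cannot go away while the finalizer remains):

apiVersion: actions.github.com/v1alpha1
kind: AutoscalingRunnerSet
metadata:
  name: arc-runner-set                        # assumed release name
  namespace: arc-runners                      # assumed install namespace
  deletionTimestamp: "2024-01-01T00:00:00Z"   # set when Argo issues the delete
  finalizers:
    - actions.github.com/cleanup-protection   # blocks deletion until the ARC controller removes it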
Describe the expected behavior
The chart should uninstall cleanly when I delete the ArgoCD app.
Additional Context
I suspect this may be resolvable with Helm/Argo annotations on the affected resources; I will test with a fork of the Helm chart.
Controller Logs
My employer's open source contribution policy forbids me from creating public Gists. I will provide redacted logs upon request via zip/tarball.
Runner Pod Logs
My employer's open source contribution policy forbids me from creating public Gists. I will provide redacted logs upon request via zip/tarball.
I was able to successfully fix the issue using the ArgoCD annotations argocd.argoproj.io/sync-options: Delete=false and argocd.argoproj.io/sync-wave: "1" - I'll open a PR.
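As a rough sketch, the rendered AutoscalingRunnerSet ends up with metadata like the following (the release name is just an example):

metadata:
  name: arc-runner-set                              # example release name
  annotations:
    argocd.argoproj.io/sync-options: Delete=false   # Argo leaves deletion to the ARC controller
    argocd.argoproj.io/sync-wave: "1"               # ordering relative to the rest of the chart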
@nikola-jokic could I trouble you for your two cents on this?
@rteeling-evernorth Thanks for making an issue for this - I've run into this issue on nearly every upgrade and end up having to turn off syncing, manually delete the finalizers, and then turn syncing back on, and it ends up sorting itself out. I also have noticed that even if a new controller comes up (say 0.9.1 when upgrading from 0.9.0), it doesn't seem like that new controller properly handles the finalizers on the scale-set resources that are on the old version.
I do use sync-waves on the top-level resources to ensure the gha-runner-scale-set-controller syncs first and the gha-runner-scale-set-figure-linux-standard syncs second:
metadata:
  name: gha-runner-scale-set-controller
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1" # before scale-set
---
metadata:
  name: gha-runner-scale-set-figure-linux-standard
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "2" # after controller
Looking at your PR, you set argocd.argoproj.io/sync-wave: "1" - would that conflict in my case with my controller having the same sync wave? Would it make sense to go negative on the sync-wave for the AutoscalingRunnerSet to always ensure that runs first?
@ahatzz11 Had the same issue with having to manually resolve the finalizer deadlock. I've installed the controller in an entirely separate ArgoCD app, since the cardinality of the scale-set controller to scale sets is 1:N - I assume GitHub distributes the controller and runner-set charts separately for this reason. If nothing else, keeping them in separate ArgoCD apps is worth it for ease of maintenance.
Argo sync waves are only relevant within the context of a single Argo app, so as long as the two charts are managed by separate apps, you won't see any conflict. Once this is done, you just uninstall the app for the scale set, then uninstall the controller, and you won't have any deadlock issues.
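A minimal sketch of that two-app layout (app names, versions, and destination namespaces are placeholders, not taken from a real setup, and the registry is assumed to be configured in Argo as an OCI Helm repo):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: arc-controller                  # placeholder app name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: ghcr.io/actions/actions-runner-controller-charts   # official OCI chart registry
    chart: gha-runner-scale-set-controller
    targetRevision: 0.9.0
  destination:
    server: https://kubernetes.default.svc
    namespace: arc-systems
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: arc-runner-set                  # placeholder app name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: ghcr.io/actions/actions-runner-controller-charts
    chart: gha-runner-scale-set
    targetRevision: 0.9.0
  destination:
    server: https://kubernetes.default.svc
    namespace: arc-runners

Since each chart is its own Application, the sync-wave values inside one app don't interact with the waves in the other.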
Hope that helps!
@rteeling-evernorth Cool, thanks for the explanation - I do have each chart managed as a separate app, so we should be good to go there!
@nikola-jokic What are your thoughts on this issue and the corresponding PR?
@nikola-jokic @Link- Is it possible to get this issue/PR looked at for the next release? It would be awesome to have upgrades massively simplified for ArgoCD users.
Thanks for adding the appropriate labels. It's on our radar now.
@ahatzz11 did my change work for you? I'd prefer not to cause an "it works on my machine" sort of problem.
@Link- @nikola-jokic Just curious - what's going on with this? The associated PR has been sitting in the hopper for a while now; is there anything I can do within reason to help get this released?
@nikola-jokic @Link- @rentziass
This issue has been sitting stale for a few months now. Would it be possible to get the associated PR on the next release?
Hey everyone,
After discussing this issue with the team, we decided not to apply annotations at this time. We are aware that many of you are using ArgoCD to manage ARC and scale sets, but we don't have capacity to support it at this time. We would recommend that you maintain your own chart if it requires such modifications until we officially support it. Thank you, @rteeling-evernorth, for raising the PR to fix it, and thank you for describing this problem so thoroughly! It will help other people apply changes to their own charts.
@nikola-jokic, @rteeling-evernorth,
Since there is no intention to provide capacity for ArgoCD support, could a possibility be to support annotations on all Kubernetes objects managed by the chart?
In the end, if we were able to set annotations everywhere, this could be handled in the values, without requiring a fork of the chart.
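A rough sketch of what that could look like from the values side (the commonAnnotations key is hypothetical - the chart does not expose it today; it is shown only to illustrate the idea):

# values.yaml (hypothetical key, not currently supported by the chart)
commonAnnotations:
  argocd.argoproj.io/sync-options: Delete=false
  argocd.argoproj.io/sync-wave: "1"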
This is the usual solution - giving the choice of annotations to the chart user. In my opinion, it's the best solution for this kind of problem. :+1:
The problem here is that the ArgoCD annotations are not applied to everything - because this is ultimately an order-of-operations issue, the ability to selectively apply the annotations to specific resources is needed. I agree a custom annotation value would be useful, but it would not solve this issue.
Additionally, given that the Argo annotations are effectively inert if the resource is not actively being managed by Argo, I don't see why this would have any adverse effect when installing via kubectl/helm/flux/whatever else. If we really want to split hairs, I can update the PR to put the Argo annotations behind a feature flag if that would make the maintainers happy. Thoughts @nikola-jokic?
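For illustration, gating it behind a flag could look something like this in the AutoscalingRunnerSet template (argocd.enabled is a hypothetical value that would default to false in values.yaml):

# AutoscalingRunnerSet template excerpt (illustrative only; argocd.enabled is a hypothetical value)
metadata:
  annotations:
    {{- if .Values.argocd.enabled }}
    argocd.argoproj.io/sync-options: Delete=false
    argocd.argoproj.io/sync-wave: "1"
    {{- end }}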
I use Helm + TF to deploy two separate apps into Argo. If I want to destroy the runner set, I get TF timeouts due to finalizers, so I have to manually run the following commands to be able to delete the arc-runners namespace.
I did add the argocd.argoproj.io/sync-options: Delete=false annotation, which is able to "remove" the app from Argo, but my TF seems to hang on deleting the namespace - which is created using the CreateNamespace=true sync option.
kubectl patch autoscalingrunnerset arc-runner-set --namespace=arc-runner-set --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'
kubectl patch rolebinding arc-runner-set-gha-rs-manager --namespace=arc-runner-set --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'
kubectl patch role arc-runner-set-gha-rs-manager --namespace=arc-runner-set --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'
kubectl patch serviceaccount arc-runner-set-gha-rs-no-permission --namespace=arc-runner-set --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'
I could have TF (not Argo) create the NS, wrap these 4 patches in a TF wrapper, and then delete the NS as a workaround during a destroy. However, I am keen to see if there is another way.