operator-sdk
Add helm.sdk.operatorframework.io/uninstall-failure-ignore or "uninstall-delete-retries"
Feature Request
Describe the problem you need a feature to resolve.
When your helm chart includes a namespace (and objects that go into it), you can run into race conditions on delete.
It manifests as an error similar to this:
{"level":"error","ts":"2023-09-16T11:44:55Z","msg":"Reconciler error","controller":"press-controller","object":{"name":"foo","namespace":"ns2"},"namespace":"ns2","name":"foo","reconcileID":"3a7fda50-4d60-4afc-b958-9d1db3277123","error":"failed to delete release: foo","errorVerbose":"failed to delete release: foo\nhelm.sh/helm/v3/pkg/action.(*Uninstall).Run\n\t/go/pkg/mod/helm.sh/helm/[email protected]/pkg/action/uninstall.go:118\ngithub.com/operator-framework/operator-sdk/internal/helm/release.manager.UninstallRelease\n\t/workspace/internal/helm/release/manager.go:372\ngithub.com/operator-framework/operator-sdk/internal/helm/controller.HelmOperatorReconciler.Reconcile\n\t/workspace/internal/helm/controller/reconcile.go:126\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1172","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235"}
I've currently got that set up with a helm operator that creates a namespace and several objects within it. The deletion of the namespace removes many other objects the helm operator is expecting to be able to remove itself.
This could also be a general helm issue (it should remove namespaces last); however, there may be other reasons objects get permanently stuck deleting, and the user may want to simply orphan those objects so that these CRs can still be cleaned up easily. Adding an overt annotation for this behavior would allow them to make that choice.
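For concreteness, the kind of chart layout being described is roughly the following (the chart name, template path, and value name are illustrative, not taken from a real project):

$ cat helm-charts/example/templates/namespace.yaml
# Namespace created by the chart; the chart's other objects (Deployments,
# ConfigMaps, other CRs) are templated into this same namespace, which is
# what sets up the delete race described above.
apiVersion: v1
kind: Namespace
metadata:
  name: {{ .Values.managedNamespace }}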
Describe the solution you'd like.
I would like to add an annotation to ignore delete failures and/or track the number of delete retries before removing the finalizer, allowing the CR to be removed from the system and from reconciliation.
This may be useful for both helm and ansible.
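As a rough sketch of how the proposed annotation might be used (the annotation name comes from this request and does not exist in operator-sdk today; the resource kind and names below are illustrative):

$ kubectl annotate press foo -n ns2 helm.sdk.operatorframework.io/uninstall-failure-ignore=true
$ kubectl delete press foo -n ns2
# With the annotation set, the operator would drop its finalizer after the
# uninstall failure (or after the configured number of retries) instead of
# blocking deletion of the CR indefinitely.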
/language ansible
/language helm
@cuppett could you elaborate on:
The delete of the namespace removes many other objects the helm operator is expecting to be able to remove.
Are you deleting the CR entirely, or deleting the namespace first using kubectl delete? If it's the latter, then this is expected behaviour, since deleting the namespace will add a finalizer and remove its resources in an arbitrary order, causing helm to be unable to run the uninstall successfully (as the resources it should be managing would no longer be present).
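To make the distinction concrete, the two deletion paths look like this (the resource kind and names are illustrative):

# Deleting the CR lets the operator's finalizer run helm uninstall, which
# removes the chart's resources in helm's kind order:
$ kubectl delete press foo -n ns2
# Deleting the namespace directly removes its contents in an arbitrary order,
# so a later helm uninstall finds the resources it manages already gone:
$ kubectl delete namespace ns2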
This could also be a general helm issue (should remove namespaces last)
If we remove the CR and leave helm uninstall to do its work, then everything should work as expected. The uninstall order defined in helm places Namespace nearly last (just before PriorityClass): https://github.com/helm/helm/blob/f902947cb18ec374005390bee097a51221e07140/pkg/releaseutil/kind_sorter.go#L108.
I would like us to dig deeper into the requirement for adding the annotation to ignore failures, in case it is something caused by an unexpected user action that hinders helm's intended uninstall process.
@cuppett could you elaborate on:
The delete of the namespace removes many other objects the helm operator is expecting to be able to remove.
Are you deleting the CR entirely, or deleting the namespace first using kubectl delete? If it's the latter, then this is expected behaviour, since deleting the namespace will add a finalizer and remove its resources in an arbitrary order, causing helm to be unable to run the uninstall successfully (as the resources it should be managing would no longer be present).
Deleting the entire CR. It has a namespace, some objects that go in the namespace (Deployments, ConfigMaps), and other CRs.
This could also be a general helm issue (should remove namespaces last)
If we remove the CR and leave helm uninstall to do its work, then everything should work as expected. The uninstall order defined in helm places Namespace nearly last (just before PriorityClass): https://github.com/helm/helm/blob/f902947cb18ec374005390bee097a51221e07140/pkg/releaseutil/kind_sorter.go#L108.
For the most part, this does work well and seems to sequence and exit out cleanly. I've only noticed a few failures here and there. The workaround is to simply remove the helm-operator finalizer, and then GC wipes out the last object (the CR itself).
I would like us to dig deeper into the requirement for adding the annotation to ignore failures, in case it is something caused by an unexpected user action that hinders helm's intended uninstall process.
Yes, with the info you provided, I can try to pin down the objects and the sequence, and follow up with another issue against helm or the SDK. The example was just my specific issue du jour; it could have been many different things.
I think the general utility of the proposed annotations supersedes whatever specific issue a user might have (and that's why I created this request).
@cuppett I see, thanks for the explanation. If the CR is deleted directly, then there should be a full cleanup eventually; if not, this could be a bug on helm's or the SDK's side (I'm tempted to say helm, because we offload the work to helm uninstall, which should clean up the resources and let us remove the CR once the finalizer has run successfully, but that is the ideal-world scenario). Could you share your project, if possible, so that we could replicate this issue and dig into it more?
Could you share your project, if possible, so that we could replicate this issue and dig into it more?
Yes, I'll bifurcate and create a separate issue. Let me try to recreate it and capture the log. By the time I saw it, the controller pod had restarted, the helm secret was gone, all the objects in the NS were gone, and only the CR and the parent namespace were still there. I didn't do a deep probe to see if other CRs or anything else could still be there gumming it up. Removing the finalizer and manually removing the namespace was enough to mop up.
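For reference, the manual mop-up described above amounts to something like this (the resource kind and names are illustrative; clearing the finalizer intentionally skips the operator's cleanup and orphans whatever is left behind):

$ kubectl patch press foo -n ns2 --type=merge -p '{"metadata":{"finalizers":[]}}'
$ kubectl delete namespace ns2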
FWIW, I was able to get back to trying to diagnose the uninstall issue. With 1.32.0, it still fails to uninstall with:
"errorVerbose":"failed to delete release: helm.sh/helm/v3/pkg/action.(*Uninstall).Run\n\t/go/pkg/mod/helm.sh/helm/[email protected]/pkg/action/uninstall.go:118\ngithub.com/operator-framework/operator-sdk/internal/helm/release.manager.UninstallRelease\n\t/workspace/internal/helm/release/manager.go:397\ngithub.com/operator-framework/operator-sdk/internal/helm/controller.HelmOperatorReconciler.Reconcile\n\t/workspace/internal/helm/controller/reconcile.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1172","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235"
However, with a newer helm, I'm able to remove the release from the CLI:
$ helm version
version.BuildInfo{Version:"v3.13.2", GitCommit:"2a2fb3b98829f1e0be6fb18af2f6599e0f4e8243", GitTreeState:"clean", GoVersion:"go1.20.10"}
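The manual removal itself was just a plain helm uninstall of the stuck release (the release name and namespace are taken from the error log above):

$ helm uninstall foo -n ns2
$ helm list -n ns2    # confirms the release is no longer listed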
Once the release is removed via the CLI, the operator notices the release is gone and then removes the finalizer, so the object gets removed. It may have been an issue with helm itself that they've since resolved.
I noticed that in the branch we're on helm 3.12 (for the next release). I can observe again once that is released, at either 3.12 or 3.13, and track that as a separate issue if it turns out to be unique (separate from the request here for the annotation).
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
@openshift-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.