kapp-controller
kapp-controller won't delete an application without valid service account or kubeconfig
What steps did you take:
Create an application that is meant to be deployed to a remote cluster, and make a mistake that prevents the deployment, for example by mistyping the kubeconfig secret name. This simple app will do it:
apiVersion: kappctrl.k14s.io/v1alpha1
kind: App
metadata:
  name: simple-app
  namespace: default
spec:
  cluster:
    namespace: carvel-apps
    kubeconfigSecretRef:
      name: make-sure-this-does-not-exist
      key: value
  fetch:
  - git:
      url: https://github.com/vmware-tanzu/carvel-simple-app-on-kubernetes
      ref: origin/develop
      subPath: config-step-2-template
  template:
  - ytt: {}
  deploy:
  - kapp: {}
Try to deploy that application with kubectl:
kubectl apply -f simple-app.yml
The application will enter a reconciliation failure state. Finally, try to force delete that application:
kubectl delete app.kappctrl.k14s.io/simple-app --force
The delete command will hang forever, or until the application is manually fixed so that it reconciles.
What happened:
Under certain circumstances, when reconciliation fails, kapp-controller will ignore requests to delete the application, even if those requests explicitly force the deletion. kapp-controller always expects the operator to fix the application first, and only then proceeds with the deletion. It's important to note that this behavior does not occur with every error. If, for example, remote deployment fails because the destination namespace does not exist in the target cluster, then kapp-controller will fail, but it does not expect the user to fix the problem and delete commands will succeed.
What did you expect:
We would expect kapp-controller to delete the application when instructed to do so with the force flag. I think there might be an argument for keeping the existing behavior of blocking CR deletion in certain circumstances to prevent possible inconsistencies. However, when the force flag is explicitly passed, I would expect the controller to go ahead and delete. It is also important to note that in the very simple example pasted above, no resources have been deployed anywhere. kapp-controller could be smart enough to detect this and proceed with the regular deletion.
Anything else you would like to add:
This blocking-on-delete behavior is consistent with the declarative mindset, especially if we assume there is an operator who may eventually need to step in and manually resolve issues or applications that are somehow stuck. However, it does not fit well with the model of an automated deployment system that might deploy hundreds of Carvel applications every day and that needs to be resilient to issues it can't control, such as a user mistyping some configuration, tokens that are misconfigured or expired, changes in the target clusters, etc. Systems, by definition, will fail, and with the current design resources would start leaking until someone gets alerted and can manually get involved. By that moment there could be dozens or hundreds of apps that need to be manually reconciled.
I believe that adding a way to force deletion would alleviate the concerns about the use case above.
Vote on this request
This is an invitation to the community to vote on issues, to help us prioritize our backlog. Use the "smiley face" up to the right of this comment to vote.
👍 "I would like to see this addressed as soon as possible" 👎 "There are other more important things to focus on right now"
We are also happy to receive and review Pull Requests if you want to help work on this issue.
Related issue: https://github.com/vmware-tanzu/carvel-kapp-controller/issues/114
This issue is being marked as stale due to a long period of inactivity and will be closed in 5 days if there is no response.
Hi @mpermar - Thanks for this issue!
I've also wrestled with apps that didn't want to delete for various reasons and I agree it can be frustrating. I had a quick exchange with @cppforlife which I'll try to summarize:
- As you said, there are many different ways for things to fail. One concern around providing automation is that the same automation won't be correct for all failure modes, so the safest thing is to require human judgement for failure recovery.
- If you edit the App CR and set spec.noopDelete=true, that basically tells kapp-controller to delete this resource without trying to cascade or clean up other related resources, so it is a sort of declarative "force" flag for cases where the resources themselves have gotten into an inconsistent state.
- kubectl delete ... --force won't override finalizers, so there already isn't an absolute "just delete this no matter what" experience in the CLI.
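As a minimal sketch of the spec.noopDelete suggestion above, the App from the issue description could be edited as follows before deleting it. Only the noopDelete line is new; everything else is copied from the original report, and the exact placement of the field among the other spec keys is not significant.

```yaml
apiVersion: kappctrl.k14s.io/v1alpha1
kind: App
metadata:
  name: simple-app
  namespace: default
spec:
  # Tells kapp-controller to delete this App CR without trying to
  # clean up any resources it may have deployed.
  noopDelete: true
  cluster:
    namespace: carvel-apps
    kubeconfigSecretRef:
      name: make-sure-this-does-not-exist
      key: value
  fetch:
  - git:
      url: https://github.com/vmware-tanzu/carvel-simple-app-on-kubernetes
      ref: origin/develop
      subPath: config-step-2-template
  template:
  - ytt: {}
  deploy:
  - kapp: {}
```

Once the edited manifest is applied with kubectl apply -f simple-app.yml, a regular kubectl delete of the App should be able to complete. Note that anything the App had already deployed would be left behind, so this is best reserved for cases like this one, where nothing was deployed.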
Finally, when I tried your example the failure message I got was "simple-app Reconcile failed: Expected at least one template option", which did allow for a kubectl delete to succeed. Is there anything else about your environment I would need to reproduce successfully? I was running in a single minikube cluster without the carvel-apps namespace even created, so I think the cluster block may have been ignored entirely (?)
Finally, when I tried your example the failure message I got was "simple-app Reconcile failed: Expected at least one template option", which did allow for a kubectl delete to succeed.
My fault. I had omitted the template/deploy sections. I didn't know those were mandatory, but it became evident from looking at kapp-controller's source code. I have updated the code in the issue description and you should be able to reproduce it easily now.
Just a note that we've reproduced this again per the description. @100mik @renuy What's the priority/order you see for putting some kind of a --no-really-i-mean-it force-delete flag into the CLI?
This is not something we have prioritised right now, but since there is a recurring need for it, I believe we can definitely add it to the list of action items for the second milestone 🚀
This now seems to be possible with the new kapp-controller CLI:
bash-3.2$ kctrl app delete -a simple-app -n carvel-test --noop
Target cluster 'https://2907F46F8E6FA8E74DE1BAA5F0113004.gr7.us-east-1.eks.amazonaws.com' (nodes: ip-192-168-138-17.ec2.internal, 3+)
Deleting app 'simple-app' in namespace 'carvel-test'
Continue? [yN]: y
3:50:10PM: Ignoring associated resources for app 'simple-app' in namespace 'carvel-test'
3:50:10PM: Waiting for app deletion for 'simple-app'
3:50:11PM: Waiting for generation 5 to be observed
3:50:11PM: Waiting for generation 5 to be observed
3:50:11PM: Waiting for generation 5 to be observed
3:50:11PM: Waiting for generation 5 to be observed
3:50:11PM: App 'simple-app' in namespace 'carvel-test' deleted
As mentioned, this is now in the kctrl CLI. Closing this issue, thanks for all the input folks!
Question: should Carvel Apps set metadata.ownerReferences on the App to include the referenced ServiceAccount with blockOwnerDeletion to prevent deleting the Service Account before the Carvel App is deleted?
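To make the question concrete, here is a rough sketch of what such an owner reference might look like. It is not something kapp-controller is known to set today. The ServiceAccount name simple-app-sa and the uid value are hypothetical placeholders, and the fetch/template/deploy sections are borrowed from the example in the issue description.

```yaml
apiVersion: kappctrl.k14s.io/v1alpha1
kind: App
metadata:
  name: simple-app
  namespace: default
  ownerReferences:
  - apiVersion: v1
    kind: ServiceAccount
    name: simple-app-sa                       # hypothetical name
    uid: "REPLACE-WITH-SERVICEACCOUNT-UID"    # placeholder, must match the live object
    blockOwnerDeletion: true
spec:
  serviceAccountName: simple-app-sa           # hypothetical name, same ServiceAccount as above
  fetch:
  - git:
      url: https://github.com/vmware-tanzu/carvel-simple-app-on-kubernetes
      ref: origin/develop
      subPath: config-step-2-template
  template:
  - ytt: {}
  deploy:
  - kapp: {}
```

Two caveats apply to this approach. In Kubernetes, blockOwnerDeletion only holds back the owner when the owner is deleted with foreground cascading deletion. An owner reference also makes the App a dependent of the ServiceAccount, so deleting the ServiceAccount would trigger garbage collection of the App rather than simply being refused.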