kapp-controller
kapp-controller copied to clipboard
App gets stuck deleting if service account is not created
What steps did you take: Installed the "simple-app" from the instructions without first creating the service account for it. https://github.com/vmware-tanzu/carvel-kapp-controller/blob/develop/docs/walkthrough.md
Tried to delete the app via kapp -a simple-app delete
and see it hang/timeout.
What happened:
$ k get app
NAME DESCRIPTION SINCE-DEPLOY AGE
simple-app Delete failed: Preparing kapp: Getting service account: serviceaccounts "default-ns-sa" not found 3s 56m
What did you expect: Delete should succeed
Anything else you would like to add:
Environment:
- kapp Controller version : v0.16.0
- Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-21T01:11:42Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
I have also run into the issue described above in a similar scenario where I have used serviceAccount
instead of serviceAccountName
. This is definitely something we should look into prioritizing.
Could be similar to #69 but can dig a bit further to see if the issue itself has different nuance to it.
Related to #71
Adding Explore
to define/propose a solution.
Reasonable solutions could be:
- improving the error message that's returned to the user
- providing documentation about this
related: #34
Hey @benmoss,
After some discussion, we've decided the best thing we could do here is to update the error message with a bit more info. Since kapp-controller will be deleting resources, we need a service account that has the roles needed to do the deletion, so that we know it is allowed from an rbac standpoint. This means if the service account can't be found, kapp-controller won't be able to delete things safely.
Going forward, the easiest remedy in this case is to create the service account again. We will be sure to call this information out in the error message when we update it.
Thanks! Eli
We will also explicitly document how to resolve this issue in kapp-controller docs.
As part of documenting how to address this issue, something we will need to call out is what are minimum rbac permissions needed for the serviceaccount to resolve this issue so a user can proceed with deleting the App.
If I had to guess why it's not straightforward:
Since it never had the service account in the first place it didn't create anything, but the tricky bit is knowing that for certain. You might also be in a situation where the service account existed, resources were created, and then the service account was deleted, and so knowing the difference would mean caching that state of "i created x, y, z" on the app itself?
Since it never had the service account in the first place it didn't create anything, but the tricky bit is knowing that for certain.
ah, this is a good case to address.
I will update this issue to call out fixing that specific case and create another one for improving docs or error messages for the case resources do exist.
Problem statement
Kapp controller currently doesn't distinguish between deletes that will delete resources and deletes that just require deleting the App CR. This means, when a service account is missing, kctrl refuses to perform a delete in both cases, even though the former is the only one that requires a service account.
Desired State
kapp controller is able to delete apps that have not had any resources deployed without a service account being present.
Use Case
- Deploy kapp-controller
- Skip RBAC setup
- Deploy simple app example
- Verify deploy failed before applying resources
- delete the simple app
- see that it deletes successfully
We saw another version of this bug when trying to delete a namespace that contains a kapp app and a service account that it refers to.
The whole mess hangs forever after the service account gets deleted before the kapp finalizer can run.
+1 to @julian-hj ... Can this be bumped in priority? This is a serious blocker for customers working at scale. It's not going to be feasible at scale for operators to manually inspect each namespace and delete all kaap apps before deleting the namespace itself.
One solution here might be to add the finalizer
(which is what's blocking deletion) only after verifying that the service account exists and has access on the first reconciliation.
In combination with #1132, this could introduce the following model:
- Create an
App
. Until it has has created at least one child resource, it can simply be deleted. - On first reconciliation / install, add a
finalizer
to theApp
resource immediately before writing other resources to the cluster, but after validating the service account's read access to the cluster.- At the same time, add an
ownerReference
withblockOwnerDeletion
to the ServiceAccount used by the App.†
- At the same time, add an
- On deletion, the
serviceAccount
will still be valid (possibly with adeletionTimestamp
) when theApp
is deleted. - When changing
spec.serviceAccountName
, the new ServiceAccount will need to be added as anownerReference
at the same time the old service account is removed. Note that if the newserviceAccountName
does not point to a valid service account, we could have the same problem. - It's not clear to me what should be done with respect to
ownerReferences
whenspec.serviceAccountName
is empty.
† Note that adding this ownerReference
may change an App
from being an independent object to an object subject to garbage collection, i.e. if the ServiceAccount is deleted. If the App
is part of some larger constellation like a PackageInstall
, it will have two ownerReference
s, and both will need to be deleted.
What is the best way to resolve this once you get into this stat? You can't recreate the service account because the namespace is Terminating
so I am not sure how you actually delete the apps to allow the namespace to be deleted.
@ryanjbaxter If you are already in this state, then you can remove the finalizers on the App CR. This might have side effect, if there were any resources created outside the namespace by the App, then those would still remain.
So ... I got this wedged pretty good just now when a tanzu package install of cert manager failed on a vCenter 8 TKG 2.2 cluster. Not totally sure that is supposed to work -- but I couldn't edit the finalizers on the App CR because kubectl claimed it couldn't find the app (even though it was letting me edit it). I finally had to issue a "create namespace cert-manager" and THEN kubectl let me edit the app and remove the finalizers. I also had to remove some remaining cert manager service account stuff (clusterrole, maybe some other cluster-scoped stuff) by hand. n.b., you have to "replace" finalizers with an empty array -- just deleting the key will leave the original value there and it will stay stuck.
I'm fairly new to k8s, but this is definitely one of the trickier situations I've had to try to get myself out of. Any additional pointers or automated things that could have been done would have helped me.
EDIT: by the way, for others experiencing this problem, I'll share my root issue as a possible lead (as someone new to k8s, a hint like this would've help me a lot). Deep under the hood, (i.e., PackageInstall -> App -> ... -> ReplicaSet -> ) pods were failing to spin up in the cert-manager
namespace due to lack of permissions (the namespace did not allow privileged pods). Instead of trying to "delete" my way out at the namespace level, I gather the "right" way to fix would have been to update the namespace to allow privileged pods (depends on your k8s version, but for me it required adding a label) and then retrying the pod deploy. I think this would have happened automatically on some cadence, but in my case, I was able to speed it up by deleting the failing ReplicaSets, which were then immediately re-created by the parent App and then, since the namespace now had the right permissions, the pods deployed within seconds. Again, being new to k8s, deleting this resource to "retry" seemed pretty scary to me (am I going to break something?), but might be an obvious step to someone more experienced. Anything the tool could have done to give me more hints around figuring this out would have been helpful. e.g., maybe errors from pods could be rolled up somehow and displayed at time of failure vs. me having to "know" where to look for the error I need to fix (some of this isn't really in kapp's domain, to be fair).