kapp-controller icon indicating copy to clipboard operation
kapp-controller copied to clipboard

App gets stuck deleting if service account is not created

Open benmoss opened this issue 3 years ago • 15 comments

What steps did you take: Installed the "simple-app" from the instructions without first creating the service account for it. https://github.com/vmware-tanzu/carvel-kapp-controller/blob/develop/docs/walkthrough.md

Tried to delete the app via kapp -a simple-app delete and see it hang/timeout.

What happened:

$ k get app
NAME         DESCRIPTION                                                                                         SINCE-DEPLOY   AGE
simple-app   Delete failed: Preparing kapp: Getting service account: serviceaccounts "default-ns-sa" not found   3s             56m

What did you expect: Delete should succeed

Anything else you would like to add:

Environment:

  • kapp Controller version : v0.16.0
  • Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-21T01:11:42Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

benmoss avatar Feb 25 '21 15:02 benmoss

I have also run into the issue described above in a similar scenario where I have used serviceAccount instead of serviceAccountName. This is definitely something we should look into prioritizing.

Could be similar to #69 but can dig a bit further to see if the issue itself has different nuance to it.

danielhelfand avatar Feb 25 '21 15:02 danielhelfand

Related to #71

vibhas avatar Mar 15 '21 16:03 vibhas

Adding Explore to define/propose a solution.

Reasonable solutions could be:

  • improving the error message that's returned to the user
  • providing documentation about this

aaronshurley avatar Mar 22 '21 22:03 aaronshurley

related: #34

ewrenn8 avatar Mar 24 '21 18:03 ewrenn8

Hey @benmoss,

After some discussion, we've decided the best thing we could do here is to update the error message with a bit more info. Since kapp-controller will be deleting resources, we need a service account that has the roles needed to do the deletion, so that we know it is allowed from an rbac standpoint. This means if the service account can't be found, kapp-controller won't be able to delete things safely.

Going forward, the easiest remedy in this case is to create the service account again. We will be sure to call this information out in the error message when we update it.

Thanks! Eli

ewrenn8 avatar Apr 01 '21 18:04 ewrenn8

We will also explicitly document how to resolve this issue in kapp-controller docs.

As part of documenting how to address this issue, something we will need to call out is what are minimum rbac permissions needed for the serviceaccount to resolve this issue so a user can proceed with deleting the App.

danielhelfand avatar Apr 01 '21 18:04 danielhelfand

If I had to guess why it's not straightforward:

Since it never had the service account in the first place it didn't create anything, but the tricky bit is knowing that for certain. You might also be in a situation where the service account existed, resources were created, and then the service account was deleted, and so knowing the difference would mean caching that state of "i created x, y, z" on the app itself?

benmoss avatar Apr 01 '21 20:04 benmoss

Since it never had the service account in the first place it didn't create anything, but the tricky bit is knowing that for certain.

ah, this is a good case to address.

cppforlife avatar Apr 02 '21 14:04 cppforlife

I will update this issue to call out fixing that specific case and create another one for improving docs or error messages for the case resources do exist.

Problem statement

Kapp controller currently doesn't distinguish between deletes that will delete resources and deletes that just require deleting the App CR. This means, when a service account is missing, kctrl refuses to perform a delete in both cases, even though the former is the only one that requires a service account.

Desired State

kapp controller is able to delete apps that have not had any resources deployed without a service account being present.

Use Case

  1. Deploy kapp-controller
  2. Skip RBAC setup
  3. Deploy simple app example
  4. Verify deploy failed before applying resources
  5. delete the simple app
  6. see that it deletes successfully

ewrenn8 avatar Apr 05 '21 17:04 ewrenn8

We saw another version of this bug when trying to delete a namespace that contains a kapp app and a service account that it refers to.

The whole mess hangs forever after the service account gets deleted before the kapp finalizer can run.

julian-hj avatar Dec 02 '22 00:12 julian-hj

+1 to @julian-hj ... Can this be bumped in priority? This is a serious blocker for customers working at scale. It's not going to be feasible at scale for operators to manually inspect each namespace and delete all kaap apps before deleting the namespace itself.

heyjcollins avatar Mar 03 '23 17:03 heyjcollins

One solution here might be to add the finalizer (which is what's blocking deletion) only after verifying that the service account exists and has access on the first reconciliation.

In combination with #1132, this could introduce the following model:

  • Create an App. Until it has has created at least one child resource, it can simply be deleted.
  • On first reconciliation / install, add a finalizer to the App resource immediately before writing other resources to the cluster, but after validating the service account's read access to the cluster.
    • At the same time, add an ownerReference with blockOwnerDeletion to the ServiceAccount used by the App.†
  • On deletion, the serviceAccount will still be valid (possibly with a deletionTimestamp) when the App is deleted.
  • When changing spec.serviceAccountName, the new ServiceAccount will need to be added as an ownerReference at the same time the old service account is removed. Note that if the new serviceAccountName does not point to a valid service account, we could have the same problem.
  • It's not clear to me what should be done with respect to ownerReferences when spec.serviceAccountName is empty.

† Note that adding this ownerReference may change an App from being an independent object to an object subject to garbage collection, i.e. if the ServiceAccount is deleted. If the App is part of some larger constellation like a PackageInstall, it will have two ownerReferences, and both will need to be deleted.

evankanderson avatar Mar 09 '23 17:03 evankanderson

What is the best way to resolve this once you get into this stat? You can't recreate the service account because the namespace is Terminating so I am not sure how you actually delete the apps to allow the namespace to be deleted.

ryanjbaxter avatar Sep 01 '23 16:09 ryanjbaxter

@ryanjbaxter If you are already in this state, then you can remove the finalizers on the App CR. This might have side effect, if there were any resources created outside the namespace by the App, then those would still remain.

praveenrewar avatar Oct 18 '23 20:10 praveenrewar

So ... I got this wedged pretty good just now when a tanzu package install of cert manager failed on a vCenter 8 TKG 2.2 cluster. Not totally sure that is supposed to work -- but I couldn't edit the finalizers on the App CR because kubectl claimed it couldn't find the app (even though it was letting me edit it). I finally had to issue a "create namespace cert-manager" and THEN kubectl let me edit the app and remove the finalizers. I also had to remove some remaining cert manager service account stuff (clusterrole, maybe some other cluster-scoped stuff) by hand. n.b., you have to "replace" finalizers with an empty array -- just deleting the key will leave the original value there and it will stay stuck.

I'm fairly new to k8s, but this is definitely one of the trickier situations I've had to try to get myself out of. Any additional pointers or automated things that could have been done would have helped me.

EDIT: by the way, for others experiencing this problem, I'll share my root issue as a possible lead (as someone new to k8s, a hint like this would've help me a lot). Deep under the hood, (i.e., PackageInstall -> App -> ... -> ReplicaSet -> ) pods were failing to spin up in the cert-manager namespace due to lack of permissions (the namespace did not allow privileged pods). Instead of trying to "delete" my way out at the namespace level, I gather the "right" way to fix would have been to update the namespace to allow privileged pods (depends on your k8s version, but for me it required adding a label) and then retrying the pod deploy. I think this would have happened automatically on some cadence, but in my case, I was able to speed it up by deleting the failing ReplicaSets, which were then immediately re-created by the parent App and then, since the namespace now had the right permissions, the pods deployed within seconds. Again, being new to k8s, deleting this resource to "retry" seemed pretty scary to me (am I going to break something?), but might be an obvious step to someone more experienced. Anything the tool could have done to give me more hints around figuring this out would have been helpful. e.g., maybe errors from pods could be rolled up somehow and displayed at time of failure vs. me having to "know" where to look for the error I need to fix (some of this isn't really in kapp's domain, to be fair).

ragaskar avatar Jan 13 '24 04:01 ragaskar