operator icon indicating copy to clipboard operation
operator copied to clipboard

Remove finalizer in controller

Open ostromart opened this issue 4 years ago • 8 comments

All the finalizer does is keep ICP CR around until other resources have been deleted. This is nice to have, unfortunately it's prone to races when the operator itself is deleted, since then there's nothing to remove the finalizer. Then we are stuck with a namespace that can't be easily deleted. Removing this for now since this is more of a nice to have than essential feature.

ostromart avatar Dec 02 '19 20:12 ostromart

@ostromart: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
lint_operator 63c78219d28b156ec60784da833d2bab069bb8ff link /test lint_operator

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

istio-testing avatar Dec 02 '19 20:12 istio-testing

@ostromart can you fix the linting so we can get this work merged for master/1.4.3?

Cheers -steve

sdake avatar Dec 09 '19 04:12 sdake

Cluster scoped resources and resources in other namespaces cannot have owner references, which is why the finalizer would be necessary.

rcernich avatar Dec 11 '19 21:12 rcernich

Folks, thanks for your input and patience with this PR. We've spent a lot of cycles debating and I think in the process our understanding of the situation has improved a great deal (at least mine has). I had some additional discussion with @ayj to see if there was some other way we could work around the stuck finalizer problem. He suggested a preStop hook which indeed I think could work but it would be at the cost of additional complexity. One thing that's clear to me now is that the PR will not work in the current state because deletion timestamp is not created without a finalizer. So at least a change is needed to test for the non-existence of any CR in a watched namespace and treat that as a delete event, rather than using deletion timestamp. But if I add that change I think this is a simpler solution than adding a preStop hook. wdyt?

ostromart avatar Dec 12 '19 00:12 ostromart

Could you elaborate on the "stuck finalizer" problem? In the original comment, you say that the resource remains when the operator is deleted. The simple answer is: don't delete the operator.

I also wonder if this is mainly an issue when the ICP resource is collocated with the operator and the operator namespace itself is deleted (the only workaround for this use case is to manually remove the finalizer from the resource). The solution here is: don't install a control plane in the operator namespace. This is something that shouldn't be done anyway, as the operator namespace should not be accessible to regular users, because the operator is running with elevated privileges.

To clarify the operator usage model:

The operator performs tasks that a cluster-admin (or some other user with elevated privileges) would normally perform. For us, this is installing and configuring a control plane based on an ICP resource.

The operator is installed by cluster-admin and should not be accessible to other users. When using something like operator lifecycle manager (OLM), cluster-admin can approve which operators can be installed by users, allowing individual users to install and update operators. This allows a regular user to install the operator, without having cluster-admin privileges (OLM is actually performing the install/management of the operator).

The operator's lifecycle extends beyond the lifecycle of any given control plane. Currently, this may seem a little confusing, as we only support installation of a single control plane, but once multi-tenancy is supported it will make more sense.

To wrap up:

  • The operator should not be deleted while any ICP resources exist.
  • ICP resources should not be created in the same namespace as the operator.
  • Only cluster-admin should have access to the operator namespace.
  • Using OLM simplifies this entire process.

rcernich avatar Dec 12 '19 16:12 rcernich

The simple answer is: don't delete the operator.

We don't have that option. Users tend to not follow instructions which is clearly illustrated by the fact that this issue has happened several times leaving users very frustrated. Users will tolerate not everything being cleaned up and having to remove some stuff manually, but they will get very upset about being left with a namespace they can't delete. It's particularly bad because it's close to impossible for an Istio user to know that a finalizer on ICP is blocking their namespace deletion. This issue is a must fix for 1.5.

I also wonder if this is mainly an issue when the ICP resource is collocated with the operator and the operator namespace itself.

The same thing happens with ICP in istio-system. The issue is that if the operator is deleted before it's removed the finalizer, there's nothing left to remove that finalizer and the namespace is stuck.

Only cluster-admin should have access to the operator namespace.

In many cases the user is the cluster admin. Multi-tenant is an important use case but not the main Istio use case.

Using OLM simplifies this entire process.

There are no plans to recommend OLM for Istio OSS project.

If someone wants to take on implementing a fix using preStop hook or propose some other way to fix this issue I'm completely open to it. However I strongly disagree that doing nothing and leaving things as they are leads to an acceptable user experience.

ostromart avatar Dec 12 '19 19:12 ostromart

Don't delete the operator is a requirement if you're using the operator.

If you're talking about a single user, who is doing everything as cluster-admin, with a single install, no multi-tenancy, and they only care about install/delete, then there's not really any reason to use the operator.

However, once the operator becomes more functional (e.g. managing certs used by webhooks), the operator is a requirement for a control plane to function normally. Deleting the operator would then require a human operator to take over the tasks that were being managed by the automated operator. That means updating the keys/certs used by the webhooks, cleaning up the control plane resources, removing the finalizer, etc. This is true whether OLM is used or not.

If this is an issue, then it should be highlighted in the documentation. The documentation should also state that the operator does more than facilitate installs (once it does more).

JFYI, the reason I mentioned cluster-admin at all is that I think it's important to clarify the scope required to use various components. For a dev testing this out with minikube, they'll have no problems. However, if they're using a managed cluster, whoever is managing that cluster will care, so it's important to specify what permissions are required for the various aspects (cluster-admin to install/manage the operator, while general user can install and manage the control plane). FWIW, this issue, moving from full access dev env to production was mentioned in a talk at KubeCon. Everything worked fine on dev machines, but the higher level security in production prevented installation of their control plane, let alone the applications in the mesh. One of the disadvantages of doing all your dev with cluster-admin.

rcernich avatar Dec 12 '19 20:12 rcernich

@ostromart: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

istio-testing avatar Feb 12 '20 17:02 istio-testing