
CSV service account cluster permissions revoked too early

Open skonto opened this issue 4 years ago • 9 comments

Bug Report

What did you do?

I am experimenting with the OLM-based operator shutdown/cleanup process. A runnable is added to the controller-runtime manager (used to implement the operator) that also requires leader election. The runnable blocks until the manager receives a SIGTERM (the manager signals this by cancelling the context) and then deletes an external resource that lives in a different namespace than the operator. To manage that external resource it calls the K8s apiserver using the operator's service account, which has the necessary permissions. However, when the related operator CSV is deleted (teardown) and the manager waits for the runnable to finish before the pod is terminated, I get the following error (the last thing seen in the logs):

{"level":"info","ts":"2021-07-05T19:21:03.723Z","logger":"cmd","msg":"Deleting the Knative health dashboard"}
{"level":"error","ts":"2021-07-05T19:21:03.726Z","logger":"cmd","msg":"Deleting the health dashboard failed","error":"configmaps \"grafana-dashboard-definition-knative-health\" is forbidden: User \"system:serviceaccount:openshift-serverless:knative-operator\" cannot delete resource \"configmaps\" in API group \"\" in the namespace \"openshift-config-managed\"","stacktrace":"main.cleanupRunnable.Start\n\t/go/src/github.com/openshift-knative/serverless-operator/knative-operator/cmd/manager/main.go:155\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).startRunnable.func1\n\t/go/src/github.com/openshift-knative/serverless-operator/vendor/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:681"}
E0705 19:21:03.981627       1 leaderelection.go:325] error retrieving resource lock openshift-serverless/knative-serving-openshift-lock: configmaps "knative-serving-openshift-lock" is forbidden: User "system:serviceaccount:openshift-serverless:knative-operator" cannot get resource "configmaps" in API group "" in the namespace "openshift-serverless"
I0705 19:21:04.223592       1 leaderelection.go:278] failed to renew lease openshift-serverless/knative-serving-openshift-lock: timed out waiting for the condition
{"level":"error","ts":"2021-07-05T19:21:04.223Z","logger":"controller-runtime.manager","msg":"error received after stop sequence was engaged","error":"context canceled"}
{"level":"error","ts":"2021-07-05T19:21:04.223Z","logger":"controller-runtime.manager","msg":"error received after stop sequence was engaged","error":"leader election lost"}
rpc error: code = NotFound desc = could not find container "71710bf12ffe7dd76be0be24bb3bafc5f05dd07d5d09a8acfc6e2d48c2e520de": container with ID starting with 71710bf12ffe7dd76be0be24bb3bafc5f05dd07d5d09a8acfc6e2d48c2e520de not found: ID does not exist
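For context, the cleanup runnable is roughly of this shape (a minimal sketch based on the stack trace above, assuming a standard controller-runtime manager.Runnable; the field names and client wiring are illustrative, not the actual serverless-operator code):

```go
package cleanup

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// cleanupRunnable is added to the controller-runtime manager with mgr.Add().
// It holds a client that authenticates with the operator's service account.
type cleanupRunnable struct {
	client client.Client
}

// NeedLeaderElection tells the manager to start this runnable only on the
// elected leader.
func (c *cleanupRunnable) NeedLeaderElection() bool { return true }

// Start blocks until the manager's context is cancelled (SIGTERM), then tries
// to delete the external dashboard ConfigMap in another namespace.
func (c *cleanupRunnable) Start(ctx context.Context) error {
	<-ctx.Done() // manager is shutting down

	cm := &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{
		Name:      "grafana-dashboard-definition-knative-health",
		Namespace: "openshift-config-managed",
	}}
	// The pod is still running at this point, but OLM may already have
	// garbage-collected the RBAC granted via the CSV, so this call can fail
	// with the "forbidden" error shown in the logs above.
	return c.client.Delete(context.Background(), cm)
}
```

The manager only finishes shutting down after Start returns, which is why the forbidden error is the last application log line before the leader-election failures.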

What did you expect to see?

Cluster permissions not being revoked, since they might be needed by the operator during a graceful shutdown. If I just restart the pod, the runnable does what it is expected to do.

What did you see instead? Under which circumstances?

Resources such as the operator's RBAC permissions being removed too early.

Environment

  • operator-lifecycle-manager version: I am on OCP 4.7.12.

  • Kubernetes version information:

I am on OCP 4.7.12.

  • Kubernetes cluster kind:

Possible Solution

Additional context

skonto · Jul 05 '21 21:07

Is the issue that OLM removes the operator pod and the RBAC associated with the operator before the pod has finished terminating, which causes the errors you are seeing above in your operator logs? In that case, OLM should not GC those resources before the pod has been removed. This indicates an issue with the CSV GC logic in the OLM operator.

Currently there is no ordering of child object removal when the CSV is deleted. We rely on the Kubernetes garbage collector to remove all dependent objects once the CSV is gone, so the ordering is non-deterministic.

There are several potential solutions here and we will continue to triage further.

exdx · Jul 08 '21 14:07

@exdx yes afaik that is the problem.

skonto · Aug 02 '21 12:08

@timflannagan is going to create an RFE on the OF Proposal Board that focuses on improving OLM so that it does not delete resources in a random order.

awgreene · Aug 05 '21 14:08

I am removing this from the 0.20.0 release as it likely will not make it in that quickly.

awgreene · Aug 05 '21 14:08

@awgreene @timflannagan did the RFE make it to the proposal board yet? This seems like a fundamental requirement for ordered tear-down of an operator.

dmesser · Nov 26 '21 15:11

In the meantime, a workaround that I suggested in today's working group meeting is that you could set a finalizer on the RBAC roles/rolebindings your operator needs to exist during its cleanup.

Your operator would need permission to set and remove those finalizers, which you could grant with this RBAC rule.

```yaml
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - roles/finalizers
  - rolebindings/finalizers
  - clusterroles/finalizers
  - clusterrolebindings/finalizers
  verbs:
  - update
```
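To make the workaround concrete, here is a minimal sketch, assuming controller-runtime, of how an operator could set and later clear such a finalizer on a Role it needs during cleanup (the finalizer name and helper functions are placeholders, not an OLM API):

```go
package cleanup

import (
	"context"

	rbacv1 "k8s.io/api/rbac/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// finalizerName is a placeholder; use something specific to your operator.
const finalizerName = "example.com/block-rbac-deletion"

// protectRole adds the finalizer to a Role the operator needs during cleanup,
// so the Kubernetes garbage collector cannot remove the Role while the
// finalizer is present.
func protectRole(ctx context.Context, c client.Client, ns, name string) error {
	role := &rbacv1.Role{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: ns, Name: name}, role); err != nil {
		return err
	}
	if !controllerutil.ContainsFinalizer(role, finalizerName) {
		controllerutil.AddFinalizer(role, finalizerName)
		return c.Update(ctx, role)
	}
	return nil
}

// releaseRole removes the finalizer once cleanup has finished, letting the
// pending deletion of the Role complete.
func releaseRole(ctx context.Context, c client.Client, ns, name string) error {
	role := &rbacv1.Role{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: ns, Name: name}, role); err != nil {
		return err
	}
	if controllerutil.ContainsFinalizer(role, finalizerName) {
		controllerutil.RemoveFinalizer(role, finalizerName)
		return c.Update(ctx, role)
	}
	return nil
}
```

The same pattern applies to the RoleBinding and the cluster-scoped variants; the cleanup runnable would call the release step only after its external cleanup has finished.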

Obviously this grants your operator otherwise unnecessary permissions, which if abused (e.g. via a vulnerability) could enable an attacker to add finalizers to all (cluster)role(binding)s and thus block their deletion.

An even better solution (which is still in an unmerged Kubernetes EP) would be for OLM to apply the "in use" protection mechanism to prevent deletion of resources used by the operator deployment: https://github.com/kubernetes/enhancements/blob/b873c9966236ca3eace2ce16666b342255e50a6e/keps/sig-api-machinery/2839-in-use-protection/README.md

Longer term, I think the answer is that OLM needs to become less opinionated about the mechanics of install/uninstall and give more power and flexibility to operator authors. For example, we are investigating support for a plain manifest bundle format in which operator authors can directly include static manifests that could contain finalizers and in-use protection mechanisms they need to ensure an orderly deletion.

joelanford · Feb 10 '22 22:02

I will try the workaround and see how it goes.

skonto · Feb 11 '22 11:02

We're hitting the same issue in the sriov-operator. We are trying to clean up the environment when the operator is removed, but the required permissions are getting removed before our pods finish the cleanup. The suggested fix will unfortunately not solve the issue completely. It will allow the cleanup to happen, but at the end we'll be left with a role and a rolebinding from which we still have to remove the finalizers. We have to remove the finalizer from one resource first (the rolebinding), and then we no longer have the permissions to remove it from the other (the role), because the rolebinding is gone. Please let me know if I'm missing something here.

mmirecki · Feb 23 '22 13:02

I tried out the above solution, and it looks like it behaves just as described. We lose the permissions right after removing either the role or the rolebinding.

mmirecki · Feb 24 '22 13:02