
Umbrella issue for handling Informer Permission Errors and Probes

Open csviri opened this issue 3 years ago • 16 comments

This issue is to cover more related issues and come up with a design how to handle them, see:

  • [ ] #1419
  • [ ] #1405
  • [ ] #1412

Summary

Informers by definition do not work if the related RBAC is not configured or is misconfigured.

  1. Either when an operator starts up and the proper permissions are missing, or
  2. when it is running and some permissions are revoked.

This is true for informers of both primary and secondary resources.

When a permission is revoked, the Informer (from the fabric8 client) will retry a few times but eventually just stops. (TODO: verify this behavior, and whether it is configurable.)

Note, however, that the framework supports dynamically changing watched namespaces, so it can also happen that a new namespace is added for which there are no permissions. It might be desirable to keep the operator functioning on the previously watched namespaces.

There are multiple ways to solve this:

  1. Just stop/restart the operator when an informer does not start properly
  2. Try to reconnect the informer (indefinitely)
  3. Expose metrics regarding the health of informers, so readiness probes can be added that eventually kill the operator (a rough sketch of such a probe follows this list)
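A minimal sketch of what option 3 could look like: a plain HTTP endpoint that a Kubernetes readiness/liveness probe can call, reporting informer health. The `BooleanSupplier` standing in for "are all informers running" is hypothetical, not an existing framework API.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.function.BooleanSupplier;

public class InformerHealthProbe {

  // Hypothetical supplier that reports whether all registered informers are running.
  private final BooleanSupplier allInformersHealthy;

  public InformerHealthProbe(BooleanSupplier allInformersHealthy) {
    this.allInformersHealthy = allInformersHealthy;
  }

  public void start(int port) throws Exception {
    HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
    server.createContext("/readyz", exchange -> {
      boolean healthy = allInformersHealthy.getAsBoolean();
      byte[] body = (healthy ? "ok" : "informer not healthy").getBytes(StandardCharsets.UTF_8);
      // 200 keeps the pod ready; 503 fails the probe so Kubernetes can act on it.
      exchange.sendResponseHeaders(healthy ? 200 : 503, body.length);
      try (OutputStream os = exchange.getResponseBody()) {
        os.write(body);
      }
    });
    server.start();
  }
}
```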

@andreaTP proposed that having a callback function might be useful too.

csviri avatar Aug 25 '22 11:08 csviri

After thinking more about this issue: there is a serious problem when an informer is not working. The caches are not populated with fresh resources, so until the RBAC issue is solved they can basically be inconsistent. For example, if a new primary resource tries to create a secondary resource on reconciliation and the informer of that secondary resource is not working, the cache will not know about the created resource, so the reconciler will try to recreate it over and over again.

Therefore, in a first phase I would lean towards simply restarting the operator if there is such an error with the informer, maybe allowing users to configure some retry options for the informer before it stops.

There could be a feature flag to not automatically shut down the operator, but instead delegate this to probes and expose the health metrics of informers as mentioned in point 3.
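To illustrate the feature flag idea, a rough sketch (the flag name and the callback are made up, not the actual framework configuration):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class InformerErrorPolicy {

  // Hypothetical flag: if true, exit on informer errors so Kubernetes restarts the pod;
  // if false, only flip the health state consumed by probes/metrics.
  private final boolean stopOnInformerError;
  private final AtomicBoolean healthy = new AtomicBoolean(true);

  public InformerErrorPolicy(boolean stopOnInformerError) {
    this.stopOnInformerError = stopOnInformerError;
  }

  public void onInformerError(String informerName, Throwable error) {
    if (stopOnInformerError) {
      System.exit(1); // fail fast, delegate recovery to the pod restart
    } else {
      healthy.set(false); // leave the decision to readiness/liveness probes
    }
  }

  public boolean isHealthy() {
    return healthy.get();
  }
}
```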

csviri avatar Aug 25 '22 12:08 csviri

For the case of dynamically adding watched namespaces, there could be some optimizations, like reconciling a resource only when all the informers watching resources from its namespace are healthy. But these would be very specific cases; it is questionable whether such a feature is worth the effort.

csviri avatar Aug 25 '22 12:08 csviri

I think when the informer for a specific namespace cannot be created, we need to remove that namespace from the config and clear the caches.

The general problem with the restarting/current behaviour on errors is that malicious users (or simple user error) can cripple the operator for everyone in the cluster. If a user deletes a RoleBinding in their own namespace, the operator goes into a restart loop and there is nothing anyone else can do to fix it.

The whole cluster and its resources are affected this way, and restarting won't do any good.

gyfora avatar Aug 25 '22 12:08 gyfora

@gyfora So you are proposing something like: if a new namespace is added dynamically at runtime, and there is an informer (among possibly more new informers) that cannot be started because of permission issues, this operation should basically reject the new namespace (throw an exception; we could check with the Kubernetes API upfront whether the permission exists)?
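For reference, an upfront permission check could look roughly like this with the fabric8 client, using a `SelfSubjectAccessReview` (a sketch only; the exact authorization DSL calls should be verified against the client version in use):

```java
import io.fabric8.kubernetes.api.model.authorization.v1.SelfSubjectAccessReview;
import io.fabric8.kubernetes.api.model.authorization.v1.SelfSubjectAccessReviewBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;

public class PermissionChecker {

  // Asks the API server whether the operator's service account may perform
  // the given verb on the given resource in the given namespace.
  public static boolean canI(KubernetesClient client, String namespace,
      String group, String resource, String verb) {
    SelfSubjectAccessReview review = new SelfSubjectAccessReviewBuilder()
        .withNewSpec()
          .withNewResourceAttributes()
            .withNamespace(namespace)
            .withGroup(group)
            .withResource(resource)
            .withVerb(verb)
          .endResourceAttributes()
        .endSpec()
        .build();
    SelfSubjectAccessReview result =
        client.authorization().v1().selfSubjectAccessReview().create(review);
    return result.getStatus() != null && Boolean.TRUE.equals(result.getStatus().getAllowed());
  }
}
```

If such a check returns false for a newly added namespace, that namespace could be rejected before any informer is registered for it.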

How should we behave on startup? Do the same checks on the watched namespaces too, I guess. So, if I understand correctly, you are saying that we should validate and/or pre-process the configuration?

This could basically be done just for informers that follow the general namespace configuration. For the others (informers not following the namespaces in the primary controller config), I guess we should stop the operator too if they fail.

Do you agree that in the case when a permission is revoked, so that from that point on one of the informers won't work and therefore the caches won't function properly, we should restart the operator?

csviri avatar Aug 25 '22 13:08 csviri

However, validating the namespace list and related permissions, and filtering the namespace list, could be done outside of the operator.

csviri avatar Aug 25 '22 13:08 csviri

No, I am not really proposing anything about newly added namespaces.

What I am proposing is: if at any time any watched namespace becomes inaccessible, we should be able to remove it from the list, clean up the caches, and continue with the rest.

I think it would be best to have this at least as a configurable option (the default could be fail/restart the operator, as you proposed).

Basically we need to address the following situation robustly:

  1. watchedNamespaces: userNs1, userNs2
  2. User 1 decides to revoke permissions for userNs1
  3. User 2, using userNs2, should not be affected by the action of User 1, as they are not aware of each other
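A rough sketch of the "drop the broken namespace and keep going" handling proposed above; the `changeNamespaces` callback is a stand-in for whatever the framework offers to reconfigure the watched namespaces at runtime (JOSDK exposes something along the lines of `RegisteredController#changeNamespaces`, but treat the exact API as an assumption here):

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Consumer;

public class NamespaceDropper {

  // Removes a namespace that became inaccessible and hands the remaining set to
  // whatever reconfigures the watched namespaces, so tenants in the other
  // namespaces keep being reconciled.
  public static void dropNamespace(Set<String> currentNamespaces,
      String inaccessibleNamespace, Consumer<Set<String>> changeNamespaces) {
    Set<String> remaining = new LinkedHashSet<>(currentNamespaces);
    if (remaining.remove(inaccessibleNamespace)) {
      changeNamespaces.accept(remaining);
    }
  }
}
```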

gyfora avatar Aug 25 '22 13:08 gyfora

Unless this problem is addressed, there is a big risk for any operator deployment that watches multiple independent user namespaces. One user can basically send the whole operator into a restart loop; this is not something that many production platforms can afford.

gyfora avatar Aug 25 '22 13:08 gyfora

In some cases it is better to not watch a namespace anymore than to affect other namespaces being watched.

gyfora avatar Aug 25 '22 13:08 gyfora

However, validating the namespace list and related permissions, and filtering the namespace list, could be done outside of the operator.

Yes, I guess if there is a liveness probe to trigger a restart, then we as users of the framework could validate permissions. That would definitely be a step forward.

gyfora avatar Aug 25 '22 13:08 gyfora

OK, I understand what you mean. Good point.

However, consider a situation: somebody accidentally misconfigures the RBAC and revokes the permission for just a single resource in a namespace, while the other resources and related informers still work in that namespace. Should we remove all the informers? If not, then we have the problem mentioned above with stale caches.

But it can also happen that that particular namespace was a special one holding some configs for the operator (not a team namespace).

So what I mean is: there is currently no such concept in the operator's model as informers that are bound together, meaning that if one of them is stopped, the others should be stopped too.

There is a flag that means informers should follow the namespace changes of the controller. Theoretically we could say: stop all the informers that follow the controller's namespaces if one of them fails (on a namespace). (Or just stop propagating events from them and try to reconnect the one that is failing, then start all of them again once all are up and running.)

This would make sense to me for the multi-namespace mode. (Not for the whole-cluster or single-namespace mode; there I think we still need to stop.)
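A sketch of what such "bound" informers could look like for multi-namespace mode: a small gate that tracks per-informer health and only lets events through while every informer in the group is healthy (all names are made up for illustration):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class InformerGroupGate {

  // Health per informer (keyed by an informer id such as "secondary-resource/userNs1").
  private final Map<String, Boolean> informerHealth = new ConcurrentHashMap<>();

  public void markHealthy(String informerId) {
    informerHealth.put(informerId, true);
  }

  public void markFailed(String informerId) {
    informerHealth.put(informerId, false);
  }

  // Events from any informer of the group are only propagated while all of them are
  // healthy, so the reconciler never works against partially stale caches.
  public boolean shouldPropagateEvents() {
    return informerHealth.values().stream().allMatch(Boolean::booleanValue);
  }
}
```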

csviri avatar Aug 25 '22 13:08 csviri

The other alternative, which solves both issues: just retry starting the informer with some exponential backoff, and log a warning. The assumption is that the affected resources won't be reconciled anyway, since there will be no new events from that namespace.

And admins will eventually remove the revoked namespace from the configuration.
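A minimal sketch of the exponential backoff retry; the `tryStartInformer` supplier is hypothetical, standing in for whatever attempts to (re)start the failed informer:

```java
import java.time.Duration;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

public class InformerRestarter {

  private final ScheduledExecutorService scheduler;
  private final Duration initialDelay;
  private final Duration maxDelay;

  public InformerRestarter(ScheduledExecutorService scheduler,
      Duration initialDelay, Duration maxDelay) {
    this.scheduler = scheduler;
    this.initialDelay = initialDelay;
    this.maxDelay = maxDelay;
  }

  // Retries the (hypothetical) informer start until it succeeds,
  // doubling the delay after every failed attempt up to maxDelay.
  public void retryWithBackoff(BooleanSupplier tryStartInformer) {
    schedule(tryStartInformer, initialDelay);
  }

  private void schedule(BooleanSupplier tryStartInformer, Duration delay) {
    scheduler.schedule(() -> {
      if (!tryStartInformer.getAsBoolean()) {
        Duration doubled = delay.multipliedBy(2);
        Duration next = doubled.compareTo(maxDelay) > 0 ? maxDelay : doubled;
        System.err.println("Informer start failed, retrying in " + next);
        schedule(tryStartInformer, next);
      }
    }, delay.toMillis(), TimeUnit.MILLISECONDS);
  }
}
```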

csviri avatar Aug 25 '22 13:08 csviri

Yes, I think the exponential backoff + warning is a good approach. In that case the other informers/namespaces would not be affected.

gyfora avatar Aug 25 '22 14:08 gyfora

Yes, that will cover this case in a relatively simple way.

But I would put a feature flag there; by default the operator will exit if one of the informers is stopped.

And later we can improve this with metrics and readiness probes.

cc @andreaTP

csviri avatar Aug 25 '22 14:08 csviri

sounds great :)

gyfora avatar Aug 25 '22 14:08 gyfora

As discussed during the community meeting, here is all the PoC caching stuff I have been working on for how Go operators can handle dynamically changing permissions via RBAC:

everettraven avatar Aug 25 '22 15:08 everettraven

Another edge case that is relevant: https://github.com/fabric8io/kubernetes-client/issues/4388

andreaTP avatar Sep 05 '22 12:09 andreaTP

see related issue: https://github.com/fabric8io/kubernetes-client/issues/4527

csviri avatar Oct 25 '22 12:10 csviri

🎉 great work!

andreaTP avatar Nov 23 '22 13:11 andreaTP