controller-runtime

Dynamically cache objects to handle permission changes and lower cache size

Open jmrodri opened this issue 4 years ago • 8 comments

Today there are typically two types of operators: those that watch many namespaces (or the whole cluster) and those that watch a single namespace. If an operator is watching the entire cluster, it is reacting to changes to its own CR across the entire cluster. Many operators also require secondary resources during their lifecycle, and most of them only need to watch these secondary resources in a specific namespace rather than across the entire cluster.

In the above scenario, the ideal set of cached resources are:

  1. All primary objects
  2. Any secondary objects whose events should trigger reconciliation of their primary objects
  3. Any other object required by the controller to function

The ideal caching mechanisms would allow:

  1. Watching a particular resource across the cluster (primary objects)
  2. Watching a set of resources on behalf of a primary object only while that primary object exists (for secondary objects)

If the primary object is an instance of a namespace-scoped CRD, you will often want namespace scoped watches for the secondary objects.
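For reference, here is a rough sketch of what can be configured today, assuming the controller-runtime API that was current when this issue was filed (`manager.Options.NewCache` with `cache.MultiNamespacedCacheBuilder`); the namespace list is hypothetical and must be known at startup:

```go
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

func main() {
	// Hypothetical namespace list; in the scenario above these would be the
	// namespaces holding the secondary objects.
	namespaces := []string{"team-a", "team-b"}

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		// MultiNamespacedCacheBuilder restricts list/watch calls (and therefore
		// the cache) to the given namespaces. The set is fixed at startup,
		// which is exactly the limitation discussed in this issue.
		NewCache: cache.MultiNamespacedCacheBuilder(namespaces),
	})
	if err != nil {
		os.Exit(1)
	}
	_ = mgr // controllers would be wired up here
}
```

As far as I can tell, this restricts every namespaced informer to the same fixed namespace set; there is no way to express "cluster-wide for the primary CR, per-primary-object namespaces for the secondaries", which is what the two points above ask for.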

If an operator is not given the proper permissions to access its secondary resources at startup, the caches will not be populated. If the cluster admin grants the operator the proper permissions after startup, the caches will still not be populated, so lookups of secondary objects fail.

The MultiNamespacedCache seems like a useful solution to the above problem. Unfortunately, this cache requires all of the namespaces to be known and accessible upfront. Knowing the namespaces upfront is not necessarily a bad thing, but without the proper permissions the cache still cannot be populated.

Operators also need to limit their cache size in memory-constrained environments. Today we offer filtering, where only the matching items end up in your cache, but if you look for something outside that filter it is NOT FOUND. It would be preferable to cache only the things we need, as we fetch them.
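One partial workaround for the NOT FOUND behaviour, sketched below with hypothetical type and field names, is to fall back to the manager's uncached reader (`mgr.GetAPIReader()`) when the filtered cache misses; this trades memory for extra API calls rather than solving the problem described above:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// SecretReconciler is a hypothetical reconciler whose manager cache only holds
// a filtered subset of Secrets.
type SecretReconciler struct {
	client.Client               // the manager's cached client
	APIReader     client.Reader // from mgr.GetAPIReader(); reads bypass the cache
}

func (r *SecretReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var secret corev1.Secret
	key := types.NamespacedName{Namespace: req.Namespace, Name: req.Name}

	if err := r.Get(ctx, key, &secret); err != nil {
		if !apierrors.IsNotFound(err) {
			return ctrl.Result{}, err
		}
		// The object may exist on the cluster but be excluded by the cache
		// filter, so fall back to a live read that bypasses the cache.
		if err := r.APIReader.Get(ctx, key, &secret); err != nil {
			// Still not there (or another error): nothing to do for now.
			return ctrl.Result{}, client.IgnoreNotFound(err)
		}
	}
	// ... reconcile using secret ...
	return ctrl.Result{}, nil
}
```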

jmrodri avatar Sep 03 '21 19:09 jmrodri

I don't think permissions changing after startup is a use case we should consider or care about. That's part of the setup process.

coderanger avatar Sep 03 '21 19:09 coderanger

Watching a set of resources on behalf of a primary object only while that primary object exists (for secondary objects)

This doesn't sound like something that is scoped to a namespace, but something that would require arbitrary selectors.

It also essentially requires a second controller that manages the caches based on the presence of primary objects (we might be able to have the cache implementation allocate additional caches, but it won't be able to clean them up).

Overall this seems like something that is going to be pretty complicated to use, because it will need a lot of configuration: how to infer from the primary object which secondary caches to create and with what selector, and how to clean them up on delete (which likely requires looking at the entire primary object cache, as multiple primary objects could want the same secondary object cache).

For the RBAC use case, I would suggest simply restarting the controller; as @coderanger pointed out, that is a configuration change. (If you disagree here, please describe the mechanism you are using to change RBAC, how often it changes, etc.)

For the cache size issue, it seems easier to simply label everything and use a label selector; that way the API server does the filtering for you, and your cache only holds the objects you actually care about.
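A sketch of that suggestion, assuming a controller-runtime release where `cache.Options` has a per-object `ByObject` filter (older releases spell this differently); the label key is made up for illustration:

```go
package main

import (
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	// Hypothetical label that the operator puts on every object it manages.
	selector := labels.SelectorFromSet(labels.Set{"example.com/managed-by": "my-operator"})

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			// Only Secrets carrying the label are listed, watched, and cached;
			// the API server does the filtering, so unlabeled Secrets never
			// reach the operator at all.
			ByObject: map[client.Object]cache.ByObject{
				&corev1.Secret{}: {Label: selector},
			},
		},
	})
	if err != nil {
		os.Exit(1)
	}
	_ = mgr // controllers would be wired up here
}
```

The trade-off is the one described in the issue: a cached `Get` for an unlabeled Secret returns NOT FOUND even though the object exists on the cluster.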

alvaroaleman avatar Sep 03 '21 21:09 alvaroaleman

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 02 '21 22:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jan 01 '22 22:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot avatar Jan 31 '22 22:01 k8s-triage-robot

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jan 31 '22 22:01 k8s-ci-robot

Re-opening to continue the discussion of the "RBAC might change" use case.

In the operator-framework project, we've heard from cluster administrators that they want more control over the RBAC that is associated with the service account used to run the operator. Essentially, they want to be able to have fine-grained control to limit the operator's ability to do certain things.

Consider an operator that reconciles secrets as part of an operand. To reconcile secrets across the cluster, the operator needs to be able to list/watch them to react to changes. However, that exposes every single cluster secret to that controller. Therefore a cluster administrator decides to change that operator's RBAC to scope it down to a specific set of namespaces, and maybe even specific resource names.

Essentially, an operator's RBAC is the domain of the cluster admin, not something the operator defines for itself. And RBAC requirements change over the course of the lifetime of a cluster.

We're investigating various approaches to RBAC templating where an operator could deliver a template (perhaps with a default rendering) that a cluster admin could then manipulate. They'd theoretically be able to cut across both scope lines as well as specific "capability" lines (e.g. permit an operator to create ingresses or not for its operands).

A well-behaved operator should be able to detect during a reconciliation that it has insufficient RBAC for operand-related operations and then update the primary object's status to report back that it is unable to reconcile due to insufficient permission.
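As an illustration of that pattern (a sketch only; the `Degraded` condition type, the reasons, and the helper name are invented), a reconciler could treat a Forbidden response from the API server as a reportable state rather than a hard failure. Using an uncached `client.Reader` (e.g. from `mgr.GetAPIReader()`) keeps the RBAC error visible instead of it surfacing as a cache-sync timeout:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// checkSecretAccess is a hypothetical helper: it lists Secrets in the operand's
// namespace with an uncached reader and, if the controller's ServiceAccount
// lacks permission, records a condition the caller can write back to the
// primary object's status.
func checkSecretAccess(ctx context.Context, reader client.Reader, namespace string, conditions *[]metav1.Condition) error {
	var secrets corev1.SecretList
	err := reader.List(ctx, &secrets, client.InNamespace(namespace))
	switch {
	case apierrors.IsForbidden(err):
		meta.SetStatusCondition(conditions, metav1.Condition{
			Type:    "Degraded",
			Status:  metav1.ConditionTrue,
			Reason:  "InsufficientPermissions",
			Message: err.Error(),
		})
		return nil // report via status instead of failing the reconcile
	case err != nil:
		return err
	}
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:   "Degraded",
		Status: metav1.ConditionFalse,
		Reason: "AsExpected",
	})
	return nil
}
```

The caller would then persist the updated conditions with a status update on the primary object.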

Sure, the cluster admin could restart the controllers after changing the RBAC, but it's not super obvious which controllers need to be restarted when RBAC changes. A cluster admin would need to map any rule change through the binding to find the service account and then figure out which controllers are using that service account. Not impossible, but not exactly user-friendly.

This does seem like a general problem, but perhaps there are other solutions (e.g. a separate controller that watches RBAC changes and automatically bounces affected pods). IMO, it would be nice to avoid a full restart though, since that's pretty heavy handed.
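For completeness, the "separate controller" alternative could look roughly like the sketch below (purely illustrative; the type name is made up, and ClusterRoleBindings would need the same treatment): it watches RoleBindings and deletes the pods running under any bound ServiceAccount so their owners recreate them with caches built against the new permissions. It automates the restart, but it is still the heavy-handed option:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	rbacv1 "k8s.io/api/rbac/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// RBACBounceReconciler is a hypothetical controller that bounces pods whose
// ServiceAccount is referenced by a RoleBinding that just changed.
type RBACBounceReconciler struct {
	client.Client
}

func (r *RBACBounceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var rb rbacv1.RoleBinding
	if err := r.Get(ctx, req.NamespacedName, &rb); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	for _, subject := range rb.Subjects {
		if subject.Kind != rbacv1.ServiceAccountKind || subject.Namespace == "" {
			continue
		}
		var pods corev1.PodList
		if err := r.List(ctx, &pods, client.InNamespace(subject.Namespace)); err != nil {
			return ctrl.Result{}, err
		}
		for i := range pods.Items {
			if pods.Items[i].Spec.ServiceAccountName != subject.Name {
				continue
			}
			// Delete the pod; its Deployment/StatefulSet recreates it, and the
			// fresh process builds its caches under the new permissions.
			if err := r.Delete(ctx, &pods.Items[i]); err != nil {
				return ctrl.Result{}, client.IgnoreNotFound(err)
			}
		}
	}
	return ctrl.Result{}, nil
}

func (r *RBACBounceReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&rbacv1.RoleBinding{}).
		Complete(r)
}
```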

joelanford avatar Jun 15 '22 19:06 joelanford

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 12 '22 16:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Nov 11 '22 16:11 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Dec 11 '22 17:12 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Dec 11 '22 17:12 k8s-ci-robot