Watch only for changes on resources participating in a service binding
As the investigation in #605 shown, the operator monitors a huge number of resources in all namespaces:
- all config maps and secrets
- all CSVs
- all CRDs owned by CSV
Such strategy puts a very high pressure on the operator's event queue on clusters with non-trivial workload, increasing the operator memory consumption, and eventually crashing it.
In order to reduce resource usage and behave reliably, we should start monitoring resources only if they are referred through a service binding:
- at the operator startup, start monitoring service bindings in all or given namespaces
- at service binding reconcile, start monitor referred service and application kinds, but only in the namespace of reconciled service binding. On top of that start monitoring service CRD and its CSV if any
This strategy is going to reduce significantly the number of events landing in operator's queue and open up possibility to scale better in large clusters containing a lot of namespaces.
I think we should also check the memory leak metrics on a namespace based installation and a cluster wide operator installation. To assess how broad/large the impact is.
So feature wise SBOperator should:
- Watch only for SBO objects (operate on existing state)
- when it is created validate if all provided info is right and resources exist. Provide error if problems as part of the status
- Mount resources to the app container according to the spec
- new flag should be introduced for SBO to watch secrets and underlying CRs for changes so config can be adjusted for those.. but only if user wants that. I see SBO as tool that does the job when asked (SB object created) and optionally can also watch and mount again changes in service CR or secret - but that is optional - I would not want some operator to randomly redeploy my app every time my CR change
CSV watching should be removed IMHO - it is core of the problem.
Also this approach will allow sbo to exist on clusters without putting any load on them. Openshift dedicated on start has 962 objects what operator will watch. When this is applied operator will only watch for SB objects and become invisible to admins when not used
With little bit of research I think that moving to that model will not cause any API changes - only behaviour under some corner cases. It is good to adapt this model before 1.0 as later it would be hard to do so without braking people's expectations.
This will:
- reduce codebase by half
- reduce memory usage and overal footprint of operator when not used
- simplify binding processing and it's documentation
- would allow to adapt proper error handling using conditions applied in reconcile loop.
- would be still relatively easy as it will be based on reduction of the features that are not functioning/little value
example simplifications
Currently operator looks for all CSV on cluster to get spec (even when user do not want to use it). We can make change to fetch spec (metadata) on reconcile loop only when binding was created.
We can stop reconciling binding that has errors continuously by flagging them
We could listen to kubernetes events on the secrets rather than continuous reconciliations
CC @gorkem
To improve the performance:
- The controller should stop watching changes to CSVs
- The controller should watch only specific workload types we've tested with ( may or may not be from OLM operators )
- The controller should watch only specific backing service types we've tested with ( may or may not be from OLM operators )
What can be read: The above list doesn't necessarily impact what can be 'read'. The controller may still be allowed to read a wide variety of types - that's OK!
In addition, we must introduce a way for an admin to configure what can be watched.
What is the usecase we would like to cover by watching anything besides Service Bindings? As @wtrocki stated above, it is very surprising for app developer that binding between application and service changes on its own. Such systems are hard to reason about.
So, you are going to stop watching CSVs, right? Our Sandbox clusters have thousands of them.
We are going to implement watches for resources bound via label selectors. The work is tracked here https://issues.redhat.com/browse/APPSVC-1112