operator-lifecycle-manager
Make CatalogSource the source of truth for available catalogs.
Internally, the catalog operator has always maintained a set of registry clients, one per CatalogSource. Although this set is reconciled toward containing a client for every CatalogSource object, changes to CatalogSources take some time to be reflected in the client set, and any difference between the CatalogSources that exist and the membership of the client set is a potential source of error.
Instead, the catalog operator should list CatalogSources from its informer cache to determine which catalogs are reachable from a given namespace.
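As a rough illustration of the idea (not the code in this PR), the sketch below derives the visible catalogs by listing from a cache rather than from the registry client set. The `CatalogSource` struct, the `Lister` interface, `cacheLister`, and `visibleCatalogs` are all hypothetical stand-ins, and the rule that a namespace sees its own catalogs plus those in one global catalog namespace is an assumption made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// CatalogSource is a hypothetical, pared-down stand-in for the real API object.
type CatalogSource struct {
	Namespace string
	Name      string
}

// Lister is a hypothetical stand-in for an informer-backed lister.
type Lister interface {
	List(namespace string) []CatalogSource
}

// cacheLister fakes an informer cache with a map keyed by namespace.
type cacheLister map[string][]CatalogSource

func (c cacheLister) List(namespace string) []CatalogSource { return c[namespace] }

// visibleCatalogs lists the catalogs visible from namespace: its own plus
// those in globalNamespace (an assumed visibility rule). The result is
// sorted so that callers see a deterministic order.
func visibleCatalogs(l Lister, namespace, globalNamespace string) []CatalogSource {
	out := append([]CatalogSource{}, l.List(namespace)...)
	if globalNamespace != namespace {
		out = append(out, l.List(globalNamespace)...)
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Namespace != out[j].Namespace {
			return out[i].Namespace < out[j].Namespace
		}
		return out[i].Name < out[j].Name
	})
	return out
}

func main() {
	cache := cacheLister{
		"team-a": {{Namespace: "team-a", Name: "private"}},
		"olm":    {{Namespace: "olm", Name: "community"}},
	}
	for _, c := range visibleCatalogs(cache, "team-a", "olm") {
		fmt.Printf("%s/%s\n", c.Namespace, c.Name)
	}
}
```

Because the list comes straight from the cache, a CatalogSource that exists but has no connected client still appears in the result, and the caller can then decide how to handle it.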
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: benluddy
- ~~OWNERS~~ [benluddy]
/hold
Here's the progression of the ResolutionFailed condition with this patch. The second line is new:
v1alpha1.SubscriptionCondition{Type:"ResolutionFailed", Status:"Unknown", Reason:"", Message:"", LastHeartbeatTime:<nil>, LastTransitionTime:<nil>}
v1alpha1.SubscriptionCondition{Type:"ResolutionFailed", Status:"True", Reason:"ErrorPreventedResolution", Message:"error using catalog without-registry-server-q8l65 (in namespace subscription-e2e-q8q7t): registry server not reachable for catalogsource subscription-e2e-q8q7t/without-registry-server-q8l65", LastHeartbeatTime:<nil>, LastTransitionTime:<nil>}
v1alpha1.SubscriptionCondition{Type:"ResolutionFailed", Status:"True", Reason:"ErrorPreventedResolution", Message:"error using catalog without-registry-server-q8l65 (in namespace subscription-e2e-q8q7t): failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.96.106.125:50051: i/o timeout\"", LastHeartbeatTime:<nil>, LastTransitionTime:<nil>}
Without this patch, the unreachable catalog is treated as completely empty and resolution proceeds as normal. This can produce ResolutionFailed: true with reason ConstraintsNotSatisfiable, or it can actually choose to install something that is technically acceptable but lower priority than the candidate in the unreachable catalog. In other words, transient skew between the SourceStore and the informer cache can cause the catalog operator to behave nondeterministically.
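The difference between the two behaviors can be sketched as follows. This is a simplified model, not the resolver's actual code: `catalogKey`, `listLenient`, `listStrict`, the `errUnreachable` sentinel, and the bundle names are hypothetical, and the error text only loosely mirrors the condition message shown above.

```go
package main

import (
	"errors"
	"fmt"
)

// catalogKey identifies a catalog; a hypothetical stand-in for the real key type.
type catalogKey struct {
	Namespace string
	Name      string
}

// errUnreachable is a hypothetical sentinel for a catalog with no usable client.
var errUnreachable = errors.New("registry server not reachable")

// listLenient models the old behavior: a catalog with no client contributes
// no bundles, so it is indistinguishable from an empty catalog.
func listLenient(catalogs []catalogKey, clients map[catalogKey][]string) []string {
	var bundles []string
	for _, key := range catalogs {
		bundles = append(bundles, clients[key]...)
	}
	return bundles
}

// listStrict models the patched behavior: a catalog that is known to exist
// but has no client stops resolution with an error.
func listStrict(catalogs []catalogKey, clients map[catalogKey][]string) ([]string, error) {
	var bundles []string
	for _, key := range catalogs {
		found, ok := clients[key]
		if !ok {
			return nil, fmt.Errorf("error using catalog %s (in namespace %s): %w",
				key.Name, key.Namespace, errUnreachable)
		}
		bundles = append(bundles, found...)
	}
	return bundles, nil
}

func main() {
	catalogs := []catalogKey{
		{Namespace: "ns", Name: "preferred"},
		{Namespace: "ns", Name: "fallback"},
	}
	// Only the lower-priority catalog has a connected client.
	clients := map[catalogKey][]string{
		{Namespace: "ns", Name: "fallback"}: {"example-operator.v1.0.0"},
	}

	fmt.Println("lenient:", listLenient(catalogs, clients))
	_, err := listStrict(catalogs, clients)
	fmt.Println("strict:", err)
}
```

In the lenient variant the resolver would silently proceed with only the fallback catalog's bundle, which is how a lower-priority candidate can end up installed.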
@benluddy: PR needs rebase.
Thanks for the PR @benluddy 🎉
Do you think you'll have some cycles to get this PR to a reviewable stage? Or would you prefer someone from the operator-framework team to take over this PR for you?
I still think this is worthwhile, but I am several years out of the loop when it comes to operator framework issues.