operator-lifecycle-manager icon indicating copy to clipboard operation
operator-lifecycle-manager copied to clipboard

Make CatalogSource the source of truth for available catalogs.

Open benluddy opened this issue 2 years ago • 4 comments

Internally, the catalog operator has always maintained a set of registry clients for each CatalogSource. Although this set is reconciled toward containing a client per CatalogSource object, there is some latency before changes made to CatalogSources are reflected in the client set, and differences between CatalogSources and client set membership are a potential source of error.

Instead, the catalog operator should list CatalogSources from its informer cache to determine which catalogs are reachable from a given namespace.

benluddy avatar May 19 '22 18:05 benluddy

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: benluddy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci[bot] avatar May 19 '22 18:05 openshift-ci[bot]

/hold

benluddy avatar May 19 '22 18:05 benluddy

Here's the progression of the ResolutionFailed condition with this patch. The second line is new:

v1alpha1.SubscriptionCondition{Type:"ResolutionFailed", Status:"Unknown", Reason:"", Message:"", LastHeartbeatTime:<nil>, LastTransitionTime:<nil>}
v1alpha1.SubscriptionCondition{Type:"ResolutionFailed", Status:"True", Reason:"ErrorPreventedResolution", Message:"error using catalog without-registry-server-q8l65 (in namespace subscription-e2e-q8q7t): registry server not reachable for catalogsource subscription-e2e-q8q7t/without-registry-server-q8l65", LastHeartbeatTime:<nil>, LastTransitionTime:<nil>}
v1alpha1.SubscriptionCondition{Type:"ResolutionFailed", Status:"True", Reason:"ErrorPreventedResolution", Message:"error using catalog without-registry-server-q8l65 (in namespace subscription-e2e-q8q7t): failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.96.106.125:50051: i/o timeout\"", LastHeartbeatTime:<nil>, LastTransitionTime:<nil>}

Without it, the unreachable catalog is treated as completely empty and resolution proceeds as normal. This can produce ResolutionFailed: true with reason ConstraintsNotSatsfiable, or it can actually choose to install something that is technically acceptable but lower priority than the candidate in the unreachable catalog. In other words, transient skew between the SourceStore and the informer cache can cause the catalog operator to behave nondeterministically.

benluddy avatar May 19 '22 18:05 benluddy

@benluddy: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci[bot] avatar Jun 20 '22 10:06 openshift-ci[bot]

Thanks for the PR @benluddy 🎉

Do you think you'll have some cycles to get this PR to a reviewable stage? Or would you prefer someone from the operator-framework team to take over this PR for you?

anik120 avatar Dec 15 '22 05:12 anik120

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-merge-robot avatar Sep 23 '23 12:09 openshift-merge-robot

I still think this is worthwhile, but I am several years out of the loop when it comes to operator framework issues.

benluddy avatar Jan 26 '24 13:01 benluddy