operator-registry icon indicating copy to clipboard operation
operator-registry copied to clipboard

make 'opm serve' initialization asynchronous

Open joelanford opened this issue 3 years ago • 7 comments

Signed-off-by: Joe Lanford [email protected]

Description of the change: This PR modifies opm serve so that loading FBC is handled asynchronously in the background so that the grpc server can start immediately, thus enabling the health probes to return successfully faster.

All of the grpc endpoints that require the FBC to be loaded now block, waiting for the initialization to complete. Clients can now initiate a connection and the server will block the response until initialization completes. Clients that don't want to wait indefinitely can set timeouts on their requests.

Motivation for the change: While we shipped a change in catalog-operator to make use of the startupProbe functionality, that fix will not solve problems caused by opm serve's long startup times on clusters without the startupProbe.

This fix solves that problem by ensuring the grpc server starts immediately (much like the sqlite-based grpc server).

Reviewer Checklist

  • [x] Implementation matches the proposed design, or proposal is updated to match implementation
  • [x] Sufficient unit test coverage
  • [ ] Sufficient end-to-end test coverage
  • [ ] Docs updated or added to /docs
  • [x] Commit messages sensible and descriptive

joelanford avatar Jun 23 '22 15:06 joelanford

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: joelanford

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci[bot] avatar Jun 23 '22 15:06 openshift-ci[bot]

Codecov Report

Merging #977 (2c6e79b) into master (8250533) will increase coverage by 2.67%. The diff coverage is 85.45%.

:exclamation: Current head 2c6e79b differs from pull request most recent head 304c402. Consider uploading reports for the commit 304c402 to get more accurate results

@@            Coverage Diff             @@
##           master     #977      +/-   ##
==========================================
+ Coverage   52.60%   55.27%   +2.67%     
==========================================
  Files         104      130      +26     
  Lines        9422    10826    +1404     
==========================================
+ Hits         4956     5984    +1028     
- Misses       3536     3763     +227     
- Partials      930     1079     +149     
Impacted Files Coverage Δ
pkg/registry/query.go 67.50% <85.45%> (+6.56%) :arrow_up:
pkg/image/execregistry/registry.go 54.54% <0.00%> (ø)
pkg/lib/unstructured/unstructured.go 60.71% <0.00%> (ø)
pkg/api/grpc_health_v1/health_grpc.pb.go 0.00% <0.00%> (ø)
...inertools/containertoolsfakes/fake_label_reader.go 0.00% <0.00%> (ø)
pkg/sqlite/sqlitefakes/fake_rowscanner.go 47.78% <0.00%> (ø)
pkg/lib/log/writerhook.go 0.00% <0.00%> (ø)
pkg/lib/config/validate.go 0.00% <0.00%> (ø)
...s/containertoolsfakes/fake_dockerfile_generator.go 0.00% <0.00%> (ø)
pkg/image/containerdregistry/resolver.go 57.53% <0.00%> (ø)
... and 41 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 8250533...304c402. Read the comment docs.

codecov[bot] avatar Jun 23 '22 17:06 codecov[bot]

https://github.com/operator-framework/operator-registry/runs/7029116066?check_suite_focus=true#step:4:80

I think this is saying that var _ GRPCQuery = Querier{} would fail to compile. And I think this PR "breaks" that by making all of those Querier methods use pointer receivers. I don't think we ever intentionally tried to ensure Querier implemented GRPCQuery

However, we did intend for *Querier to implement GRPCQuery and that is still the case:

https://github.com/joelanford/operator-registry/blob/0a337c5535ae605d4b3fd4c86f9592d0180ce0ec/pkg/registry/query.go#L45

joelanford avatar Jun 23 '22 18:06 joelanford

/hold

There are some implications on OLM that we need to consider:

  1. With the current request-blocking behavior, that might cause the OLM cache controller to block anytime it makes a request against a catalog source that is initializing. Blocking could manifest as higher latency reconciliations.
  2. If we add a server-enforced timeout, the OLM cache controller could get deadline errors during catalog startup and cause unexpected behavior (e.g. failed resolutions, incorrect retry logic)

joelanford avatar Jul 07 '22 16:07 joelanford

@joelanford: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci[bot] avatar Jul 07 '22 22:07 openshift-ci[bot]

@joelanford is this issue on permanent hold or do you see this getting resolved as is?

jdockter avatar Jul 24 '22 21:07 jdockter

@jdockter I don't think it's permanently blocked. We just need to prioritize checking the implications of this change on clients of the GRPC server (catalog-operator and packageserver).

joelanford avatar Jul 25 '22 12:07 joelanford

Closing in favor of #1005

joelanford avatar Aug 24 '22 13:08 joelanford