make 'opm serve' initialization asynchronous
Signed-off-by: Joe Lanford [email protected]
Description of the change:
This PR modifies opm serve so that loading FBC is handled asynchronously in the background so that the grpc server can start immediately, thus enabling the health probes to return successfully faster.
All of the grpc endpoints that require the FBC to be loaded now block, waiting for the initialization to complete. Clients can now initiate a connection and the server will block the response until initialization completes. Clients that don't want to wait indefinitely can set timeouts on their requests.
Motivation for the change:
While we shipped a change in catalog-operator to make use of the startupProbe functionality, that fix will not solve problems caused by opm serve's long startup times on clusters without the startupProbe.
This fix solves that problem by ensuring the grpc server starts immediately (much like the sqlite-based grpc server).
Reviewer Checklist
- [x] Implementation matches the proposed design, or proposal is updated to match implementation
- [x] Sufficient unit test coverage
- [ ] Sufficient end-to-end test coverage
- [ ] Docs updated or added to
/docs - [x] Commit messages sensible and descriptive
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: joelanford
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [joelanford]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Codecov Report
Merging #977 (2c6e79b) into master (8250533) will increase coverage by
2.67%. The diff coverage is85.45%.
:exclamation: Current head 2c6e79b differs from pull request most recent head 304c402. Consider uploading reports for the commit 304c402 to get more accurate results
@@ Coverage Diff @@
## master #977 +/- ##
==========================================
+ Coverage 52.60% 55.27% +2.67%
==========================================
Files 104 130 +26
Lines 9422 10826 +1404
==========================================
+ Hits 4956 5984 +1028
- Misses 3536 3763 +227
- Partials 930 1079 +149
| Impacted Files | Coverage Δ | |
|---|---|---|
| pkg/registry/query.go | 67.50% <85.45%> (+6.56%) |
:arrow_up: |
| pkg/image/execregistry/registry.go | 54.54% <0.00%> (ø) |
|
| pkg/lib/unstructured/unstructured.go | 60.71% <0.00%> (ø) |
|
| pkg/api/grpc_health_v1/health_grpc.pb.go | 0.00% <0.00%> (ø) |
|
| ...inertools/containertoolsfakes/fake_label_reader.go | 0.00% <0.00%> (ø) |
|
| pkg/sqlite/sqlitefakes/fake_rowscanner.go | 47.78% <0.00%> (ø) |
|
| pkg/lib/log/writerhook.go | 0.00% <0.00%> (ø) |
|
| pkg/lib/config/validate.go | 0.00% <0.00%> (ø) |
|
| ...s/containertoolsfakes/fake_dockerfile_generator.go | 0.00% <0.00%> (ø) |
|
| pkg/image/containerdregistry/resolver.go | 57.53% <0.00%> (ø) |
|
| ... and 41 more |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact),ø = not affected,? = missing dataPowered by Codecov. Last update 8250533...304c402. Read the comment docs.
https://github.com/operator-framework/operator-registry/runs/7029116066?check_suite_focus=true#step:4:80
I think this is saying that var _ GRPCQuery = Querier{} would fail to compile. And I think this PR "breaks" that by making all of those Querier methods use pointer receivers. I don't think we ever intentionally tried to ensure Querier implemented GRPCQuery
However, we did intend for *Querier to implement GRPCQuery and that is still the case:
https://github.com/joelanford/operator-registry/blob/0a337c5535ae605d4b3fd4c86f9592d0180ce0ec/pkg/registry/query.go#L45
/hold
There are some implications on OLM that we need to consider:
- With the current request-blocking behavior, that might cause the OLM cache controller to block anytime it makes a request against a catalog source that is initializing. Blocking could manifest as higher latency reconciliations.
- If we add a server-enforced timeout, the OLM cache controller could get deadline errors during catalog startup and cause unexpected behavior (e.g. failed resolutions, incorrect retry logic)
@joelanford: PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@joelanford is this issue on permanent hold or do you see this getting resolved as is?
@jdockter I don't think it's permanently blocked. We just need to prioritize checking the implications of this change on clients of the GRPC server (catalog-operator and packageserver).
Closing in favor of #1005