operator-registry icon indicating copy to clipboard operation
operator-registry copied to clipboard

pkg/cache: add preferred pogreb database cache impl

Open joelanford opened this issue 1 year ago • 4 comments

Description of the change: This PR adds a new caching implementation that has some improvements over our existing JSON cache implementation. It also maintains backward compatibility to gracefully handle serving existing JSON-based caches.

I ran a few tests for compatibility:

  1. With an opm from this PR, the following happens:

    • Success when the cache contains JSON files
    • Success when the cache contains a Pogreb DB
    • Success when the cache contains both the JSON and Pogreb files side-by-side (and it uses Pogreb)
  2. With an opm from the current master branch, the following happens:

    • Success when the cache contains JSON files
    • Failure when the cache contains a Pogreb DB
    • Success when the cache contains both the JSON and Pogreb files side-by-side (and it uses JSON). Therefore adding a Pogreb DB to the cache is forward compatible.

Motivation for the change:

  • It stores items in an embedded database (see https://github.com/akrylysov/pogreb), which seems to fit the mold of our use case (infrequent bulk load, read heavy clients)
  • It serializes/deserializes data as protobuf rather than JSON (so smaller and faster)
  • It uses ~50% less CPU to serve similar request loads
  • It uses ~66% less peak memory to serve similar request loads
  • Its steady state memory is about the same.
  • It is much faster to serve (ran some local benchmarks with ghz with 20 concurrent connections)
    • JSON
      • 10% latency 1.63s
      • 99% latency 2.53s
    • Pogreb
      • 10% latency 681ms
      • 99% latency 841ms
  • The digest calculation is based purely on the FBC and cache data (the current JSON digest uses the tar format and is pretty finicky)
    • For FBC, walk meta blobs, writing them each to the digester
    • For the cache, iterate all the items in the k/v store, writing each k+v to the digester

Cache size improvements

  • I've been testing with a migrated operatorhub catalog (that is one that has olm.csv.metadata)
  • Pogreb: 70M
  • Json: 89M

Reviewer Checklist

  • [ ] Implementation matches the proposed design, or proposal is updated to match implementation
  • [ ] Sufficient unit test coverage
  • [ ] Sufficient end-to-end test coverage
  • [ ] Docs updated or added to /docs
  • [ ] Commit messages sensible and descriptive

joelanford avatar Apr 18 '24 21:04 joelanford

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

openshift-ci[bot] avatar Apr 18 '24 21:04 openshift-ci[bot]

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: joelanford

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci[bot] avatar Apr 18 '24 21:04 openshift-ci[bot]

Codecov Report

Attention: Patch coverage is 60.86142% with 209 lines in your changes are missing coverage. Please review.

:exclamation: No coverage uploaded for pull request base (master@9fdedc2). Click here to learn what that means. Report is 4 commits behind head on master.

:exclamation: Current head 0082947 differs from pull request most recent head f9792af. Consider uploading reports for the commit f9792af to get more accurate results

Files Patch % Lines
pkg/cache/cache.go 56.58% 67 Missing and 22 partials :warning:
pkg/cache/pogrebv1.go 55.08% 33 Missing and 20 partials :warning:
pkg/cache/json.go 54.16% 32 Missing and 12 partials :warning:
alpha/declcfg/load.go 77.38% 13 Missing and 6 partials :warning:
pkg/cache/bundle_key.go 77.77% 2 Missing and 2 partials :warning:
Additional details and impacted files
@@            Coverage Diff            @@
##             master    #1278   +/-   ##
=========================================
  Coverage          ?   50.05%           
=========================================
  Files             ?      135           
  Lines             ?    12993           
  Branches          ?        0           
=========================================
  Hits              ?     6504           
  Misses            ?     5405           
  Partials          ?     1084           

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Apr 19 '24 20:04 codecov[bot]

I had an epiphany overnight!

The third commit in the PR changes the way the cache is built. It writes meta objects to a temporary file and records the locations of each meta in the file, grouped by package. That way we can later read just the metas for a particular package into memory.

Then we go package by package building a model, converting to the package index, and writing api bundles to the cache. The beauty is that only a single package's model is loaded in memory at any given time.

This may mean that we can stop storing caches in the catalog image and can go back to building them on-the-fly when the container starts!

I noticed that when using an FBC with olm.csv.metadata, startup peak memory and time was basically inconsequential when building a cache on the fly.

joelanford avatar Apr 20 '24 13:04 joelanford

/lgtm

grokspawn avatar May 14 '24 20:05 grokspawn