kubernetes icon indicating copy to clipboard operation
kubernetes copied to clipboard

Reimplement watch cache storage with btree

Open serathius opened this issue 1 year ago • 10 comments
trafficstars

Continuing work of https://github.com/kubernetes/kubernetes/pull/108392

Started from taking the exact implementation by @MadhavJivrajani, however updated the btree factor to 32.

Using btree without any changes already is faster than existing ThreadedSafeStore. There are still many improvements that could be implemented.

Btree is awesome for watch cache due to:

  • Support key ranges
  • Returns ordered results

Also etcd uses btree to serve ranges too.

Performance improvement:

  • 25% improvement in operations per second
  • 15% less bytes allocated
goos: linux
goarch: amd64
pkg: k8s.io/apiserver/pkg/storage/cacher
cpu: AMD Ryzen 5 7600X 6-Core Processor
                                                  │  base.txt   │            btree.txt             │
                                                  │   sec/op    │   sec/op     vs base                │
StoreListCreate/RV=NotOlderThan-12                  698.8µ ± 4%   227.9µ ± 2%  -67.39% (p=0.000 n=10)
StoreListCreate/RV=ExactMatch-12                    356.4µ ± 2%   358.7µ ± 2%        ~ (p=0.631 n=10)
StoreList/RV=Empty-12                               982.5µ ± 6%   722.9µ ± 4%  -26.43% (p=0.000 n=10)
StoreList/RV=NotOlderThan-12                        972.3µ ± 2%   700.7µ ± 2%  -27.93% (p=0.000 n=10)
StoreList/RV=MatchExact-12                          16.55m ± 3%   16.38m ± 2%        ~ (p=0.218 n=10)
StoreList/Paginate/RV=Empty-12                      18.55m ± 1%   18.49m ± 1%        ~ (p=0.353 n=10)
StoreList/Paginate/RV=NotOlderThan-12               18.39m ± 3%   18.45m ± 1%        ~ (p=0.971 n=10)
StoreList/Paginate/RV=MatchExact-12                 18.28m ± 1%   18.47m ± 1%   +1.03% (p=0.003 n=10)
StoreList/NodeIndexed/RV=Empty-12                   1.791m ± 3%   1.796m ± 2%        ~ (p=0.796 n=10)
StoreList/NodeIndexed/RV=NotOlderThan-12            1.234m ± 4%   1.250m ± 3%        ~ (p=0.796 n=10)
StoreList/NodeIndexed/RV=MatchExact-12               7.460 ± 1%    7.455 ± 1%        ~ (p=0.796 n=10)
StoreList/NodeIndexed/Paginate/RV=Empty-12          70.41m ± 1%   64.66m ± 4%   -8.17% (p=0.000 n=10)
StoreList/NodeIndexed/Paginate/RV=NotOlderThan-12   69.42m ± 1%   63.59m ± 1%   -8.39% (p=0.000 n=10)
StoreList/NodeIndexed/Paginate/RV=MatchExact-12      8.077 ± 1%    8.088 ± 1%        ~ (p=0.165 n=10)
StoreList/Namespace/RV=Empty-12                     6.421m ± 1%   1.675m ± 1%  -73.91% (p=0.000 n=10)
StoreList/Namespace/RV=NotOlderThan-12              5.670m ± 2%   1.069m ± 1%  -81.14% (p=0.000 n=10)
StoreList/Namespace/RV=MatchExact-12                15.84m ± 2%   16.33m ± 3%   +3.09% (p=0.000 n=10)
geomean                                             13.05m        9.789m       -24.98%

                                                  │   base.txt    │             btree.txt             │
                                                  │     B/op      │     B/op      vs base                │
StoreListCreate/RV=NotOlderThan-12                   293.7Ki ± 4%   233.1Ki ± 4%  -20.65% (p=0.000 n=10)
StoreListCreate/RV=ExactMatch-12                     115.0Ki ± 0%   114.9Ki ± 0%   -0.09% (p=0.000 n=10)
StoreList/RV=Empty-12                                6.179Mi ± 0%   6.023Mi ± 0%   -2.53% (p=0.000 n=10)
StoreList/RV=NotOlderThan-12                         6.163Mi ± 0%   6.007Mi ± 0%   -2.53% (p=0.000 n=10)
StoreList/RV=MatchExact-12                           60.32Mi ± 0%   60.29Mi ± 0%        ~ (p=0.529 n=10)
StoreList/Paginate/RV=Empty-12                       64.29Mi ± 0%   64.05Mi ± 0%   -0.37% (p=0.000 n=10)
StoreList/Paginate/RV=NotOlderThan-12                64.27Mi ± 0%   64.04Mi ± 0%   -0.35% (p=0.000 n=10)
StoreList/Paginate/RV=MatchExact-12                  63.64Mi ± 0%   63.56Mi ± 0%   -0.12% (p=0.000 n=10)
StoreList/NodeIndexed/RV=Empty-12                    7.928Mi ± 1%   7.929Mi ± 1%   +0.01% (p=0.035 n=10)
StoreList/NodeIndexed/RV=NotOlderThan-12             6.328Mi ± 0%   6.328Mi ± 0%        ~ (p=0.739 n=10)
StoreList/NodeIndexed/RV=MatchExact-12               5.785Gi ± 0%   5.782Gi ± 0%        ~ (p=0.315 n=10)
StoreList/NodeIndexed/Paginate/RV=Empty-12           284.0Mi ± 0%   257.0Mi ± 0%   -9.51% (p=0.000 n=10)
StoreList/NodeIndexed/Paginate/RV=NotOlderThan-12    282.2Mi ± 0%   255.3Mi ± 0%   -9.55% (p=0.000 n=10)
StoreList/NodeIndexed/Paginate/RV=MatchExact-12      6.073Gi ± 0%   6.048Gi ± 0%   -0.42% (p=0.000 n=10)
StoreList/Namespace/RV=Empty-12                     23.745Mi ± 0%   8.123Mi ± 0%  -65.79% (p=0.000 n=10)
StoreList/Namespace/RV=NotOlderThan-12              22.143Mi ± 0%   6.521Mi ± 0%  -70.55% (p=0.000 n=10)
StoreList/Namespace/RV=MatchExact-12                 63.39Mi ± 0%   63.32Mi ± 0%   -0.11% (p=0.000 n=10)
geomean                                              33.76Mi        28.64Mi       -15.15%

                                                  │  base.txt   │            btree.txt            │
                                                  │  allocs/op  │  allocs/op   vs base               │
StoreListCreate/RV=NotOlderThan-12                   561.0 ± 0%    559.0 ± 0%  -0.36% (p=0.000 n=10)
StoreListCreate/RV=ExactMatch-12                     874.5 ± 0%    875.0 ± 0%       ~ (p=0.407 n=10)
StoreList/RV=Empty-12                                346.5 ± 1%    345.0 ± 2%       ~ (p=0.236 n=10)
StoreList/RV=NotOlderThan-12                         88.00 ± 1%    87.00 ± 1%  -1.14% (p=0.002 n=10)
StoreList/RV=MatchExact-12                          460.6k ± 0%   456.8k ± 0%  -0.83% (p=0.000 n=10)
StoreList/Paginate/RV=Empty-12                      490.6k ± 0%   486.8k ± 0%  -0.77% (p=0.000 n=10)
StoreList/Paginate/RV=NotOlderThan-12               490.4k ± 0%   486.6k ± 0%  -0.78% (p=0.000 n=10)
StoreList/Paginate/RV=MatchExact-12                 492.9k ± 0%   489.1k ± 0%  -0.77% (p=0.000 n=10)
StoreList/NodeIndexed/RV=Empty-12                   32.08k ± 0%   32.10k ± 0%       ~ (p=0.061 n=10)
StoreList/NodeIndexed/RV=NotOlderThan-12            6.614k ± 1%   6.614k ± 1%       ~ (p=0.642 n=10)
StoreList/NodeIndexed/RV=MatchExact-12              46.96M ± 0%   46.58M ± 0%  -0.81% (p=0.000 n=10)
StoreList/NodeIndexed/Paginate/RV=Empty-12          2.147M ± 0%   1.948M ± 0%  -9.28% (p=0.000 n=10)
StoreList/NodeIndexed/Paginate/RV=NotOlderThan-12   2.119M ± 0%   1.920M ± 0%  -9.38% (p=0.000 n=10)
StoreList/NodeIndexed/Paginate/RV=MatchExact-12     47.93M ± 0%   47.46M ± 0%  -0.97% (p=0.000 n=10)
StoreList/Namespace/RV=Empty-12                     31.84k ± 0%   31.72k ± 0%  -0.39% (p=0.000 n=10)
StoreList/Namespace/RV=NotOlderThan-12              6.314k ± 0%   6.218k ± 0%  -1.53% (p=0.000 n=10)
StoreList/Namespace/RV=MatchExact-12                490.1k ± 0%   486.4k ± 0%  -0.76% (p=0.000 n=10)
geomean                                             78.82k        77.48k       -1.69%

/kind feature

Use btree data structure for watch cache

serathius avatar Aug 17 '24 09:08 serathius

👍 really great work!

there is another https://github.com/kubernetes/kubernetes/pull/126753 seems duplicated? :)

xuzhenglun avatar Aug 17 '24 10:08 xuzhenglun

/retest

serathius avatar Aug 17 '24 15:08 serathius

Oh, found that original PoC had very inefficient implementation of Get (https://github.com/kubernetes/kubernetes/pull/108392/files#diff-7dbc4a86989f847d364c09b2370266542ccbb72edc66d0f346559771b1445e56R200-R211) that scanned whole keyspace. Fixed

serathius avatar Aug 18 '24 07:08 serathius

Tests fails, in last change I must have introduced a bug.

serathius avatar Aug 19 '24 09:08 serathius

/triage accepted

fedebongio avatar Aug 22 '24 20:08 fedebongio

ready to review, ping @deads2k @jpbetz

serathius avatar Aug 25 '24 19:08 serathius

ping @deads2k

serathius avatar Aug 26 '24 10:08 serathius

I'd really like to see if @wojtek-t can review changes to the watch cache.

It's on my list and I definitely would like to review this PR - just didn't yet get to it.

wojtek-t avatar Aug 27 '24 10:08 wojtek-t

Just storage implementation using btree and tests. Removed integration to watch cache to avoid too many approvals on feature flags.

serathius avatar Oct 24 '24 17:10 serathius

Got to agreement about the indexer code. PTAL @jpbetz @deads2k @wojtek-t

serathius avatar Oct 25 '24 07:10 serathius

/lgtm /approve

wojtek-t avatar Oct 29 '24 11:10 wojtek-t

LGTM label has been added.

Git tree hash: 06b7f5e845b72661a9035b289a0ab22bc2425868

k8s-ci-robot avatar Oct 29 '24 11:10 k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: serathius, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Oct 29 '24 11:10 k8s-ci-robot

Fixed typo caught by verify.

serathius avatar Oct 29 '24 12:10 serathius

/lgtm

wojtek-t avatar Oct 29 '24 12:10 wojtek-t

LGTM label has been added.

Git tree hash: 2b01e9e4c7f1726b96bee7c10dbc0ba07066a988

k8s-ci-robot avatar Oct 29 '24 12:10 k8s-ci-robot

@serathius can we reflect what you mentioned in an earlier comment in the release notes? It currently says Use btree data structure for watch cache but should it be New implementation of watch cache using btree data structure. Implementation is not enabled yet. (or something equivalent?).

PS: probably need to do the same for the PR title too.

dims avatar Oct 29 '24 13:10 dims

@dims Done

serathius avatar Oct 29 '24 15:10 serathius

thanks @serathius ! will avoid unnecessary questions :)

dims avatar Oct 29 '24 16:10 dims

Changelog suggestion

-New implementation of watch cache using btree data structure. Implementation is not enabled yet.

If it's not enabled, I wouldn't changelog it.

sftim avatar Oct 31 '24 19:10 sftim