vitess
dynamic pool implementation
Description
Related Issue(s)
- https://github.com/vitessio/vitess/issues/9706
Checklist
- [ ] "Backport me!" label has been added if this change should be backported
- [ ] Tests were added or are not required
- [ ] Documentation was added or is not required
Deployment Notes
Review Checklist
Hello reviewers! :wave: Please follow this checklist when reviewing this Pull Request.
General
- [ ] Ensure that the Pull Request has a descriptive title.
- [ ] If this is a change that users need to know about, please apply the `release notes (needs details)` label so that merging is blocked unless the summary release notes document is included.
- [ ] If a new flag is being introduced, review whether it is really needed. The flag names should be clear and intuitive (as far as possible), and the flag's help should be descriptive.
- [ ] If a workflow is added or modified, each item in `Jobs` should be named in order to mark it as `required`. If the workflow should be required, the GitHub Admin should be notified.
Bug fixes
- [ ] There should be at least one unit or end-to-end test.
- [ ] The Pull Request description should either include a link to an issue that describes the bug OR an actual description of the bug and how to reproduce, along with a description of the fix.
Non-trivial changes
- [ ] There should be some code comments as to why things are implemented the way they are.
New/Existing features
- [ ] Should be documented, either by modifying the existing documentation or creating new documentation.
- [ ] New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.
Backward compatibility
- [ ] Protobuf changes should be wire-compatible.
- [ ] Changes to `_vt` tables and RPCs need to be backward compatible.
- [ ] `vtctl` command output order should be stable and `awk`-able.
I haven't looked in depth at the actual new implementation. I thought it would be much more interesting to look at the benchmarks before reviewing it... But the benchmarks were giving really flat results! This is because the old benchmarks were not properly designed, so they didn't show any actionable information. I've pushed a commit that improves the benchmarks.
The old benchmarks were testing lots of different pool sizes (i.e. "capacities"), but under a very weird fixed-concurrency pattern. I've updated them to test several permutations of concurrency vs. capacity, and to use Go's built-in parallel benchmark facility, which handles this kind of parallel test much more effectively.
We're now testing, for the two pool implementations, capacities of 8, 64 and 512. We could test more capacities, but the results would be redundant. What we care about when benchmarking the pool's capacity is: how the pool performs when there are rarely enough resources to fetch (8), how it performs when the number of available resources is similar to the concurrency we're experiencing (64), and how it performs when there are always enough resources (512).
Then, for these 3 base cases, we test with several parallelism settings (i.e. the number of concurrent goroutines fetching and returning objects from the pool).
While I was at it, I also fixed the memory reporting, and I changed the old implementation to not pre-fill the pool in benchmarks, so its behavior matches the new one's.
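For reference, here is a minimal sketch of the benchmark shape described above (the pool type and its constructor are placeholders to keep the sketch self-contained, not the actual code in `go/pools`):

```go
package pools_test

import (
	"fmt"
	"testing"
)

// toyPool is a stand-in for the real static/dynamic pools benchmarked in
// this PR; it only exists to make the sketch runnable on its own.
type toyPool struct{ ch chan int }

func newToyPool(capacity int) *toyPool {
	p := &toyPool{ch: make(chan int, capacity)}
	for i := 0; i < capacity; i++ {
		p.ch <- i
	}
	return p
}

func (p *toyPool) Get() int  { return <-p.ch }
func (p *toyPool) Put(r int) { p.ch <- r }

// BenchmarkGetPutSketch mirrors the structure of the updated BenchmarkGetPut:
// a capacity x parallelism matrix driven by Go's built-in b.RunParallel.
func BenchmarkGetPutSketch(b *testing.B) {
	for _, capacity := range []int{8, 64, 512} {
		for _, parallelism := range []int{1, 4, 8, 32, 128} {
			pool := newToyPool(capacity)
			b.Run(fmt.Sprintf("x%d-cap%d", parallelism, capacity), func(b *testing.B) {
				b.ReportAllocs()              // report B/op and allocs/op
				b.SetParallelism(parallelism) // parallelism * GOMAXPROCS goroutines
				b.RunParallel(func(pb *testing.PB) {
					for pb.Next() {
						r := pool.Get()
						pool.Put(r)
					}
				})
			})
		}
	}
}
```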
Here are the results when run on my machine:
goos: linux
goarch: amd64
pkg: vitess.io/vitess/go/pools
cpu: AMD Ryzen 7 2700X Eight-Core Processor
BenchmarkGetPut
BenchmarkGetPut/static/x1-cap8
BenchmarkGetPut/static/x1-cap8-16 1000000 1159 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x1-cap8
BenchmarkGetPut/dynamic/x1-cap8-16 1563823 746.5 ns/op 89 B/op 1 allocs/op
BenchmarkGetPut/static/x4-cap8
BenchmarkGetPut/static/x4-cap8-16 1027155 1113 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x4-cap8
BenchmarkGetPut/dynamic/x4-cap8-16 940041 2010 ns/op 161 B/op 2 allocs/op
BenchmarkGetPut/static/x8-cap8
BenchmarkGetPut/static/x8-cap8-16 1098236 1093 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x8-cap8
BenchmarkGetPut/dynamic/x8-cap8-16 953550 2142 ns/op 162 B/op 2 allocs/op
BenchmarkGetPut/static/x32-cap8
BenchmarkGetPut/static/x32-cap8-16 1061026 1001 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x32-cap8
BenchmarkGetPut/dynamic/x32-cap8-16 773541 2252 ns/op 162 B/op 2 allocs/op
BenchmarkGetPut/static/x128-cap8
BenchmarkGetPut/static/x128-cap8-16 981444 1097 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x128-cap8
BenchmarkGetPut/dynamic/x128-cap8-16 552878 2287 ns/op 163 B/op 2 allocs/op
BenchmarkGetPut/static/x1-cap64
BenchmarkGetPut/static/x1-cap64-16 1690519 700.2 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x1-cap64
BenchmarkGetPut/dynamic/x1-cap64-16 2596731 489.8 ns/op 64 B/op 1 allocs/op
BenchmarkGetPut/static/x4-cap64
BenchmarkGetPut/static/x4-cap64-16 1689446 709.2 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x4-cap64
BenchmarkGetPut/dynamic/x4-cap64-16 2371755 558.9 ns/op 64 B/op 1 allocs/op
BenchmarkGetPut/static/x8-cap64
BenchmarkGetPut/static/x8-cap64-16 1689732 852.9 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x8-cap64
BenchmarkGetPut/dynamic/x8-cap64-16 1400059 865.9 ns/op 81 B/op 1 allocs/op
BenchmarkGetPut/static/x32-cap64
BenchmarkGetPut/static/x32-cap64-16 1000000 1083 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x32-cap64
BenchmarkGetPut/dynamic/x32-cap64-16 858646 2441 ns/op 161 B/op 2 allocs/op
BenchmarkGetPut/static/x128-cap64
BenchmarkGetPut/static/x128-cap64-16 1000000 1151 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x128-cap64
BenchmarkGetPut/dynamic/x128-cap64-16 695175 2446 ns/op 162 B/op 2 allocs/op
BenchmarkGetPut/static/x1-cap512
BenchmarkGetPut/static/x1-cap512-16 1488093 797.0 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x1-cap512
BenchmarkGetPut/dynamic/x1-cap512-16 2676828 479.0 ns/op 64 B/op 1 allocs/op
BenchmarkGetPut/static/x4-cap512
BenchmarkGetPut/static/x4-cap512-16 1657023 715.9 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x4-cap512
BenchmarkGetPut/dynamic/x4-cap512-16 2294755 561.7 ns/op 64 B/op 1 allocs/op
BenchmarkGetPut/static/x8-cap512
BenchmarkGetPut/static/x8-cap512-16 1489998 800.5 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x8-cap512
BenchmarkGetPut/dynamic/x8-cap512-16 2527941 565.4 ns/op 64 B/op 1 allocs/op
BenchmarkGetPut/static/x32-cap512
BenchmarkGetPut/static/x32-cap512-16 1668606 710.1 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x32-cap512
BenchmarkGetPut/dynamic/x32-cap512-16 2302455 583.3 ns/op 64 B/op 1 allocs/op
BenchmarkGetPut/static/x128-cap512
BenchmarkGetPut/static/x128-cap512-16 1631232 813.2 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x128-cap512
BenchmarkGetPut/dynamic/x128-cap512-16 1000000 2634 ns/op 159 B/op 1 allocs/op
PASS
Some insights you can now get from these:
- The new pool implementation always allocates memory, even in cases where resources are uncontended! That's not good. The fast path should never allocate memory. We should start by looking into that.
- The "serial fast path" (i.e. fetching and returning resources from a single goroutine) is always faster in the new implementation, which is... not very useful, because the pool won't behave like that in production.
- As soon as we introduce resource contention (i.e. more goroutines than available resources in the pool), the new implementation behaves worse than the old one, regardless of concurrency.
- However, when the resources are uncontended (i.e. there are always more resources in the pool than goroutines fighting for them), the new pool beats the old one consistently, regardless of concurrency.
In summary: the new pool always allocates (which is important to fix), and it behaves poorly when starved for resources; it's hard to guesstimate how often that happens in real-life VTTablet instances. My assumption is that it's not rare to have more incoming requests than available MySQL connections, so I think the new implementation needs to improve in this particular case. Ideally, we'd see the pool being faster both when starved and when not starved!
Thank you @vmg for fixing the benchmarks and reporting them here.
The allocation being shown comes from the Put calls to the pool.
Currently the resource object returned from the pool, i.e. the `Resource` interface, is different from the struct stored in the pool, `ResourceWrapper`. This causes an allocation to happen on every Put.
To avoid this overhead we would need to change the pool methods to return the struct. I did not do this as a first step in the implementation, but since it shows up in the benchmarks, I believe this should be done next.
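For illustration, here is a rough sketch (hypothetical field and method shapes, not the exact code in this PR) of the pattern being described, where Put rebuilds a wrapper around the returned `Resource`:

```go
package pools

import (
	"sync"
	"time"
)

// Resource is the interface handed out to callers; resourceWrapper is the
// struct the pool keeps internally. Rebuilding the wrapper as a pointer on
// every Put costs one heap allocation per call.
type Resource interface {
	Close()
}

type resourceWrapper struct {
	resource Resource
	timeUsed time.Time
}

type pool struct {
	mu            sync.Mutex
	rawConnection []*resourceWrapper
}

func (p *pool) Put(r Resource) {
	w := &resourceWrapper{resource: r, timeUsed: time.Now()} // escapes to the heap
	p.mu.Lock()
	p.rawConnection = append(p.rawConnection, w)
	p.mu.Unlock()
}
```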
@harshit-gangal: the original pool also uses resourceWrapper for all connections, but it never allocates it on the heap because it's very careful to pass it always by value. You don't need to return the resourceWrapper to prevent a heap allocation, just think through where it's being used and how it can be kept off the heap.
Other than that, Put appends to the slice `rawConnection []*resourceWrapper`, which can be the cause of the heap allocation.
It seems like that should be a `[]resourceWrapper`, without the pointer. That's where the allocation is coming from.
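Something along these lines (again a sketch, reusing the hypothetical `Resource` and `resourceWrapper` types from the sketch above, not the actual diff) keeps the wrapper off the heap:

```go
// Store wrappers by value: append copies the struct into the slice's backing
// array, so there is no per-Put heap allocation (amortized slice growth
// aside), matching the old pool's 0 allocs/op behavior.
type pool struct {
	mu            sync.Mutex
	rawConnection []resourceWrapper // values, not pointers
}

func (p *pool) Put(r Resource) {
	p.mu.Lock()
	p.rawConnection = append(p.rawConnection, resourceWrapper{
		resource: r,
		timeUsed: time.Now(),
	})
	p.mu.Unlock()
}

func (p *pool) Get() (Resource, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if n := len(p.rawConnection); n > 0 {
		w := p.rawConnection[n-1] // copied out by value
		p.rawConnection = p.rawConnection[:n-1]
		return w.resource, true
	}
	return nil, false
}
```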
That surely improved some of the benchmarks:
goos: linux
goarch: amd64
pkg: vitess.io/vitess/go/pools
cpu: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
BenchmarkGetPut/static/x1-cap8-8 4412192 248.5 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x1-cap8-8 8340823 141.8 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x4-cap8-8 2049496 587.8 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x4-cap8-8 1304605 919.9 ns/op 162 B/op 2 allocs/op
BenchmarkGetPut/static/x8-cap8-8 2160702 574.7 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x8-cap8-8 1301724 927.2 ns/op 162 B/op 2 allocs/op
BenchmarkGetPut/static/x32-cap8-8 2136961 589.2 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x32-cap8-8 1271604 946.2 ns/op 162 B/op 2 allocs/op
BenchmarkGetPut/static/x128-cap8-8 2077538 573.9 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x128-cap8-8 1239555 960.5 ns/op 162 B/op 2 allocs/op
BenchmarkGetPut/static/x1-cap64-8 4781880 249.1 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x1-cap64-8 8263513 139.0 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x4-cap64-8 4787172 250.6 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x4-cap64-8 6452523 184.3 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x8-cap64-8 4498858 248.8 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x8-cap64-8 4751376 258.4 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x32-cap64-8 4800966 248.1 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x32-cap64-8 1000000 1000 ns/op 161 B/op 1 allocs/op
BenchmarkGetPut/static/x128-cap64-8 4821624 518.3 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x128-cap64-8 1000000 1012 ns/op 162 B/op 2 allocs/op
BenchmarkGetPut/static/x1-cap512-8 3742974 321.8 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x1-cap512-8 8500825 146.9 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x4-cap512-8 3428592 341.5 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x4-cap512-8 5363811 210.2 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x8-cap512-8 3660873 341.5 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x8-cap512-8 3887805 298.7 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x32-cap512-8 3515707 342.6 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x32-cap512-8 2413462 495.2 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x128-cap512-8 3480156 351.2 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x128-cap512-8 1000000 1120 ns/op 135 B/op 1 allocs/op
PASS
ok vitess.io/vitess/go/pools 49.667s
Note that right now all the remaining benchmarks that allocate are significantly slower than their old static counterparts. You need to fix those too! Once they don't allocate they will have very similar performance. :+1:
Benchmarks after the last commit, which changes the waiting connection request logic to use a common connection receiver channel. A rough sketch of the idea is below, followed by the results.
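The sketch (hypothetical names, reusing the types from the earlier sketches; cancellation, timeouts and capacity accounting are omitted for brevity):

```go
// Put hands a returned wrapper straight to a waiting Get through one shared
// buffered channel instead of allocating a per-waiter channel.
type dynamicPool struct {
	mu       sync.Mutex
	idle     []resourceWrapper
	waiters  int
	receiver chan resourceWrapper // shared handoff channel
}

func newDynamicPool(maxCap int) *dynamicPool {
	return &dynamicPool{
		// The buffer bounds how many handoffs can be in flight; the follow-up
		// below experiments with sizing it at 4x the maximum capacity.
		receiver: make(chan resourceWrapper, 4*maxCap),
	}
}

func (p *dynamicPool) Get() Resource {
	p.mu.Lock()
	if n := len(p.idle); n > 0 {
		w := p.idle[n-1]
		p.idle = p.idle[:n-1]
		p.mu.Unlock()
		return w.resource
	}
	p.waiters++
	p.mu.Unlock()
	w := <-p.receiver // direct handoff from a future Put
	return w.resource
}

func (p *dynamicPool) Put(r Resource) {
	w := resourceWrapper{resource: r, timeUsed: time.Now()}
	p.mu.Lock()
	if p.waiters > 0 {
		p.waiters--
		p.mu.Unlock()
		p.receiver <- w // wake one waiter
		return
	}
	p.idle = append(p.idle, w)
	p.mu.Unlock()
}
```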
goos: linux
goarch: amd64
pkg: vitess.io/vitess/go/pools
cpu: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
BenchmarkGetPut/static/x1-cap8-8 4396682 249.1 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x1-cap8-8 7383310 148.1 ns/op 7 B/op 0 allocs/op
BenchmarkGetPut/static/x4-cap8-8 1976714 603.2 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x4-cap8-8 1382004 851.3 ns/op 46 B/op 0 allocs/op
BenchmarkGetPut/static/x8-cap8-8 2023770 589.6 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x8-cap8-8 1403683 843.2 ns/op 43 B/op 0 allocs/op
BenchmarkGetPut/static/x32-cap8-8 2053064 571.8 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x32-cap8-8 1402099 845.6 ns/op 43 B/op 0 allocs/op
BenchmarkGetPut/static/x128-cap8-8 2159823 589.0 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x128-cap8-8 1371255 878.6 ns/op 44 B/op 0 allocs/op
BenchmarkGetPut/static/x1-cap64-8 4461562 249.0 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x1-cap64-8 7801064 146.4 ns/op 7 B/op 0 allocs/op
BenchmarkGetPut/static/x4-cap64-8 4456218 248.1 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x4-cap64-8 6059944 190.2 ns/op 9 B/op 0 allocs/op
BenchmarkGetPut/static/x8-cap64-8 4450134 249.9 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x8-cap64-8 4528390 260.6 ns/op 12 B/op 0 allocs/op
BenchmarkGetPut/static/x32-cap64-8 4471232 250.3 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x32-cap64-8 1338823 889.8 ns/op 47 B/op 0 allocs/op
BenchmarkGetPut/static/x128-cap64-8 4815217 501.2 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x128-cap64-8 1290224 914.8 ns/op 46 B/op 0 allocs/op
BenchmarkGetPut/static/x1-cap512-8 4137897 289.8 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x1-cap512-8 7847911 145.0 ns/op 7 B/op 0 allocs/op
BenchmarkGetPut/static/x4-cap512-8 4126407 327.5 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x4-cap512-8 5106384 213.4 ns/op 10 B/op 0 allocs/op
BenchmarkGetPut/static/x8-cap512-8 3619280 371.7 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x8-cap512-8 3771500 298.5 ns/op 14 B/op 0 allocs/op
BenchmarkGetPut/static/x32-cap512-8 3516030 345.1 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x32-cap512-8 2419420 486.9 ns/op 23 B/op 0 allocs/op
BenchmarkGetPut/static/x128-cap512-8 3407198 386.8 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x128-cap512-8 1127898 1049 ns/op 51 B/op 0 allocs/op
PASS
After reducing the common channel's buffer size to 4 times the maximum capacity:
goos: linux
goarch: amd64
pkg: vitess.io/vitess/go/pools
cpu: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
BenchmarkGetPut/static/x1-cap8-8 4172858 251.9 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x1-cap8-8 8175582 146.3 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x4-cap8-8 2073106 575.8 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x4-cap8-8 1421370 839.3 ns/op 6 B/op 0 allocs/op
BenchmarkGetPut/static/x8-cap8-8 2131052 570.5 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x8-cap8-8 1429699 841.8 ns/op 3 B/op 0 allocs/op
BenchmarkGetPut/static/x32-cap8-8 2173950 568.1 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x32-cap8-8 1423432 840.4 ns/op 3 B/op 0 allocs/op
BenchmarkGetPut/static/x128-cap8-8 2084209 585.3 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x128-cap8-8 1382991 858.7 ns/op 3 B/op 0 allocs/op
BenchmarkGetPut/static/x1-cap64-8 4837154 248.3 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x1-cap64-8 8223388 145.5 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x4-cap64-8 4842416 268.0 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x4-cap64-8 6349221 189.7 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x8-cap64-8 4840467 267.3 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x8-cap64-8 4507173 263.5 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x32-cap64-8 4520050 268.7 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x32-cap64-8 1351778 893.3 ns/op 5 B/op 0 allocs/op
BenchmarkGetPut/static/x128-cap64-8 4799488 560.4 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x128-cap64-8 1000000 1082 ns/op 3 B/op 0 allocs/op
BenchmarkGetPut/static/x1-cap512-8 3203734 352.8 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x1-cap512-8 6901803 154.4 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x4-cap512-8 3792057 349.2 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x4-cap512-8 5348278 210.2 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x8-cap512-8 3303650 379.9 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x8-cap512-8 3957998 302.3 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x32-cap512-8 3486044 380.2 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x32-cap512-8 2480737 491.5 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x128-cap512-8 3116094 376.9 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x128-cap512-8 1000000 1268 ns/op 2 B/op 0 allocs/op
PASS
- The new pool implementation, based on the `database/sql` design from the Go standard library, is efficient when the number of threads pounding the pool is fewer than 32.
- The old pool implementation, being lock-free, does not block threads on a mutex and hence performs better at higher thread counts.
- The modified old pool implementation (learnings from the new pool applied to the old pool) does not give a significant improvement on any front.
Here are two things to do next:
- Come up with a lock-free slice implementation which performs better at high parallelism (a rough sketch of this direction follows the list). I tried the library from https://github.com/golang-design/lockfree; it improved performance for higher thread counts but was still not close enough to the old pool.
- In the meantime, I plan to modify the old pool implementation and add a connection holder for connections which have different settings applied. The reason is that the old pool has been tested in production, whereas the new pool has not.
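As referenced in the first item, a rough sketch of the lock-free direction: a Treiber-style stack used as a free list. It is illustrative only (not the PR's code or the golang-design/lockfree API); it allocates a node per Push and leans on Go's GC to avoid the ABA problem, which is part of why a production-quality lock-free pool is nontrivial.

```go
package pools

import (
	"sync/atomic"
	"unsafe"
)

// node wraps a resourceWrapper (from the earlier sketches) in an intrusive
// singly linked list.
type node struct {
	value resourceWrapper
	next  unsafe.Pointer // *node
}

// lockFreeStack is a Treiber stack: Push/Pop loop on CompareAndSwapPointer
// instead of taking a mutex.
type lockFreeStack struct {
	head unsafe.Pointer // *node
}

func (s *lockFreeStack) Push(w resourceWrapper) {
	n := &node{value: w} // fresh node per Push: nodes are never reused, so GC prevents ABA
	for {
		old := atomic.LoadPointer(&s.head)
		n.next = old
		if atomic.CompareAndSwapPointer(&s.head, old, unsafe.Pointer(n)) {
			return
		}
	}
}

func (s *lockFreeStack) Pop() (resourceWrapper, bool) {
	for {
		old := atomic.LoadPointer(&s.head)
		if old == nil {
			return resourceWrapper{}, false
		}
		n := (*node)(old)
		if atomic.CompareAndSwapPointer(&s.head, old, n.next) {
			return n.value, true
		}
	}
}
```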
This is not being actively developed. Closing for now.