vitess
dynamic pool implementation
Description
Related Issue(s)
- https://github.com/vitessio/vitess/issues/9706
Checklist
- [ ] "Backport me!" label has been added if this change should be backported
- [ ] Tests were added or are not required
- [ ] Documentation was added or is not required
Deployment Notes
Review Checklist
Hello reviewers! :wave: Please follow this checklist when reviewing this Pull Request.
General
- [ ] Ensure that the Pull Request has a descriptive title.
- [ ] If this is a change that users need to know about, please apply the `release notes (needs details)` label so that merging is blocked unless the summary release notes document is included.
- [ ] If a new flag is being introduced, review whether it is really needed. The flag names should be clear and intuitive (as far as possible), and the flag's help should be descriptive.
- [ ] If a workflow is added or modified, each item in `Jobs` should be named in order to mark it as `required`. If the workflow should be required, the GitHub Admin should be notified.
Bug fixes
- [ ] There should be at least one unit or end-to-end test.
- [ ] The Pull Request description should either include a link to an issue that describes the bug OR an actual description of the bug and how to reproduce, along with a description of the fix.
Non-trivial changes
- [ ] There should be some code comments as to why things are implemented the way they are.
New/Existing features
- [ ] Should be documented, either by modifying the existing documentation or creating new documentation.
- [ ] New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.
Backward compatibility
- [ ] Protobuf changes should be wire-compatible.
- [ ] Changes to `_vt` tables and RPCs need to be backward compatible.
- [ ] `vtctl` command output order should be stable and `awk`-able.
I haven't looked in depth at the actual new implementation. I thought it would be much more interesting to look at the benchmarks before reviewing it... But the benchmarks were giving really flat results! This is because the old benchmarks were not properly designed, so they didn't show any actionable information. I've pushed a commit that improves the benchmarks.
The old benchmarks were testing lots of different pool sizes (i.e. "capacities"), but under a very weird fixed-concurrency pattern. I've updated them to test several permutations of concurrency vs. capacity, and to use Go's built-in parallel benchmark facility, which handles this kind of parallel test much more effectively.
We're now testing, for the two pool implementations, capacities of 8, 64 and 512. We could test more capacities, but the results would be redundant. What we care about when benchmarking the pool's capacity is: how the pool performs when there are rarely enough resources to fetch (8), how it performs when the number of available resources is similar to the concurrency we're experiencing (64), and how it performs when there are always enough resources (512).
Then, for these 3 base cases, we test with several parallelism settings (i.e. the number of concurrent goroutines fetching and returning objects from the pool).
While I was at it, I also fixed the memory reporting, and I changed the old implementation to not pre-fill the pool in benchmarks, so its behavior matches the new one's.
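For reference, here is a minimal sketch of the benchmark shape described above (the pool type and its constructor are placeholders to keep the sketch self-contained, not the actual code in `go/pools`):

```go
package pools_test

import (
	"fmt"
	"testing"
)

// toyPool is a stand-in for the real static/dynamic pools benchmarked in
// this PR; it only exists to make the sketch runnable on its own.
type toyPool struct{ ch chan int }

func newToyPool(capacity int) *toyPool {
	p := &toyPool{ch: make(chan int, capacity)}
	for i := 0; i < capacity; i++ {
		p.ch <- i
	}
	return p
}

func (p *toyPool) Get() int  { return <-p.ch }
func (p *toyPool) Put(r int) { p.ch <- r }

// BenchmarkGetPutSketch mirrors the structure of the updated BenchmarkGetPut:
// a capacity x parallelism matrix driven by Go's built-in b.RunParallel.
func BenchmarkGetPutSketch(b *testing.B) {
	for _, capacity := range []int{8, 64, 512} {
		for _, parallelism := range []int{1, 4, 8, 32, 128} {
			pool := newToyPool(capacity)
			b.Run(fmt.Sprintf("x%d-cap%d", parallelism, capacity), func(b *testing.B) {
				b.ReportAllocs()              // report B/op and allocs/op
				b.SetParallelism(parallelism) // parallelism * GOMAXPROCS goroutines
				b.RunParallel(func(pb *testing.PB) {
					for pb.Next() {
						r := pool.Get()
						pool.Put(r)
					}
				})
			})
		}
	}
}
```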
Here are the results when run on my machine:
goos: linux
goarch: amd64
pkg: vitess.io/vitess/go/pools
cpu: AMD Ryzen 7 2700X Eight-Core Processor
BenchmarkGetPut
BenchmarkGetPut/static/x1-cap8
BenchmarkGetPut/static/x1-cap8-16 1000000 1159 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x1-cap8
BenchmarkGetPut/dynamic/x1-cap8-16 1563823 746.5 ns/op 89 B/op 1 allocs/op
BenchmarkGetPut/static/x4-cap8
BenchmarkGetPut/static/x4-cap8-16 1027155 1113 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x4-cap8
BenchmarkGetPut/dynamic/x4-cap8-16 940041 2010 ns/op 161 B/op 2 allocs/op
BenchmarkGetPut/static/x8-cap8
BenchmarkGetPut/static/x8-cap8-16 1098236 1093 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x8-cap8
BenchmarkGetPut/dynamic/x8-cap8-16 953550 2142 ns/op 162 B/op 2 allocs/op
BenchmarkGetPut/static/x32-cap8
BenchmarkGetPut/static/x32-cap8-16 1061026 1001 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x32-cap8
BenchmarkGetPut/dynamic/x32-cap8-16 773541 2252 ns/op 162 B/op 2 allocs/op
BenchmarkGetPut/static/x128-cap8
BenchmarkGetPut/static/x128-cap8-16 981444 1097 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x128-cap8
BenchmarkGetPut/dynamic/x128-cap8-16 552878 2287 ns/op 163 B/op 2 allocs/op
BenchmarkGetPut/static/x1-cap64
BenchmarkGetPut/static/x1-cap64-16 1690519 700.2 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x1-cap64
BenchmarkGetPut/dynamic/x1-cap64-16 2596731 489.8 ns/op 64 B/op 1 allocs/op
BenchmarkGetPut/static/x4-cap64
BenchmarkGetPut/static/x4-cap64-16 1689446 709.2 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x4-cap64
BenchmarkGetPut/dynamic/x4-cap64-16 2371755 558.9 ns/op 64 B/op 1 allocs/op
BenchmarkGetPut/static/x8-cap64
BenchmarkGetPut/static/x8-cap64-16 1689732 852.9 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x8-cap64
BenchmarkGetPut/dynamic/x8-cap64-16 1400059 865.9 ns/op 81 B/op 1 allocs/op
BenchmarkGetPut/static/x32-cap64
BenchmarkGetPut/static/x32-cap64-16 1000000 1083 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x32-cap64
BenchmarkGetPut/dynamic/x32-cap64-16 858646 2441 ns/op 161 B/op 2 allocs/op
BenchmarkGetPut/static/x128-cap64
BenchmarkGetPut/static/x128-cap64-16 1000000 1151 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x128-cap64
BenchmarkGetPut/dynamic/x128-cap64-16 695175 2446 ns/op 162 B/op 2 allocs/op
BenchmarkGetPut/static/x1-cap512
BenchmarkGetPut/static/x1-cap512-16 1488093 797.0 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x1-cap512
BenchmarkGetPut/dynamic/x1-cap512-16 2676828 479.0 ns/op 64 B/op 1 allocs/op
BenchmarkGetPut/static/x4-cap512
BenchmarkGetPut/static/x4-cap512-16 1657023 715.9 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x4-cap512
BenchmarkGetPut/dynamic/x4-cap512-16 2294755 561.7 ns/op 64 B/op 1 allocs/op
BenchmarkGetPut/static/x8-cap512
BenchmarkGetPut/static/x8-cap512-16 1489998 800.5 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x8-cap512
BenchmarkGetPut/dynamic/x8-cap512-16 2527941 565.4 ns/op 64 B/op 1 allocs/op
BenchmarkGetPut/static/x32-cap512
BenchmarkGetPut/static/x32-cap512-16 1668606 710.1 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x32-cap512
BenchmarkGetPut/dynamic/x32-cap512-16 2302455 583.3 ns/op 64 B/op 1 allocs/op
BenchmarkGetPut/static/x128-cap512
BenchmarkGetPut/static/x128-cap512-16 1631232 813.2 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x128-cap512
BenchmarkGetPut/dynamic/x128-cap512-16 1000000 2634 ns/op 159 B/op 1 allocs/op
PASS
Some insights you can now get from these:
- The new pool implementation always allocates memory, even in cases where resources are uncontended! That's not good. The fast path should never allocate memory. We should start by looking into that.
- The "serial fast path" (i.e. fetching and returning resources from a single goroutine) is always faster in the new implementation, which is... not very useful, because the pool won't behave like that in production.
- As soon as we introduce resource contention (i.e. more goroutines than available resources in the pool), the new implementation behaves worse than the old one, regardless of concurrency.
- However, when the resources are uncontended (i.e. there are always more resources in the pool than goroutines fighting for them), the new pool beats the old one consistently, regardless of concurrency.
In summary: the new pool always allocates (which is important to fix), and it behaves poorly when starved for resources; it's hard to guesstimate how often that happens in real-life VTTablet instances. My assumption is that it's not rare to have more incoming requests than available MySQL connections, so I think the new implementation needs to improve in this particular case. Ideally, we'd see the pool being faster both when starved and when not starved!
Thank you @vmg for fixing the benchmarks and reporting them here.
The allocation being shown comes from the Put calls to the pool.
Currently the resource object returned from the pool, i.e. the `Resource` interface, is different from the struct stored in the pool, `ResourceWrapper`. This causes an allocation to happen on every Put.
To avoid this overhead we would need to change the pool methods to return the struct. I did not do this as a first step in the implementation, but since it shows up in the benchmarks, I believe this should be done next.
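For illustration, here is a rough sketch (hypothetical field and method shapes, not the exact code in this PR) of the pattern being described, where Put rebuilds a wrapper around the returned `Resource`:

```go
package pools

import (
	"sync"
	"time"
)

// Resource is the interface handed out to callers; resourceWrapper is the
// struct the pool keeps internally. Rebuilding the wrapper as a pointer on
// every Put costs one heap allocation per call.
type Resource interface {
	Close()
}

type resourceWrapper struct {
	resource Resource
	timeUsed time.Time
}

type pool struct {
	mu            sync.Mutex
	rawConnection []*resourceWrapper
}

func (p *pool) Put(r Resource) {
	w := &resourceWrapper{resource: r, timeUsed: time.Now()} // escapes to the heap
	p.mu.Lock()
	p.rawConnection = append(p.rawConnection, w)
	p.mu.Unlock()
}
```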
@harshit-gangal: the original pool also uses resourceWrapper for all connections, but it never allocates it on the heap because it's very careful to pass it always by value. You don't need to return the resourceWrapper to prevent a heap allocation, just think through where it's being used and how it can be kept off the heap.
Other than that, Put appends to the slice `rawConnection []*resourceWrapper`, which can be the cause of the heap allocation.
It seems like that should be a `[]resourceWrapper`, without the pointer. That's where the allocation is coming from.
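Something along these lines (again a sketch, reusing the hypothetical `Resource` and `resourceWrapper` types from the sketch above, not the actual diff) keeps the wrapper off the heap:

```go
// Store wrappers by value: append copies the struct into the slice's backing
// array, so there is no per-Put heap allocation (amortized slice growth
// aside), matching the old pool's 0 allocs/op behavior.
type pool struct {
	mu            sync.Mutex
	rawConnection []resourceWrapper // values, not pointers
}

func (p *pool) Put(r Resource) {
	p.mu.Lock()
	p.rawConnection = append(p.rawConnection, resourceWrapper{
		resource: r,
		timeUsed: time.Now(),
	})
	p.mu.Unlock()
}

func (p *pool) Get() (Resource, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if n := len(p.rawConnection); n > 0 {
		w := p.rawConnection[n-1] // copied out by value
		p.rawConnection = p.rawConnection[:n-1]
		return w.resource, true
	}
	return nil, false
}
```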
That surely improved some of the benchmarks:
goos: linux
goarch: amd64
pkg: vitess.io/vitess/go/pools
cpu: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
BenchmarkGetPut/static/x1-cap8-8 4412192 248.5 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x1-cap8-8 8340823 141.8 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x4-cap8-8 2049496 587.8 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x4-cap8-8 1304605 919.9 ns/op 162 B/op 2 allocs/op
BenchmarkGetPut/static/x8-cap8-8 2160702 574.7 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x8-cap8-8 1301724 927.2 ns/op 162 B/op 2 allocs/op
BenchmarkGetPut/static/x32-cap8-8 2136961 589.2 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x32-cap8-8 1271604 946.2 ns/op 162 B/op 2 allocs/op
BenchmarkGetPut/static/x128-cap8-8 2077538 573.9 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x128-cap8-8 1239555 960.5 ns/op 162 B/op 2 allocs/op
BenchmarkGetPut/static/x1-cap64-8 4781880 249.1 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x1-cap64-8 8263513 139.0 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x4-cap64-8 4787172 250.6 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x4-cap64-8 6452523 184.3 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x8-cap64-8 4498858 248.8 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x8-cap64-8 4751376 258.4 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x32-cap64-8 4800966 248.1 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x32-cap64-8 1000000 1000 ns/op 161 B/op 1 allocs/op
BenchmarkGetPut/static/x128-cap64-8 4821624 518.3 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x128-cap64-8 1000000 1012 ns/op 162 B/op 2 allocs/op
BenchmarkGetPut/static/x1-cap512-8 3742974 321.8 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x1-cap512-8 8500825 146.9 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x4-cap512-8 3428592 341.5 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x4-cap512-8 5363811 210.2 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x8-cap512-8 3660873 341.5 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x8-cap512-8 3887805 298.7 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x32-cap512-8 3515707 342.6 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x32-cap512-8 2413462 495.2 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x128-cap512-8 3480156 351.2 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x128-cap512-8 1000000 1120 ns/op 135 B/op 1 allocs/op
PASS
ok vitess.io/vitess/go/pools 49.667s
Note that right now all the remaining benchmarks that allocate are significantly slower than their old static counterparts. You need to fix those too! Once they don't allocate they will have very similar performance. :+1:
Benchmarks after the last commit, which changes the waiting connection request logic to use a common connection receiver channel. A rough sketch of the idea is below, followed by the results.
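The sketch (hypothetical names, reusing the types from the earlier sketches; cancellation, timeouts and capacity accounting are omitted for brevity):

```go
// Put hands a returned wrapper straight to a waiting Get through one shared
// buffered channel instead of allocating a per-waiter channel.
type dynamicPool struct {
	mu       sync.Mutex
	idle     []resourceWrapper
	waiters  int
	receiver chan resourceWrapper // shared handoff channel
}

func newDynamicPool(maxCap int) *dynamicPool {
	return &dynamicPool{
		// The buffer bounds how many handoffs can be in flight; the follow-up
		// below experiments with sizing it at 4x the maximum capacity.
		receiver: make(chan resourceWrapper, 4*maxCap),
	}
}

func (p *dynamicPool) Get() Resource {
	p.mu.Lock()
	if n := len(p.idle); n > 0 {
		w := p.idle[n-1]
		p.idle = p.idle[:n-1]
		p.mu.Unlock()
		return w.resource
	}
	p.waiters++
	p.mu.Unlock()
	w := <-p.receiver // direct handoff from a future Put
	return w.resource
}

func (p *dynamicPool) Put(r Resource) {
	w := resourceWrapper{resource: r, timeUsed: time.Now()}
	p.mu.Lock()
	if p.waiters > 0 {
		p.waiters--
		p.mu.Unlock()
		p.receiver <- w // wake one waiter
		return
	}
	p.idle = append(p.idle, w)
	p.mu.Unlock()
}
```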
goos: linux
goarch: amd64
pkg: vitess.io/vitess/go/pools
cpu: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
BenchmarkGetPut/static/x1-cap8-8 4396682 249.1 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x1-cap8-8 7383310 148.1 ns/op 7 B/op 0 allocs/op
BenchmarkGetPut/static/x4-cap8-8 1976714 603.2 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x4-cap8-8 1382004 851.3 ns/op 46 B/op 0 allocs/op
BenchmarkGetPut/static/x8-cap8-8 2023770 589.6 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x8-cap8-8 1403683 843.2 ns/op 43 B/op 0 allocs/op
BenchmarkGetPut/static/x32-cap8-8 2053064 571.8 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x32-cap8-8 1402099 845.6 ns/op 43 B/op 0 allocs/op
BenchmarkGetPut/static/x128-cap8-8 2159823 589.0 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x128-cap8-8 1371255 878.6 ns/op 44 B/op 0 allocs/op
BenchmarkGetPut/static/x1-cap64-8 4461562 249.0 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x1-cap64-8 7801064 146.4 ns/op 7 B/op 0 allocs/op
BenchmarkGetPut/static/x4-cap64-8 4456218 248.1 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x4-cap64-8 6059944 190.2 ns/op 9 B/op 0 allocs/op
BenchmarkGetPut/static/x8-cap64-8 4450134 249.9 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x8-cap64-8 4528390 260.6 ns/op 12 B/op 0 allocs/op
BenchmarkGetPut/static/x32-cap64-8 4471232 250.3 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x32-cap64-8 1338823 889.8 ns/op 47 B/op 0 allocs/op
BenchmarkGetPut/static/x128-cap64-8 4815217 501.2 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x128-cap64-8 1290224 914.8 ns/op 46 B/op 0 allocs/op
BenchmarkGetPut/static/x1-cap512-8 4137897 289.8 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x1-cap512-8 7847911 145.0 ns/op 7 B/op 0 allocs/op
BenchmarkGetPut/static/x4-cap512-8 4126407 327.5 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x4-cap512-8 5106384 213.4 ns/op 10 B/op 0 allocs/op
BenchmarkGetPut/static/x8-cap512-8 3619280 371.7 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x8-cap512-8 3771500 298.5 ns/op 14 B/op 0 allocs/op
BenchmarkGetPut/static/x32-cap512-8 3516030 345.1 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x32-cap512-8 2419420 486.9 ns/op 23 B/op 0 allocs/op
BenchmarkGetPut/static/x128-cap512-8 3407198 386.8 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x128-cap512-8 1127898 1049 ns/op 51 B/op 0 allocs/op
PASS
After reducing the common channel's buffer size to 4 times the maximum capacity:
goos: linux
goarch: amd64
pkg: vitess.io/vitess/go/pools
cpu: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
BenchmarkGetPut/static/x1-cap8-8 4172858 251.9 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x1-cap8-8 8175582 146.3 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x4-cap8-8 2073106 575.8 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x4-cap8-8 1421370 839.3 ns/op 6 B/op 0 allocs/op
BenchmarkGetPut/static/x8-cap8-8 2131052 570.5 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x8-cap8-8 1429699 841.8 ns/op 3 B/op 0 allocs/op
BenchmarkGetPut/static/x32-cap8-8 2173950 568.1 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x32-cap8-8 1423432 840.4 ns/op 3 B/op 0 allocs/op
BenchmarkGetPut/static/x128-cap8-8 2084209 585.3 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x128-cap8-8 1382991 858.7 ns/op 3 B/op 0 allocs/op
BenchmarkGetPut/static/x1-cap64-8 4837154 248.3 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x1-cap64-8 8223388 145.5 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x4-cap64-8 4842416 268.0 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x4-cap64-8 6349221 189.7 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x8-cap64-8 4840467 267.3 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x8-cap64-8 4507173 263.5 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x32-cap64-8 4520050 268.7 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x32-cap64-8 1351778 893.3 ns/op 5 B/op 0 allocs/op
BenchmarkGetPut/static/x128-cap64-8 4799488 560.4 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/dynamic/x128-cap64-8 1000000 1082 ns/op 3 B/op 0 allocs/op
BenchmarkGetPut/static/x1-cap512-8 3203734 352.8 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x1-cap512-8 6901803 154.4 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x4-cap512-8 3792057 349.2 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x4-cap512-8 5348278 210.2 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x8-cap512-8 3303650 379.9 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x8-cap512-8 3957998 302.3 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x32-cap512-8 3486044 380.2 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x32-cap512-8 2480737 491.5 ns/op 0 B/op 0 allocs/op
BenchmarkGetPut/static/x128-cap512-8 3116094 376.9 ns/op 24 B/op 3 allocs/op
BenchmarkGetPut/dynamic/x128-cap512-8 1000000 1268 ns/op 2 B/op 0 allocs/op
PASS
- The new pool implementation, based on the `database/sql` design from the Go standard library, is efficient when the number of threads pounding the pool is fewer than 32.
- The old pool implementation, being lock-free, does not block threads on a mutex and hence performs better at higher thread counts.
- The modified old pool implementation (learnings from the new pool applied to the old pool) does not give a significant improvement on any front.
Here are two things to do next:
- Come up with a lock-free slice implementation which performs better at high parallelism (a rough sketch of this direction follows the list). I tried the library from https://github.com/golang-design/lockfree; it improved performance for higher thread counts but was still not close enough to the old pool.
- In the meantime, I plan to modify the old pool implementation and add a connection holder for connections which have different settings applied. The reason is that the old pool has been tested in production, whereas the new pool has not.
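As referenced in the first item, a rough sketch of the lock-free direction: a Treiber-style stack used as a free list. It is illustrative only (not the PR's code or the golang-design/lockfree API); it allocates a node per Push and leans on Go's GC to avoid the ABA problem, which is part of why a production-quality lock-free pool is nontrivial.

```go
package pools

import (
	"sync/atomic"
	"unsafe"
)

// node wraps a resourceWrapper (from the earlier sketches) in an intrusive
// singly linked list.
type node struct {
	value resourceWrapper
	next  unsafe.Pointer // *node
}

// lockFreeStack is a Treiber stack: Push/Pop loop on CompareAndSwapPointer
// instead of taking a mutex.
type lockFreeStack struct {
	head unsafe.Pointer // *node
}

func (s *lockFreeStack) Push(w resourceWrapper) {
	n := &node{value: w} // fresh node per Push: nodes are never reused, so GC prevents ABA
	for {
		old := atomic.LoadPointer(&s.head)
		n.next = old
		if atomic.CompareAndSwapPointer(&s.head, old, unsafe.Pointer(n)) {
			return
		}
	}
}

func (s *lockFreeStack) Pop() (resourceWrapper, bool) {
	for {
		old := atomic.LoadPointer(&s.head)
		if old == nil {
			return resourceWrapper{}, false
		}
		n := (*node)(old)
		if atomic.CompareAndSwapPointer(&s.head, old, n.next) {
			return n.value, true
		}
	}
}
```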
This is not being actively developed. Closing for now.