mem: Add more tiers to the default buffer pool
Currently, performance is very sensitive to payload size. If the total payload is 32KiB + 1 byte, the code gets (or allocates) a 1MiB buffer, then clears the whole thing but only uses ~3% of it. Adding more tiers between 32KiB and 1MiB makes these requests 30% faster in the benchmarks below, and speeds up 256KiB requests by 10% in my benchmarks.
Just adding the tiers causes a 1% regression for 1-byte requests, but switching to slices.BinarySearch more than recovers it. I confirmed that the old and new search return the same results with https://go.dev/play/p/i7Q9ItQVu5T.
Note that both the old and new search have a potential bug when calling Put with buffers that don't have a power of 2 capacity. For example, getPool(4097) will return the pool with size 16384, but that pool can't accept the buffer so it will just ignore it. This would only be a problem if some code reslices the capacity of a buffer.
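A toy sketch of the tier-selection behavior described above; the tier lists and sizes here are illustrative, not the exact tiers added by this PR:

```go
package main

import (
	"fmt"
	"sort"
)

// tierFor returns the smallest tier that can hold size bytes, or -1 when no
// tier is large enough and the pool falls back to an exact-size allocation.
func tierFor(tiers []int, size int) int {
	i := sort.SearchInts(tiers, size) // index of the first tier >= size
	if i == len(tiers) {
		return -1
	}
	return tiers[i]
}

func main() {
	coarse := []int{256, 4 << 10, 16 << 10, 32 << 10, 1 << 20}
	finer := []int{256, 4 << 10, 16 << 10, 32 << 10, 64 << 10, 256 << 10, 1 << 20}
	fmt.Println(tierFor(coarse, 32<<10+1)) // 1048576: the whole 1MiB buffer is cleared, ~3% is used
	fmt.Println(tierFor(finer, 32<<10+1))  // 65536
}
```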
33000 byte request and response benchmarks:
unary-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_33000B-respSize_33000B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_simple-sharedWriteBuffer_false
Title Before After Percentage
TotalOps 67198 89033 32.49%
SendOps 0 0 NaN%
RecvOps 0 0 NaN%
Bytes/op 235034.69 188930.90 -19.62%
Allocs/op 177.73 181.77 2.25%
ReqT/op 1774027200.00 2350471200.00 32.49%
RespT/op 1774027200.00 2350471200.00 32.49%
50th-Lat 131.034µs 94.934µs -27.55%
90th-Lat 187.464µs 177.865µs -5.12%
99th-Lat 459.922µs 308.028µs -33.03%
Avg-Lat 148.645µs 112.131µs -24.56%
GoVersion go1.25.5 go1.25.5
GrpcVersion 1.79.0-dev 1.79.0-dev
streaming-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_33000B-respSize_33000B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_simple-sharedWriteBuffer_false
Title Before After Percentage
TotalOps 77039 116228 50.87%
SendOps 0 0 NaN%
RecvOps 0 0 NaN%
Bytes/op 312723.99 184410.91 -41.03%
Allocs/op 61.74 63.88 3.24%
ReqT/op 2033829600.00 3068419200.00 50.87%
RespT/op 2033829600.00 3068419200.00 50.87%
50th-Lat 108.37µs 72.3µs -33.28%
90th-Lat 166.412µs 131.114µs -21.21%
99th-Lat 526.06µs 270.545µs -48.57%
Avg-Lat 129.656µs 85.868µs -33.77%
GoVersion go1.25.5 go1.25.5
GrpcVersion 1.79.0-dev 1.79.0-dev
unconstrained-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_33000B-respSize_33000B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_simple-sharedWriteBuffer_false
Title Before After Percentage
TotalOps 0 0 NaN%
SendOps 376536 447734 18.91%
RecvOps 413423 575729 39.26%
Bytes/op 103263.42 89198.52 -13.62%
Allocs/op 27.75 28.12 3.60%
ReqT/op 9940550400.00 11820177600.00 18.91%
RespT/op 10914367200.00 15199245600.00 39.26%
50th-Lat 0s 0s NaN%
90th-Lat 0s 0s NaN%
99th-Lat 0s 0s NaN%
Avg-Lat 0s 0s NaN%
GoVersion go1.25.5 go1.25.5
GrpcVersion 1.79.0-dev 1.79.0-dev
1 byte request and response benchmarks:
unary-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1B-respSize_1B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_simple-sharedWriteBuffer_false
Title Before After Percentage
TotalOps 230977 231843 0.37%
SendOps 0 0 NaN%
RecvOps 0 0 NaN%
Bytes/op 9235.90 9235.57 0.00%
Allocs/op 155.07 155.07 0.00%
ReqT/op 184781.60 185474.40 0.38%
RespT/op 184781.60 185474.40 0.38%
50th-Lat 41.15µs 41.109µs -0.10%
90th-Lat 55.558µs 55.117µs -0.79%
99th-Lat 91.657µs 88.321µs -3.64%
Avg-Lat 43.152µs 42.98µs -0.40%
GoVersion go1.25.5 go1.25.5
GrpcVersion 1.79.0-dev 1.79.0-dev
streaming-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1B-respSize_1B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_simple-sharedWriteBuffer_false
Title Before After Percentage
TotalOps 394925 395698 0.20%
SendOps 0 0 NaN%
RecvOps 0 0 NaN%
Bytes/op 1260.37 1260.38 0.00%
Allocs/op 41.01 41.01 0.00%
ReqT/op 315940.00 316558.40 0.20%
RespT/op 315940.00 316558.40 0.20%
50th-Lat 24.087µs 23.987µs -0.42%
90th-Lat 31.861µs 31.861µs 0.00%
99th-Lat 46.45µs 45.929µs -1.12%
Avg-Lat 25.205µs 25.138µs -0.27%
GoVersion go1.25.5 go1.25.5
GrpcVersion 1.79.0-dev 1.79.0-dev
unconstrained-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1B-respSize_1B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_simple-sharedWriteBuffer_false
Title Before After Percentage
TotalOps 0 0 NaN%
SendOps 12550620 11837068 -5.69%
RecvOps 11448882 11193802 -2.23%
Bytes/op 1075.81 1069.33 -0.56%
Allocs/op 25.07 25.07 0.00%
ReqT/op 10040496.00 9469654.40 -5.69%
RespT/op 9159105.60 8955041.60 -2.23%
50th-Lat 0s 0s NaN%
90th-Lat 0s 0s NaN%
99th-Lat 0s 0s NaN%
Avg-Lat 0s 0s NaN%
GoVersion go1.25.5 go1.25.5
GrpcVersion 1.79.0-dev 1.79.0-dev
32000 byte request and response benchmarks:
streaming-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_32000B-respSize_32000B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_simple-sharedWriteBuffer_false
Title Before After Percentage
TotalOps 150481 150473 -0.01%
SendOps 0 0 NaN%
RecvOps 0 0 NaN%
Bytes/op 139452.12 139517.62 0.05%
Allocs/op 42.38 42.41 0.00%
ReqT/op 3852313600.00 3852108800.00 -0.01%
RespT/op 3852313600.00 3852108800.00 -0.01%
50th-Lat 57.943µs 57.952µs 0.02%
90th-Lat 89.133µs 89.744µs 0.69%
99th-Lat 207.272µs 206.561µs -0.34%
Avg-Lat 66.3µs 66.293µs -0.01%
GoVersion go1.25.5 go1.25.5
GrpcVersion 1.79.0-dev 1.79.0-dev
unconstrained-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_32000B-respSize_32000B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_simple-sharedWriteBuffer_false
Title Before After Percentage
TotalOps 0 0 NaN%
SendOps 774812 547494 -29.34%
RecvOps 338049 615464 82.06%
Bytes/op 70072.04 70303.21 0.33%
Allocs/op 20.97 21.25 4.77%
ReqT/op 19835187200.00 14015846400.00 -29.34%
RespT/op 8654054400.00 15755878400.00 82.06%
50th-Lat 0s 0s NaN%
90th-Lat 0s 0s NaN%
99th-Lat 0s 0s NaN%
Avg-Lat 0s 0s NaN%
GoVersion go1.25.5 go1.25.5
GrpcVersion 1.79.0-dev 1.79.0-dev
unary-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_32000B-respSize_32000B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_simple-sharedWriteBuffer_false
Title Before After Percentage
TotalOps 109196 110357 1.06%
SendOps 0 0 NaN%
RecvOps 0 0 NaN%
Bytes/op 147410.90 147495.33 0.06%
Allocs/op 160.25 160.28 0.00%
ReqT/op 2795417600.00 2825139200.00 1.06%
RespT/op 2795417600.00 2825139200.00 1.06%
50th-Lat 78.342µs 77.651µs -0.88%
90th-Lat 140.242µs 138.779µs -1.04%
99th-Lat 267.018µs 264.033µs -1.12%
Avg-Lat 91.406µs 90.436µs -1.06%
GoVersion go1.25.5 go1.25.5
GrpcVersion 1.79.0-dev 1.79.0-dev
RELEASE NOTES:
- mem: add more tiers to the default buffer pool
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 83.32%. Comparing base (9fd5d8a) to head (12af9d7).
:warning: Report is 5 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #8770 +/- ##
==========================================
- Coverage 83.36% 83.32% -0.04%
==========================================
Files 419 419
Lines 32548 32548
==========================================
- Hits 27134 27122 -12
- Misses 4028 4033 +5
- Partials 1386 1393 +7
| Files with missing lines | Coverage Δ |
|---|---|
| mem/buffer_pool.go | 96.61% <100.00%> (ø) |
I should mention that grpc-go users can override the default pool for clients and servers with options, but the same isn't possible for the DefaultBufferPool() calls in the proto encoder. I could rewrite the proto encoder, but it seems better to make the default pool fast out of the box.
@arjan-bal
We probably just went with whatever bucket sizes @PapaCharlie came up with when the initial buffer slicing implementation was done. The change looks OK to me, but I'll punt it to you for the time being, to see if you have more insights here. Thanks.
LGTM! I think I just reused whatever sizes were previously defined. Having more seems fine. However, I don't think there's a functional difference between sort.Search and slices.BinarySearchFunc? I would be wary of switching from one to the other, as you could incur unexpected heap allocations, which will kill your perf. Might be worth writing a small benchmark comparing the two, just to see if there's a meaningful difference.
> Note that both the old and new search have a potential bug when calling Put with buffers that don't have a power of 2 capacity. For example, getPool(4097) will return the pool with size 16384, but that pool can't accept the buffer so it will just ignore it. This would only be a problem if some code reslices the capacity of a buffer.
You are 100% right. tieredBufferPool.Put and tieredBufferPool.Get need different implementations of getPool: Put needs the pool with the largest buffers that are no larger than the returned buffer's capacity, while Get needs the pool with the smallest buffers that are at least as large as the requested size. Otherwise resized buffers get dropped on the floor.
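For reference, here is a minimal sketch of those two lookups, reusing the sizedPools and defaultSize fields from the existing implementation; the helper names getPoolForGet and getPoolForPut are hypothetical, not the code in this PR:

```go
package mem

import "slices"

// getPoolForGet picks the pool with the smallest buffers that are at least
// size bytes, or nil when the request is larger than every tier.
func (p *tieredBufferPool) getPoolForGet(size int) *sizedBufferPool {
	i, _ := slices.BinarySearchFunc(p.sizedPools, size, func(pool *sizedBufferPool, target int) int {
		return pool.defaultSize - target
	})
	if i == len(p.sizedPools) {
		return nil
	}
	return p.sizedPools[i]
}

// getPoolForPut picks the pool with the largest buffers that still fit within
// the returned buffer's capacity, so a capacity-resliced buffer is reused by a
// smaller tier instead of being dropped.
func (p *tieredBufferPool) getPoolForPut(c int) *sizedBufferPool {
	i, found := slices.BinarySearchFunc(p.sizedPools, c, func(pool *sizedBufferPool, target int) int {
		return pool.defaultSize - target
	})
	if found {
		return p.sizedPools[i]
	}
	if i == 0 {
		return nil // smaller than the smallest tier
	}
	return p.sizedPools[i-1]
}
```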
package mem

import (
	"slices"
	"sort"
	"testing"
)

// BenchmarkSearch compares the two ways of finding the sizedBufferPool for a
// given size, probing just below the smallest tier and just above every tier.
func BenchmarkSearch(b *testing.B) {
p := NewTieredBufferPool(defaultBufferPoolSizes...).(*tieredBufferPool)
sizes := []int{defaultBufferPoolSizes[0] - 1}
for _, size := range defaultBufferPoolSizes {
sizes = append(sizes, size+1)
}
b.Run("func=slices.BinarySearch", func(b *testing.B) {
for b.Loop() {
for _, size := range sizes {
slices.BinarySearchFunc(p.sizedPools, size, func(pool *sizedBufferPool, target int) int {
return pool.defaultSize - target
})
}
}
})
b.Run("func=sort.Search", func(b *testing.B) {
for b.Loop() {
for _, size := range sizes {
sort.Search(len(p.sizedPools), func(i int) bool {
return p.sizedPools[i].defaultSize >= size
})
}
}
})
}
According to this, which should be an apples-to-apples comparison, sort.Search is actually marginally faster:
% go test -bench=BenchmarkSearch -count=10 -benchmem | benchstat -col '/func' -
goos: darwin
goarch: arm64
pkg: whatever
cpu: Apple M4 Max
│ slices.BinarySearch │ sort.Search │
│ sec/op │ sec/op vs base │
Search-16 62.81n ± 2% 60.23n ± 0% -4.11% (p=0.000 n=10)
│ slices.BinarySearch │ sort.Search │
│ B/op │ B/op vs base │
Search-16 0.000 ± 0% 0.000 ± 0% ~ (p=1.000 n=10) ¹
¹ all samples are equal
│ slices.BinarySearch │ sort.Search │
│ allocs/op │ allocs/op vs base │
Search-16 0.000 ± 0% 0.000 ± 0% ~ (p=1.000 n=10) ¹
¹ all samples are equal
So it's totally up to you. If you find slices.BinarySearch more legible, then it's a pretty reasonable change, since they do exactly the same thing.
It's possible to make the search even faster since we're using powers of 2.
// Same benchmark file as above, now also importing "math/bits" for the custom variant.
func BenchmarkSearch(b *testing.B) {
p := NewTieredBufferPool(defaultBufferPoolSizes...).(*tieredBufferPool)
sizes := []int{defaultBufferPoolSizes[0] - 1}
for _, size := range defaultBufferPoolSizes {
sizes = append(sizes, size+1)
}
b.Run("func=slices.BinarySearch", func(b *testing.B) {
// omitted
})
b.Run("func=sort.Search", func(b *testing.B) {
// omitted
})
// Ensure all sizes are powers of 2.
for _, size := range defaultBufferPoolSizes {
if bits.OnesCount(uint(size)) != 1 {
b.Fatalf("size %d is not a power of 2", size)
}
}
// Make a slice that maps from highest set bit to index in the defaultBufferPoolSizes slice.
// This is equivalent to the sort.Search call, but more efficient.
upperBound := make([]int, 64)
for i := 0; i < 64; i++ {
upperBound[i] = -1
}
for i, size := range defaultBufferPoolSizes {
upperBound[bits.Len(uint(size))] = i
}
for i := 62; i >= 0; i-- {
if upperBound[i] == -1 {
upperBound[i] = upperBound[i+1]
}
}
b.Run("func=custom", func(b *testing.B) {
for b.Loop() {
for _, size := range sizes {
query := size - 1
_ = upperBound[bits.Len(uint(query))+1]
}
}
})
}
Benchmark results with the list of tier sizes from this PR.
$ go test -bench=BenchmarkSearch -count=10 -benchmem | benchstat -col '/func' -
goos: linux
goarch: amd64
pkg: google.golang.org/grpc/mem
cpu: Intel(R) Xeon(R) CPU @ 2.60GHz
│ slices.BinarySearch │ sort.Search │ custom │
│ sec/op │ sec/op vs base │ sec/op vs base │
Search-48 121.850n ± 0% 99.335n ± 0% -18.48% (p=0.000 n=10) 7.633n ± 2% -93.74% (p=0.000 n=10)
│ slices.BinarySearch │ sort.Search │ custom │
│ B/op │ B/op vs base │ B/op vs base │
Search-48 0.000 ± 0% 0.000 ± 0% ~ (p=1.000 n=10) ¹ 0.000 ± 0% ~ (p=1.000 n=10) ¹
¹ all samples are equal
│ slices.BinarySearch │ sort.Search │ custom │
│ allocs/op │ allocs/op vs base │ allocs/op vs base │
Search-48 0.000 ± 0% 0.000 ± 0% ~ (p=1.000 n=10) ¹ 0.000 ± 0% ~ (p=1.000 n=10) ¹
¹ all samples are equal
Since the tiered buffer pool constructor doesn't restrict the tier sizes to be powers of 2, we could store a boolean indicating whether all sizes are powers of 2 and add a fast path that uses bit manipulation when they are.
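A minimal sketch of that idea, assuming the same mem package internals used in the benchmark above; buildPow2Index is a hypothetical helper, not part of this PR:

```go
package mem

import "math/bits"

// buildPow2Index returns (table, true) only when every tier size is a power
// of 2. table[bits.Len(uint(size-1))+1] then gives the index of the smallest
// tier that fits size bytes, or -1 when size exceeds the largest tier,
// mirroring the upperBound slice in the benchmark above.
func buildPow2Index(sizes []int) ([]int, bool) {
	table := make([]int, 65)
	for i := range table {
		table[i] = -1
	}
	for i, size := range sizes {
		if bits.OnesCount(uint(size)) != 1 {
			return nil, false // not all powers of 2: keep using the binary search
		}
		table[bits.Len(uint(size))] = i
	}
	// Backfill so sizes that fall between two tiers map to the next larger tier.
	for i := len(table) - 2; i >= 0; i-- {
		if table[i] == -1 {
			table[i] = table[i+1]
		}
	}
	return table, true
}
```

The constructor could build this table once, treat a non-nil table as the "all powers of 2" boolean, and have getPool do the table lookup instead of the binary search whenever the table is set.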
We discussed this issue during the maintainers' meeting and believe a better solution is to provide an experimental API that allows users to update the default buffer pool in an init function. This can be achieved by exposing the existing test-only function through the experimental package.
The concern is that adding new tiers to the default pool can lead to lower buffer reuse, as buffers may end up distributed across more tiers. Existing benchmarks don't catch such problems because they use a single payload size for the entire run. By allowing users to configure custom tiers, we can avoid the risk of unintended regressions for the wider user base.
Well, you could make the constructor for it accept a list of powers of 2 instead of arbitrary sizes :) Then you're guaranteed it works!
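One hypothetical shape for such a constructor; the name, signature, and error handling here are assumptions for illustration only:

```go
package mem

import (
	"fmt"
	"math/bits"
)

// NewPow2TieredBufferPool rejects any tier size that is not a power of 2, so
// callers are guaranteed the bit-manipulation lookup is always valid.
func NewPow2TieredBufferPool(sizes ...int) (BufferPool, error) {
	for _, s := range sizes {
		if s <= 0 || bits.OnesCount(uint(s)) != 1 {
			return nil, fmt.Errorf("pool size %d is not a power of 2", s)
		}
	}
	return NewTieredBufferPool(sizes...), nil
}
```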
I've raised https://github.com/grpc/grpc-go/pull/8775, which adds a tiered buffer pool that supports only powers of 2. That should allow adding new tiers without a performance cost for looking up the required sizedBufferPool.