grpc-go
rpc_util: Reuse memory buffer for receiving message
This PR aims to reduce memory allocation in the receive message process. Using this with a stream-heavy workload can improve performance significantly.
RELEASE NOTES:
- rpc_util: reduce memory allocation for the message parsing process in streaming RPC
Could you please explain what problem you are trying to solve by making this change? Maybe we can have a discussion and see if the current approach is the way to go or if a different approach would serve better.
Using this with a stream-heavy workload can improve performance significantly.
Do you have any data to show that this change improves performance? If so, could you please share it with us? Also, we do have a benchmark suite that you can use to run your changes against. Thanks.
I implemented a stream-count feature in benchmark/benchmain and ran my workloads. The following is the result of the benchmark.
unconstrained-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1048576B-respSize_1048576B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
Title Before After Percentage
TotalOps 0 0 NaN%
SendOps 9251 10583 14.40%
RecvOps 8150 11626 42.65%
Bytes/op 5323921.33 3288925.84 -38.22%
Allocs/op 621.94 610.97 -1.77%
ReqT/op 7760301260.80 8877663846.40 14.40%
RespT/op 6836715520.00 9752595660.80 42.65%
50th-Lat 0s 0s NaN%
90th-Lat 0s 0s NaN%
99th-Lat 0s 0s NaN%
Avg-Lat 0s 0s NaN%
GoVersion go1.19.4 go1.19.4
GrpcVersion 1.53.0-dev 1.53.0-dev
And I have attached all of the benchmark results. cpuProfAfter.zip
I implemented a stream-count feature in benchmark/benchmain and ran my workloads.
What does this feature do?
And I have attached all of the benchmark results. cpuProfAfter.zip
This zip file contains a single file and I don't know how to view the results in there.
The benchmark results that you have posted here are for a very specific case. It uses the following settings, among others:
- workload type: unconstrained
- benchmark run time: 10s
- concurrent calls: 1
- req size: 1048576B
- resp size: 1048576B
We would really like to see more comprehensive benchmark runs for a code change which is as fundamental as yours. Also, a benchmark run time of 10s is simply not good enough.
Also, could you please explain why you want to make the change that you are making? Are you seeing some performance bottlenecks when running specific workloads?
What does this feature do?
This feature lets the benchmark's streaming workload use multiple streams. In real-world scenarios, API users do not send only a single message per stream.
This zip file contains a single file and I don't know how to view the results in there.
I apologize. It would probably be better to reproduce the results locally; I will include the commands needed for reproduction.
We would really like to see more comprehensive benchmark runs for a code change which is as fundamental as yours. Also, a benchmark run time of 10s is simply not good enough.
Agreed. I changed the benchmark run time to 1m.
result:
unary-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_10240-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
Title Before After Percentage
TotalOps 31405 31349 -0.18%
SendOps 0 0 NaN%
RecvOps 0 0 NaN%
Bytes/op 18072.12 18104.19 0.18%
Allocs/op 216.46 215.70 -0.46%
ReqT/op 4187333.33 4179866.67 -0.18%
RespT/op 4187333.33 4179866.67 -0.18%
50th-Lat 1.903201ms 1.90694ms 0.20%
90th-Lat 1.958722ms 1.960585ms 0.10%
99th-Lat 2.007894ms 2.015896ms 0.40%
Avg-Lat 1.910174ms 1.913565ms 0.18%
GoVersion go1.19.4 go1.19.4
GrpcVersion 1.53.0-dev 1.53.0-dev
streaming-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_10240-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
Title Before After Percentage
TotalOps 3934 3920 -0.36%
SendOps 0 0 NaN%
RecvOps 0 0 NaN%
Bytes/op 68941.53 52429.05 -23.95%
Allocs/op 468.97 449.64 -4.05%
ReqT/op 524533.33 522666.67 -0.36%
RespT/op 524533.33 522666.67 -0.36%
50th-Lat 15.209591ms 15.232099ms 0.15%
90th-Lat 15.333759ms 15.356509ms 0.15%
99th-Lat 16.770753ms 17.077881ms 1.83%
Avg-Lat 15.252835ms 15.309135ms 0.37%
GoVersion go1.19.4 go1.19.4
GrpcVersion 1.53.0-dev 1.53.0-dev
unconstrained-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_10240-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
Title Before After Percentage
TotalOps 0 0 NaN%
SendOps 54388 51778 -4.80%
RecvOps 77008 77023 0.02%
Bytes/op 6123.87 4086.67 -33.26%
Allocs/op 21.29 19.23 -9.40%
ReqT/op 7251733.33 6903733.33 -4.80%
RespT/op 10267733.33 10269733.33 0.02%
50th-Lat 0s 0s NaN%
90th-Lat 0s 0s NaN%
99th-Lat 0s 0s NaN%
Avg-Lat 0s 0s NaN%
GoVersion go1.19.4 go1.19.4
GrpcVersion 1.53.0-dev 1.53.0-dev
unconstrained-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_40ms-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
Title Before After Percentage
TotalOps 0 0 NaN%
SendOps 2602356 2842256 9.22%
RecvOps 2664021 2037056 -23.53%
Bytes/op 6315.69 3869.00 -38.74%
Allocs/op 20.39 17.48 -14.71%
ReqT/op 346980800.00 378967466.67 9.22%
RespT/op 355202800.00 271607466.67 -23.53%
50th-Lat 0s 0s NaN%
90th-Lat 0s 0s NaN%
99th-Lat 0s 0s NaN%
Avg-Lat 0s 0s NaN%
GoVersion go1.19.4 go1.19.4
GrpcVersion 1.53.0-dev 1.53.0-dev
streaming-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_40ms-kbps_10240-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
Title Before After Percentage
TotalOps 90 90 0.00%
SendOps 0 0 NaN%
RecvOps 0 0 NaN%
Bytes/op 70131.82 53823.20 -23.25%
Allocs/op 551.42 533.84 -3.26%
ReqT/op 12000.00 12000.00 0.00%
RespT/op 12000.00 12000.00 0.00%
50th-Lat 668.543496ms 667.398531ms -0.17%
90th-Lat 674.226219ms 671.845385ms -0.35%
99th-Lat 677.662648ms 675.277176ms -0.35%
Avg-Lat 669.060189ms 668.227508ms -0.12%
GoVersion go1.19.4 go1.19.4
GrpcVersion 1.53.0-dev 1.53.0-dev
unconstrained-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_40ms-kbps_10240-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
Title Before After Percentage
TotalOps 0 0 NaN%
SendOps 76866 76860 -0.01%
RecvOps 76950 77004 0.07%
Bytes/op 5909.29 3845.26 -34.93%
Allocs/op 19.95 17.88 -10.03%
ReqT/op 10248800.00 10248000.00 -0.01%
RespT/op 10260000.00 10267200.00 0.07%
50th-Lat 0s 0s NaN%
90th-Lat 0s 0s NaN%
99th-Lat 0s 0s NaN%
Avg-Lat 0s 0s NaN%
GoVersion go1.19.4 go1.19.4
GrpcVersion 1.53.0-dev 1.53.0-dev
unary-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
Title Before After Percentage
TotalOps 607492 605531 -0.32%
SendOps 0 0 NaN%
RecvOps 0 0 NaN%
Bytes/op 18057.20 18106.91 0.27%
Allocs/op 215.88 215.90 0.00%
ReqT/op 80998933.33 80737466.67 -0.32%
RespT/op 80998933.33 80737466.67 -0.32%
50th-Lat 95.001µs 95.301µs 0.32%
90th-Lat 108.75µs 109.127µs 0.35%
99th-Lat 223.706µs 225.913µs 0.99%
Avg-Lat 98.429µs 98.753µs 0.33%
GoVersion go1.19.4 go1.19.4
GrpcVersion 1.53.0-dev 1.53.0-dev
unconstrained-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
Title Before After Percentage
TotalOps 0 0 NaN%
SendOps 17801533 23479365 31.90%
RecvOps 14600691 18649868 27.73%
Bytes/op 5866.95 3813.91 -34.99%
Allocs/op 20.04 17.69 -14.97%
ReqT/op 2373537733.33 3130582000.00 31.90%
RespT/op 1946758800.00 2486649066.67 27.73%
50th-Lat 0s 0s NaN%
90th-Lat 0s 0s NaN%
99th-Lat 0s 0s NaN%
Avg-Lat 0s 0s NaN%
GoVersion go1.19.4 go1.19.4
GrpcVersion 1.53.0-dev 1.53.0-dev
streaming-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_40ms-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
Title Before After Percentage
TotalOps 92 91 -1.09%
SendOps 0 0 NaN%
RecvOps 0 0 NaN%
Bytes/op 70343.83 53902.51 -23.37%
Allocs/op 566.60 540.46 -4.59%
ReqT/op 12266.67 12133.33 -1.08%
RespT/op 12266.67 12133.33 -1.08%
50th-Lat 655.037524ms 666.202283ms 1.70%
90th-Lat 656.737879ms 667.99468ms 1.71%
99th-Lat 657.299189ms 669.693627ms 1.89%
Avg-Lat 654.943331ms 665.739009ms 1.65%
GoVersion go1.19.4 go1.19.4
GrpcVersion 1.53.0-dev 1.53.0-dev
unary-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_40ms-kbps_10240-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
Title Before After Percentage
TotalOps 731 722 -1.23%
SendOps 0 0 NaN%
RecvOps 0 0 NaN%
Bytes/op 18071.45 18069.62 -0.01%
Allocs/op 217.97 215.36 -0.92%
ReqT/op 97466.67 96266.67 -1.23%
RespT/op 97466.67 96266.67 -1.23%
50th-Lat 82.127316ms 83.219196ms 1.33%
90th-Lat 82.400964ms 84.170818ms 2.15%
99th-Lat 82.576049ms 84.53085ms 2.37%
Avg-Lat 82.161793ms 83.147415ms 1.20%
GoVersion go1.19.4 go1.19.4
GrpcVersion 1.53.0-dev 1.53.0-dev
streaming-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
Title Before After Percentage
TotalOps 117155 120640 2.97%
SendOps 0 0 NaN%
RecvOps 0 0 NaN%
Bytes/op 69970.61 53383.67 -23.71%
Allocs/op 522.46 499.58 -4.40%
ReqT/op 15620666.67 16085333.33 2.97%
RespT/op 15620666.67 16085333.33 2.97%
50th-Lat 494.255µs 482.374µs -2.40%
90th-Lat 541.998µs 522.98µs -3.51%
99th-Lat 917.151µs 888.174µs -3.16%
Avg-Lat 511.8µs 497.027µs -2.89%
GoVersion go1.19.4 go1.19.4
GrpcVersion 1.53.0-dev 1.53.0-dev
unary-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_40ms-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
Title Before After Percentage
TotalOps 736 722 -1.90%
SendOps 0 0 NaN%
RecvOps 0 0 NaN%
Bytes/op 18025.93 18032.04 0.04%
Allocs/op 214.81 213.43 -0.47%
ReqT/op 98133.33 96266.67 -1.90%
RespT/op 98133.33 96266.67 -1.90%
50th-Lat 81.676483ms 83.26441ms 1.94%
90th-Lat 82.441415ms 84.380606ms 2.35%
99th-Lat 82.518705ms 84.539625ms 2.45%
Avg-Lat 81.612729ms 83.167132ms 1.90%
GoVersion go1.19.4 go1.19.4
GrpcVersion 1.53.0-dev 1.53.0-dev
commands:
go run benchmark/benchmain/main.go -benchtime=1m \
-workloads=all -preloader=on \
-maxConcurrentCalls=1 \
-reqSizeBytes=1000 -respSizeBytes=1000 -streamCounts=8 \
-resultFile=beforeResult
# after checking out the buffer-reuse commit
go run benchmark/benchmain/main.go -benchtime=1m \
-workloads=all -preloader=on \
-maxConcurrentCalls=1 \
-reqSizeBytes=1000 -respSizeBytes=1000 -streamCounts=8 \
-resultFile=afterResult
go run benchmark/benchresult/main.go beforeResult afterResult
Also, could you please explain why you want to make the change that you are making? Are you seeing some performance bottlenecks when running specific workloads?
I am developing an OLAP database and our map reduce framework relies on gRPC to send and receive large amounts of data. Recently, we discovered that gRPC is allocating unnecessary memory for the stream API, which is causing our system to slow down. To improve performance, we want to reduce the memory allocation.
Actual code link:
- go source: https://github.com/ab180/lrmr/blob/fb116c37932313e579264d34c0770fc2313e4d8d/output/push_stream.go#L63
- rpc: https://github.com/ab180/lrmr/blob/fb116c37932313e579264d34c0770fc2313e4d8d/lrmrpb/rpc.proto#L14
Could you please send out your changes to the benchmarking code as a separate PR? Thanks.
Done with https://github.com/grpc/grpc-go/pull/5898. I greatly appreciate your review.
The problem with the approach taken in this PR is that once a really big message is received on the stream, the memory allocated for it will never be released. So, let's say a stream sends a 4MB message to start with, and then only sends 1B messages for the rest of its life; we will still hold on to the 4MB buffer.
We have discussed options for this in the past, and the option that seems like it would work best is as follows:
- have a bunch of `sync.Pool`s that cache recv buffers
- each `sync.Pool` would cache buffers of a fixed size, and we would have `sync.Pool`s for logarithmically increasing buffer sizes; for example, we might have a set of `sync.Pool`s for sizes 10B, 100B, 1KB, 10KB, 100KB, etc.
- each time we read the headers and determine the length of the message, we would get the most appropriately sized buffer from the corresponding `sync.Pool`, and once we are done using the message, we would release it back to the pool
The above described change is more involved and would involve more benchmarking. Please let us know if you are interested in this, and we can discuss more before you get started.
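For concreteness, here is a minimal Go sketch of the tiered-pool idea described above. The type and method names are hypothetical, and this is not the implementation that eventually landed in grpc-go:

```go
package pool

import (
	"sort"
	"sync"
)

// tieredBufferPool keeps one sync.Pool per size class. Classes grow
// logarithmically, so every message length maps to a nearby class.
type tieredBufferPool struct {
	sizes []int        // sorted size classes, e.g. 1<<10, 1<<13, 1<<16, ...
	pools []*sync.Pool // pools[i] caches buffers with cap == sizes[i]
}

func newTieredBufferPool(sizes ...int) *tieredBufferPool {
	sort.Ints(sizes)
	p := &tieredBufferPool{sizes: sizes}
	for _, s := range sizes {
		size := s // capture the per-iteration value for the closure
		p.pools = append(p.pools, &sync.Pool{
			New: func() any { return make([]byte, size) },
		})
	}
	return p
}

// Get returns a buffer of length n backed by the smallest size class that
// fits. Messages larger than the biggest class bypass the pools entirely.
func (p *tieredBufferPool) Get(n int) []byte {
	i := sort.SearchInts(p.sizes, n)
	if i == len(p.sizes) {
		return make([]byte, n)
	}
	return p.pools[i].Get().([]byte)[:n]
}

// Put returns a buffer to the pool of its exact size class; buffers
// allocated outside the classes are simply left to the GC.
func (p *tieredBufferPool) Put(buf []byte) {
	i := sort.SearchInts(p.sizes, cap(buf))
	if i == len(p.sizes) || p.sizes[i] != cap(buf) {
		return
	}
	p.pools[i].Put(buf[:cap(buf)])
}
```

Logarithmic size classes bound the per-message over-allocation to roughly one class step while keeping the number of pools small.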
I am interested. Please let me know what the next step is so I can proceed and resolve the issue.
ping @easwars to follow up
I am interested. Please let me know what the next step is so I can proceed and resolve the issue.
A brief overview of the approach that we would like to take is described here: https://github.com/grpc/grpc-go/pull/5862#issuecomment-1386243942. Once you get more familiar with the changes that would be needed to implement that approach, we can discuss again, and talk in greater depth about the proposed set of changes that you intend to make.
One thing we would like to mention up front is that, if there is no significant improvement in the benchmarks after doing all that work, we might have to abandon it. That is something that we would have to live with.
Please let us know when you have a better understanding of the changes to be made, and we can either discuss it here or you can open a new issue for that.
Thanks.
I am currently using a single `sync.Pool` for this implementation, because I am uncertain whether multiple fixed-size `sync.Pool`s would reduce memory usage across the whole program.
Thankfully, the benchmark results are outstanding.
streaming-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_40ms-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
Title Before After Percentage
TotalOps 736 735 -0.14%
SendOps 0 0 NaN%
RecvOps 0 0 NaN%
Bytes/op 9888.73 7889.85 -20.21%
Allocs/op 73.52 72.59 -1.36%
ReqT/op 98133.33 98000.00 -0.14%
RespT/op 98133.33 98000.00 -0.14%
50th-Lat 81.595871ms 81.625ms 0.04%
90th-Lat 82.288644ms 82.302893ms 0.02%
99th-Lat 82.387358ms 82.401457ms 0.02%
Avg-Lat 81.592392ms 81.652686ms 0.07%
GoVersion go1.19.4 go1.19.4
GrpcVersion 1.53.0-dev 1.53.0-dev
unconstrained-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_40ms-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
Title Before After Percentage
TotalOps 0 0 NaN%
SendOps 2857150 3076809 7.69%
RecvOps 6389357 5638488 -11.75%
Bytes/op 6882.22 4605.70 -33.09%
Allocs/op 21.77 21.18 0.00%
ReqT/op 380953333.33 410241200.00 7.69%
RespT/op 851914266.67 751798400.00 -11.75%
50th-Lat 0s 0s NaN%
90th-Lat 0s 0s NaN%
99th-Lat 0s 0s NaN%
Avg-Lat 0s 0s NaN%
GoVersion go1.19.4 go1.19.4
GrpcVersion 1.53.0-dev 1.53.0-dev
unconstrained-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_40ms-kbps_10240-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
Title Before After Percentage
TotalOps 0 0 NaN%
SendOps 76867 76856 -0.01%
RecvOps 77063 76732 -0.43%
Bytes/op 5901.64 3896.63 -33.97%
Allocs/op 19.90 19.89 0.00%
ReqT/op 10248933.33 10247466.67 -0.01%
RespT/op 10275066.67 10230933.33 -0.43%
50th-Lat 0s 0s NaN%
90th-Lat 0s 0s NaN%
99th-Lat 0s 0s NaN%
Avg-Lat 0s 0s NaN%
GoVersion go1.19.4 go1.19.4
GrpcVersion 1.53.0-dev 1.53.0-dev
unary-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_10240-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
Title Before After Percentage
TotalOps 31407 31456 0.16%
SendOps 0 0 NaN%
RecvOps 0 0 NaN%
Bytes/op 18056.47 17702.29 -1.96%
Allocs/op 216.16 218.27 0.93%
ReqT/op 4187600.00 4194133.33 0.16%
RespT/op 4187600.00 4194133.33 0.16%
50th-Lat 1.902471ms 1.902914ms 0.02%
90th-Lat 1.960533ms 1.958071ms -0.13%
99th-Lat 2.04899ms 2.012327ms -1.79%
Avg-Lat 1.910015ms 1.907077ms -0.15%
GoVersion go1.19.4 go1.19.4
GrpcVersion 1.53.0-dev 1.53.0-dev
streaming-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_10240-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
Title Before After Percentage
TotalOps 31522 31559 0.12%
SendOps 0 0 NaN%
RecvOps 0 0 NaN%
Bytes/op 9724.26 7720.42 -20.61%
Allocs/op 61.73 61.27 0.00%
ReqT/op 4202933.33 4207866.67 0.12%
RespT/op 4202933.33 4207866.67 0.12%
50th-Lat 1.897042ms 1.897172ms 0.01%
90th-Lat 1.953226ms 1.953482ms 0.01%
99th-Lat 2.016064ms 1.991549ms -1.22%
Avg-Lat 1.902967ms 1.900766ms -0.12%
GoVersion go1.19.4 go1.19.4
GrpcVersion 1.53.0-dev 1.53.0-dev
unary-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_40ms-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
Title Before After Percentage
TotalOps 736 735 -0.14%
SendOps 0 0 NaN%
RecvOps 0 0 NaN%
Bytes/op 17994.00 17787.98 -1.15%
Allocs/op 213.51 215.05 0.94%
ReqT/op 98133.33 98000.00 -0.14%
RespT/op 98133.33 98000.00 -0.14%
50th-Lat 81.557537ms 81.671239ms 0.14%
90th-Lat 82.308758ms 82.404262ms 0.12%
99th-Lat 82.543432ms 82.571359ms 0.03%
Avg-Lat 81.579228ms 81.727436ms 0.18%
GoVersion go1.19.4 go1.19.4
GrpcVersion 1.53.0-dev 1.53.0-dev
unconstrained-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_10240-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
Title Before After Percentage
TotalOps 0 0 NaN%
SendOps 52201 52035 -0.32%
RecvOps 77056 77022 -0.04%
Bytes/op 6138.42 4144.01 -32.48%
Allocs/op 21.19 21.30 0.00%
ReqT/op 6960133.33 6938000.00 -0.32%
RespT/op 10274133.33 10269600.00 -0.04%
50th-Lat 0s 0s NaN%
90th-Lat 0s 0s NaN%
99th-Lat 0s 0s NaN%
Avg-Lat 0s 0s NaN%
GoVersion go1.19.4 go1.19.4
GrpcVersion 1.53.0-dev 1.53.0-dev
unary-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_40ms-kbps_10240-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
Title Before After Percentage
TotalOps 731 731 0.00%
SendOps 0 0 NaN%
RecvOps 0 0 NaN%
Bytes/op 18040.90 17777.13 -1.46%
Allocs/op 217.93 220.37 1.38%
ReqT/op 97466.67 97466.67 0.00%
RespT/op 97466.67 97466.67 0.00%
50th-Lat 82.080816ms 82.17037ms 0.11%
90th-Lat 82.252643ms 82.298411ms 0.06%
99th-Lat 82.492231ms 82.585971ms 0.11%
Avg-Lat 82.089176ms 82.186704ms 0.12%
GoVersion go1.19.4 go1.19.4
GrpcVersion 1.53.0-dev 1.53.0-dev
streaming-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_40ms-kbps_10240-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
Title Before After Percentage
TotalOps 728 729 0.14%
SendOps 0 0 NaN%
RecvOps 0 0 NaN%
Bytes/op 9916.45 7894.83 -20.39%
Allocs/op 73.47 73.11 0.00%
ReqT/op 97066.67 97200.00 0.14%
RespT/op 97066.67 97200.00 0.14%
50th-Lat 82.254364ms 82.170236ms -0.10%
90th-Lat 83.288298ms 83.179603ms -0.13%
99th-Lat 83.9771ms 83.946449ms -0.04%
Avg-Lat 82.461816ms 82.383182ms -0.10%
GoVersion go1.19.4 go1.19.4
GrpcVersion 1.53.0-dev 1.53.0-dev
unary-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
Title Before After Percentage
TotalOps 540539 544492 0.73%
SendOps 0 0 NaN%
RecvOps 0 0 NaN%
Bytes/op 18044.52 17782.27 -1.45%
Allocs/op 214.97 218.02 1.86%
ReqT/op 72071866.67 72598933.33 0.73%
RespT/op 72071866.67 72598933.33 0.73%
50th-Lat 102.6µs 104.938µs 2.28%
90th-Lat 126.202µs 125.717µs -0.38%
99th-Lat 279.492µs 256.144µs -8.35%
Avg-Lat 110.634µs 109.812µs -0.74%
GoVersion go1.19.4 go1.19.4
GrpcVersion 1.53.0-dev 1.53.0-dev
streaming-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
Title Before After Percentage
TotalOps 869136 849326 -2.28%
SendOps 0 0 NaN%
RecvOps 0 0 NaN%
Bytes/op 9842.30 7849.17 -20.25%
Allocs/op 68.75 68.31 0.00%
ReqT/op 115884800.00 113243466.67 -2.28%
RespT/op 115884800.00 113243466.67 -2.28%
50th-Lat 65.393µs 68.259µs 4.38%
90th-Lat 76.473µs 78.303µs 2.39%
99th-Lat 144.69µs 117.089µs -19.08%
Avg-Lat 68.728µs 70.309µs 2.30%
GoVersion go1.19.4 go1.19.4
GrpcVersion 1.53.0-dev 1.53.0-dev
unconstrained-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
Title Before After Percentage
TotalOps 0 0 NaN%
SendOps 14678884 18847588 28.40%
RecvOps 13348284 15292041 14.56%
Bytes/op 5925.54 3877.99 -34.56%
Allocs/op 20.23 19.77 -4.94%
ReqT/op 1957184533.33 2513011733.33 28.40%
RespT/op 1779771200.00 2038938800.00 14.56%
50th-Lat 0s 0s NaN%
90th-Lat 0s 0s NaN%
99th-Lat 0s 0s NaN%
Avg-Lat 0s 0s NaN%
GoVersion go1.19.4 go1.19.4
GrpcVersion 1.53.0-dev 1.53.0-dev
commands:
go run benchmark/benchmain/main.go -benchtime=1m \
-workloads=all -preloader=on \
-maxConcurrentCalls=1 \
-reqSizeBytes=1000 -respSizeBytes=1000 \
-resultFile=beforeResult
# after checking out the `sync.Pool` commit
go run benchmark/benchmain/main.go -benchtime=1m \
-workloads=all -preloader=on \
-maxConcurrentCalls=1 \
-reqSizeBytes=1000 -respSizeBytes=1000 \
-resultFile=afterResult
go run benchmark/benchresult/main.go beforeResult afterResult
Thank you for trying the `sync.Pool` approach. But, we would still not be able to accept the PR without multiple sync pools for different sized messages because this implementation has the same drawback as the previous one. A single extremely large message at the start of the stream followed by small messages for the remainder of its lifetime would still leave the stream with the big buffer.
One benchmark showing improved results does not mean that a change is appropriate for all use cases.
this implementation has the same drawback as the previous one. A single extremely large message at the start of the stream followed by small messages for the remainder of its lifetime would still leave the stream with the big buffer
I don't think this is true now: a GC will recycle anything in the pool, which is what we want, and avoids keeping a big buffer in memory forever.
However, it still uses a single simple `sync.Pool`, but with any potential buffer size, so that when `Get` is called, it pulls anything out of the pool, resizes it if needed, and throws the old one away if it's too small. So while this works, and will look good for some workloads, it might have problems with other workloads. E.g. two concurrent streams, one with large messages and one with small messages, could result in poor usage patterns. Creating multiple pools and retrieving from the appropriate pool might work, but then we need to create a large number of `sync.Pool`s for that, and I don't know whether the tradeoff there is desirable or not. It would take some real-world measurements, probably.
Another thing I'd like is a way to leave this off (at least initially) by default, but have some kind of setting to enable it. If it turns out to be mostly beneficial, we would enable it by default and provide a corresponding way to disable it.
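To make the failure mode concrete, here is a rough sketch of the single-pool get-and-resize pattern being described (hypothetical names, not the PR's actual code):

```go
package pool

import "sync"

// recvBufferPool caches a buffer of whatever size was last returned.
var recvBufferPool = sync.Pool{
	New: func() any { return []byte(nil) },
}

// getBuffer reuses the pooled buffer when its capacity suffices; otherwise
// the pooled buffer is discarded and a fresh one of length n is allocated.
// With mixed message sizes this is where poor usage patterns arise: a small
// pooled buffer is repeatedly thrown away to serve large messages.
func getBuffer(n int) []byte {
	buf := recvBufferPool.Get().([]byte)
	if cap(buf) < n {
		return make([]byte, n) // the old buffer is left for the GC
	}
	return buf[:n]
}

// putBuffer releases a buffer back to the pool once the message is consumed.
func putBuffer(buf []byte) {
	recvBufferPool.Put(buf[:0])
}
```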
@easwars
But, we would still not be able to accept the PR without multiple sync pools for different sized messages because this implementation has the same drawback as the previous one. A single extremely large message at the start of the stream followed by small messages for the remainder of its lifetime would still leave the stream with the big buffer.
I implemented pools of varying sizes for the parser. cb8827d8caae66a209b3cbf6d4e430f5d69b0a5d
@dfawley
Another thing I'd like is a way to leave this off (at least initially) by default, but have some kind of setting to enable it. If it turns out to be mostly beneficial, we would enable it by default and provide a corresponding way to disable it.
I introduced the "useBytesPoolForParser" option and set its default value to false. bae37e3cae1d8eacd83e09c6077ee72fba78ec1c
@easwars
Also, we should think about a more comprehensive benchmarking plan for these changes. Even though we are gating it behind a dial option, having some good benchmark data can be useful for us to be convinced about the need for these changes, and also as something that we can point our users to when they have a hard time deciding whether they want to use this feature.
In commit c05feed5c94abfc3d94c8eac26af4f1be359ce03, I added shared receive buffer options (nil and simple) to the benchmark. More shared receive buffer options could easily be added to the benchmark in the future.
I also have a benchmark result for a specific case to share.
command:
go run benchmark/benchmain/main.go -benchtime=1m \
-workloads=unconstrained -preloader=on \
-maxConcurrentCalls=1 \
-kbps=0 -latency=0ms \
-reqSizeBytes=1000 -respSizeBytes=1000
result:
go1.20.1/grpc1.53.0-dev
unconstrained-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sharedRecvBufferPool_nil-:
Bytes/op: 5924.858174205665 Allocs/op: 21.944834484531267
Histogram (empty)
Number of requests: 16624263 Request throughput: 2.2165684e+09 bit/s
Number of responses: 13076104 Response throughput: 1.7434805333333333e+09 bit/s
go1.20.1/grpc1.53.0-dev
unconstrained-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sharedRecvBufferPool_simple-:
Bytes/op: 3884.258894307068 Allocs/op: 19.696514863496976
Histogram (empty)
Number of requests: 19910184 Request throughput: 2.6546912e+09 bit/s
Number of responses: 15540020 Response throughput: 2.0720026666666667e+09 bit/s
@easwars Would it be possible to consider releasing this in Milestone 1.54? An ETA would be very helpful, since this change is crucial for my project's performance. I would appreciate it if you could kindly answer this question.
@easwars Would it be possible to consider releasing this in Milestone 1.54? An ETA would be very helpful, since this change is crucial for my project's performance. I would appreciate it if you could kindly answer this question.
We are going to be cutting the release branch for 1.54 either today or tomorrow (and we are already late by a week). We only cherrypick important bug fixes into the release branch, after the branch cut and before the actual release. So, I cannot guarantee at this point that we will be able to get this PR into 1.54.
I think the changes are mostly moving towards where we want them to be. Thanks for doing this.
Thank you for taking the time to review my work! It's really motivating to know that we're making progress and getting closer to the finish line.
In addition to the inline comments, there are no tests added in this PR. Let's make sure there's coverage for anything we add.
I have added a test for the shared buffer pool. If you require any additional tests, please feel free to let me know.
I have added a test for the shared buffer pool. If you require any additional tests, please feel free to let me know.
I think it would be useful to have a test at the gRPC API level. Just something that does a few (10?) RPCs with a shared buffer pool used in the client and server. You could copy something like this test case, and hopefully it won't take long to do, as this should be even simpler than that one.
I think it would be useful to have a test at the gRPC API level. Just something that does a few (10?) RPCs with a shared buffer pool used in the client and server. You could copy something like this test case, and hopefully it won't take long to do, as this should be even simpler than that one.
I have implemented a new test, `TestRecvBufferPool`. 8e8f683ac72b7b74e0bbcee2b6e14fce116dab65
@dfawley Thank you for your review!
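For illustration, a condensed sketch of what such an API-level test might look like. The option names (`grpc.RecvBufferPool`, `grpc.WithRecvBufferPool`, `grpc.NewSharedBufferPool`) reflect the experimental API as it eventually shipped, and the echo service is assumed, so this is not the PR's verbatim test code:

```go
package pool_test

import (
	"context"
	"net"
	"testing"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	testpb "google.golang.org/grpc/interop/grpc_testing"
)

type testServer struct {
	testpb.UnimplementedTestServiceServer
}

func (s *testServer) UnaryCall(ctx context.Context, in *testpb.SimpleRequest) (*testpb.SimpleResponse, error) {
	return &testpb.SimpleResponse{Payload: in.GetPayload()}, nil // echo the payload back
}

func TestRecvBufferPoolSketch(t *testing.T) {
	lis, err := net.Listen("tcp", "localhost:0")
	if err != nil {
		t.Fatalf("net.Listen() failed: %v", err)
	}
	// Enable the shared receive buffer pool on both server and client.
	s := grpc.NewServer(grpc.RecvBufferPool(grpc.NewSharedBufferPool()))
	testpb.RegisterTestServiceServer(s, &testServer{})
	go s.Serve(lis)
	defer s.Stop()

	cc, err := grpc.Dial(lis.Addr().String(),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithRecvBufferPool(grpc.NewSharedBufferPool()))
	if err != nil {
		t.Fatalf("grpc.Dial() failed: %v", err)
	}
	defer cc.Close()
	client := testpb.NewTestServiceClient(cc)

	// A handful of RPCs; correct replies show that pooled receive buffers
	// do not corrupt or truncate messages.
	for i := 0; i < 10; i++ {
		if _, err := client.UnaryCall(context.Background(), &testpb.SimpleRequest{}); err != nil {
			t.Fatalf("UnaryCall(%d) failed: %v", i, err)
		}
	}
}
```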
Thank you. We can certainly wait for @easwars to return for the final review. His insights would be valuable.
Thank you for this PR! I'm testing this patch in our dev environment. The description mentions "streaming rpc", but the code that gets a buffer from the pool is also hit for unary methods, and my experiment confirms that. It wasn't clear to me whether this was meant to work for unary methods too. Perhaps the description could clarify that?
Update: Summary of my findings:
- On the server side, pooling of buffers is supported for streaming methods. For unary methods, received messages are put into a buffer from the pool, but the buffer is never returned to the pool.
- On the client side, pooling seems to work for messages received by both streaming and unary methods.
Also, it looks like there are some merge conflicts in the benchmark files. I would be OK with moving them to a separate PR as well, if that makes life easier. Thanks.
Also, it looks like there are some merge conflicts in the benchmark files. I would be OK with moving them to a separate PR as well, if that makes life easier. Thanks.
I have made a commit to resolve the conflicts. 86d999fa9a50774867c65c4b35e841316a530eff
Thank you very much for your contribution @hueypark and apologies for the really drawn out review process on this PR.