
rpc_util: Reuse memory buffer for receiving message

Open hueypark opened this pull request 1 year ago • 27 comments

This PR aims to reduce memory allocation in the message-receiving path. Using it with a stream-heavy workload can improve performance significantly.

RELEASE NOTES:

  • rpc_util: reduce memory allocation for the message parsing process in streaming RPC

hueypark avatar Dec 13 '22 07:12 hueypark

CLA Signed

The committers are authorized under a signed CLA.

  • :white_check_mark: login: hueypark / name: Jaewan Park

Could you please explain what problem you are trying to solve by making this change? Maybe we can have a discussion and see if the current approach is the way to go or if a different approach would serve better.

Using this with a stream-heavy workload can improve performance significantly.

Do you have any data to show that this change improves performance? And if so, could you please share that with us. Also, we do have a benchmark suite that you can use to run your changes against. Thanks.

easwars avatar Dec 15 '22 22:12 easwars

I implemented a stream-counts feature in benchmark/benchmain and ran my workloads. The benchmark results are below.

unconstrained-networkMode_Local-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1048576B-respSize_1048576B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
               Title       Before        After Percentage
            TotalOps            0            0      NaN%
             SendOps         9251        10583    14.40%
             RecvOps         8150        11626    42.65%
            Bytes/op   5323921.33   3288925.84   -38.22%
           Allocs/op       621.94       610.97    -1.77%
             ReqT/op 7760301260.80 8877663846.40    14.40%
            RespT/op 6836715520.00 9752595660.80    42.65%
            50th-Lat           0s           0s      NaN%
            90th-Lat           0s           0s      NaN%
            99th-Lat           0s           0s      NaN%
             Avg-Lat           0s           0s      NaN%
           GoVersion     go1.19.4     go1.19.4
         GrpcVersion   1.53.0-dev   1.53.0-dev

And I have attached all of the benchmark results. cpuProfAfter.zip

hueypark avatar Dec 18 '22 13:12 hueypark

I implemented a stream-counts feature in benchmark/benchmain and ran my workloads.

What does this feature do?

And I have attached all of the benchmark results. cpuProfAfter.zip

This zip file contains a single file and I don't know how to view the results in there.

The benchmark results that you have posted here are for a very specific case. It uses the following settings, among others:

  • workload type: unconstrained
  • benchmark run time: 10s
  • concurrent calls: 1
  • request size: 1048576B
  • response size: 1048576B

We would really like to see more comprehensive benchmark runs for a code change which is as fundamental as yours. Also, a benchmark run time of 10s is simply not good enough.

easwars avatar Dec 20 '22 17:12 easwars

Also, could you please explain why you want to make the change that you are making? Are you seeing some performance bottlenecks when running specific workloads?

easwars avatar Dec 20 '22 17:12 easwars

What does this feature do?

This feature lets the benchmark's stream API exercise multiple streams. In real-world scenarios, API users rarely send just a single message over a stream.

This zip file contains a single file and I don't know how to view the results in there.

I apologize. It would probably be better to reproduce the issue locally. I will include the necessary commands for reproduction.

We would really like to see more comprehensive benchmark runs for a code change which is as fundamental as yours. Also, a benchmark run time of 10s is simply not good enough.

I agree. I re-ran the benchmarks with a run time of 1m.

result:

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_10240-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
               Title       Before        After Percentage
            TotalOps        31405        31349    -0.18%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     18072.12     18104.19     0.18%
           Allocs/op       216.46       215.70    -0.46%
             ReqT/op   4187333.33   4179866.67    -0.18%
            RespT/op   4187333.33   4179866.67    -0.18%
            50th-Lat   1.903201ms    1.90694ms     0.20%
            90th-Lat   1.958722ms   1.960585ms     0.10%
            99th-Lat   2.007894ms   2.015896ms     0.40%
             Avg-Lat   1.910174ms   1.913565ms     0.18%
           GoVersion     go1.19.4     go1.19.4
         GrpcVersion   1.53.0-dev   1.53.0-dev

streaming-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_10240-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
               Title       Before        After Percentage
            TotalOps         3934         3920    -0.36%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     68941.53     52429.05   -23.95%
           Allocs/op       468.97       449.64    -4.05%
             ReqT/op    524533.33    522666.67    -0.36%
            RespT/op    524533.33    522666.67    -0.36%
            50th-Lat  15.209591ms  15.232099ms     0.15%
            90th-Lat  15.333759ms  15.356509ms     0.15%
            99th-Lat  16.770753ms  17.077881ms     1.83%
             Avg-Lat  15.252835ms  15.309135ms     0.37%
           GoVersion     go1.19.4     go1.19.4
         GrpcVersion   1.53.0-dev   1.53.0-dev

unconstrained-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_10240-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
               Title       Before        After Percentage
            TotalOps            0            0      NaN%
             SendOps        54388        51778    -4.80%
             RecvOps        77008        77023     0.02%
            Bytes/op      6123.87      4086.67   -33.26%
           Allocs/op        21.29        19.23    -9.40%
             ReqT/op   7251733.33   6903733.33    -4.80%
            RespT/op  10267733.33  10269733.33     0.02%
            50th-Lat           0s           0s      NaN%
            90th-Lat           0s           0s      NaN%
            99th-Lat           0s           0s      NaN%
             Avg-Lat           0s           0s      NaN%
           GoVersion     go1.19.4     go1.19.4
         GrpcVersion   1.53.0-dev   1.53.0-dev

unconstrained-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_40ms-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
               Title       Before        After Percentage
            TotalOps            0            0      NaN%
             SendOps      2602356      2842256     9.22%
             RecvOps      2664021      2037056   -23.53%
            Bytes/op      6315.69      3869.00   -38.74%
           Allocs/op        20.39        17.48   -14.71%
             ReqT/op 346980800.00 378967466.67     9.22%
            RespT/op 355202800.00 271607466.67   -23.53%
            50th-Lat           0s           0s      NaN%
            90th-Lat           0s           0s      NaN%
            99th-Lat           0s           0s      NaN%
             Avg-Lat           0s           0s      NaN%
           GoVersion     go1.19.4     go1.19.4
         GrpcVersion   1.53.0-dev   1.53.0-dev

streaming-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_40ms-kbps_10240-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
               Title       Before        After Percentage
            TotalOps           90           90     0.00%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     70131.82     53823.20   -23.25%
           Allocs/op       551.42       533.84    -3.26%
             ReqT/op     12000.00     12000.00     0.00%
            RespT/op     12000.00     12000.00     0.00%
            50th-Lat 668.543496ms 667.398531ms    -0.17%
            90th-Lat 674.226219ms 671.845385ms    -0.35%
            99th-Lat 677.662648ms 675.277176ms    -0.35%
             Avg-Lat 669.060189ms 668.227508ms    -0.12%
           GoVersion     go1.19.4     go1.19.4
         GrpcVersion   1.53.0-dev   1.53.0-dev

unconstrained-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_40ms-kbps_10240-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
               Title       Before        After Percentage
            TotalOps            0            0      NaN%
             SendOps        76866        76860    -0.01%
             RecvOps        76950        77004     0.07%
            Bytes/op      5909.29      3845.26   -34.93%
           Allocs/op        19.95        17.88   -10.03%
             ReqT/op  10248800.00  10248000.00    -0.01%
            RespT/op  10260000.00  10267200.00     0.07%
            50th-Lat           0s           0s      NaN%
            90th-Lat           0s           0s      NaN%
            99th-Lat           0s           0s      NaN%
             Avg-Lat           0s           0s      NaN%
           GoVersion     go1.19.4     go1.19.4
         GrpcVersion   1.53.0-dev   1.53.0-dev

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
               Title       Before        After Percentage
            TotalOps       607492       605531    -0.32%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     18057.20     18106.91     0.27%
           Allocs/op       215.88       215.90     0.00%
             ReqT/op  80998933.33  80737466.67    -0.32%
            RespT/op  80998933.33  80737466.67    -0.32%
            50th-Lat     95.001µs     95.301µs     0.32%
            90th-Lat     108.75µs    109.127µs     0.35%
            99th-Lat    223.706µs    225.913µs     0.99%
             Avg-Lat     98.429µs     98.753µs     0.33%
           GoVersion     go1.19.4     go1.19.4
         GrpcVersion   1.53.0-dev   1.53.0-dev

unconstrained-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
               Title       Before        After Percentage
            TotalOps            0            0      NaN%
             SendOps     17801533     23479365    31.90%
             RecvOps     14600691     18649868    27.73%
            Bytes/op      5866.95      3813.91   -34.99%
           Allocs/op        20.04        17.69   -14.97%
             ReqT/op 2373537733.33 3130582000.00    31.90%
            RespT/op 1946758800.00 2486649066.67    27.73%
            50th-Lat           0s           0s      NaN%
            90th-Lat           0s           0s      NaN%
            99th-Lat           0s           0s      NaN%
             Avg-Lat           0s           0s      NaN%
           GoVersion     go1.19.4     go1.19.4
         GrpcVersion   1.53.0-dev   1.53.0-dev

streaming-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_40ms-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
               Title       Before        After Percentage
            TotalOps           92           91    -1.09%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     70343.83     53902.51   -23.37%
           Allocs/op       566.60       540.46    -4.59%
             ReqT/op     12266.67     12133.33    -1.08%
            RespT/op     12266.67     12133.33    -1.08%
            50th-Lat 655.037524ms 666.202283ms     1.70%
            90th-Lat 656.737879ms  667.99468ms     1.71%
            99th-Lat 657.299189ms 669.693627ms     1.89%
             Avg-Lat 654.943331ms 665.739009ms     1.65%
           GoVersion     go1.19.4     go1.19.4
         GrpcVersion   1.53.0-dev   1.53.0-dev

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_40ms-kbps_10240-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
               Title       Before        After Percentage
            TotalOps          731          722    -1.23%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     18071.45     18069.62    -0.01%
           Allocs/op       217.97       215.36    -0.92%
             ReqT/op     97466.67     96266.67    -1.23%
            RespT/op     97466.67     96266.67    -1.23%
            50th-Lat  82.127316ms  83.219196ms     1.33%
            90th-Lat  82.400964ms  84.170818ms     2.15%
            99th-Lat  82.576049ms   84.53085ms     2.37%
             Avg-Lat  82.161793ms  83.147415ms     1.20%
           GoVersion     go1.19.4     go1.19.4
         GrpcVersion   1.53.0-dev   1.53.0-dev

streaming-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
               Title       Before        After Percentage
            TotalOps       117155       120640     2.97%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     69970.61     53383.67   -23.71%
           Allocs/op       522.46       499.58    -4.40%
             ReqT/op  15620666.67  16085333.33     2.97%
            RespT/op  15620666.67  16085333.33     2.97%
            50th-Lat    494.255µs    482.374µs    -2.40%
            90th-Lat    541.998µs     522.98µs    -3.51%
            99th-Lat    917.151µs    888.174µs    -3.16%
             Avg-Lat      511.8µs    497.027µs    -2.89%
           GoVersion     go1.19.4     go1.19.4
         GrpcVersion   1.53.0-dev   1.53.0-dev

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_40ms-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
               Title       Before        After Percentage
            TotalOps          736          722    -1.90%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     18025.93     18032.04     0.04%
           Allocs/op       214.81       213.43    -0.47%
             ReqT/op     98133.33     96266.67    -1.90%
            RespT/op     98133.33     96266.67    -1.90%
            50th-Lat  81.676483ms   83.26441ms     1.94%
            90th-Lat  82.441415ms  84.380606ms     2.35%
            99th-Lat  82.518705ms  84.539625ms     2.45%
             Avg-Lat  81.612729ms  83.167132ms     1.90%
           GoVersion     go1.19.4     go1.19.4
         GrpcVersion   1.53.0-dev   1.53.0-dev

commands:

go run benchmark/benchmain/main.go -benchtime=1m \
          -workloads=all -preloader=on \
          -maxConcurrentCalls=1 \
          -reqSizeBytes=1000 -respSizeBytes=1000 -streamCounts=8 \
          -resultFile=beforeResult

-- after checking out the buffer-reuse commit

go run benchmark/benchmain/main.go -benchtime=1m \
          -workloads=all -preloader=on \
          -maxConcurrentCalls=1 \
          -reqSizeBytes=1000 -respSizeBytes=1000 -streamCounts=8 \
          -resultFile=afterResult

go run benchmark/benchresult/main.go beforeResult afterResult

hueypark avatar Dec 24 '22 11:12 hueypark

Also, could you please explain why you want to make the change that you are making? Are you seeing some performance bottlenecks when running specific workloads?

I am developing an OLAP database, and our MapReduce framework relies on gRPC to send and receive large amounts of data. Recently, we discovered that gRPC allocates unnecessary memory in the stream API, which slows our system down. To improve performance, we want to reduce that memory allocation.

Actual code link:

  • go source: https://github.com/ab180/lrmr/blob/fb116c37932313e579264d34c0770fc2313e4d8d/output/push_stream.go#L63
  • rpc: https://github.com/ab180/lrmr/blob/fb116c37932313e579264d34c0770fc2313e4d8d/lrmrpb/rpc.proto#L14

hueypark avatar Dec 24 '22 11:12 hueypark

Could you please send out your changes to the benchmarking code as a separate PR? Thanks.

easwars avatar Dec 27 '22 20:12 easwars

Done with https://github.com/grpc/grpc-go/pull/5898. I greatly appreciate your review.

hueypark avatar Dec 27 '22 21:12 hueypark

The problem with the approach taken in this PR is that once a really big message is received on the stream, the memory allocated for it will never be released. So, if a stream sends a 4MB message to start with and then only sends 1B messages for the rest of its life, we will still hold on to the 4MB buffer.
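To make the concern concrete, here is a minimal hypothetical sketch of a per-stream buffer-reuse scheme like the one being discussed (not the PR's actual code); the cached buffer only ever grows, so one large message pins that memory for the stream's lifetime.

// Hypothetical sketch of per-stream buffer reuse; not the code in this PR.
type parser struct {
    buf []byte // reused across messages on the same stream; never shrinks
}

func (p *parser) recvBuffer(msgLen int) []byte {
    if cap(p.buf) < msgLen {
        p.buf = make([]byte, msgLen) // grow to the largest message seen so far
    }
    return p.buf[:msgLen]
}

// After one 4MB message, cap(p.buf) stays at 4MB for the life of the stream,
// even if every subsequent message is only 1B.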

We have discussed options for this in the past, and the option that seems like it would work best is as follows:

  • have a bunch of sync.Pools that cache recv buffers
  • each sync.Pool would cache buffers of a fixed size
  • and we would have sync.Pools for logarithmically increasing buffer sizes. For example, we might have a set of sync.Pools for sizes 10B, 100B, 1KB, 10KB, 100KB etc ...
  • each time we read the headers, and determine the length of the message, we would get the most appropriately sized buffer from the corresponding sync.Pool, and once we are done using the message, we would release it back to the pool

The change described above is more involved and would require more benchmarking. Please let us know if you are interested in this, and we can discuss more before you get started.
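A minimal sketch of that tiered-pool idea, under the assumption of a Get/Put interface and illustrative tier sizes (none of this is the PR's final API):

package bufferpool

import "sync"

// Illustrative tier capacities; a real implementation would tune these.
var tierSizes = []int{1 << 8, 1 << 12, 1 << 16, 1 << 20} // 256B, 4KB, 64KB, 1MB

type tieredBufferPool struct {
    pools []sync.Pool
}

func newTieredBufferPool() *tieredBufferPool {
    p := &tieredBufferPool{pools: make([]sync.Pool, len(tierSizes))}
    for i, size := range tierSizes {
        size := size
        p.pools[i].New = func() any { return make([]byte, 0, size) }
    }
    return p
}

// Get returns a buffer of length n from the smallest tier that fits; anything
// larger than the biggest tier falls back to a plain allocation.
func (p *tieredBufferPool) Get(n int) []byte {
    for i, size := range tierSizes {
        if n <= size {
            return p.pools[i].Get().([]byte)[:n]
        }
    }
    return make([]byte, n)
}

// Put returns a buffer to the tier matching its capacity; oversized buffers
// are dropped and left to the GC. (A production version would store *[]byte
// to avoid an extra allocation when putting a slice into the pool.)
func (p *tieredBufferPool) Put(buf []byte) {
    for i, size := range tierSizes {
        if cap(buf) == size {
            p.pools[i].Put(buf[:0])
            return
        }
    }
}

Because a buffer is always returned to the tier it came from, a single huge message no longer dictates the size of the buffer handed out for every later small message.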

easwars avatar Jan 17 '23 23:01 easwars

I am interested. Please let me know what the next step is so I can proceed and resolve the issue.

hueypark avatar Jan 19 '23 00:01 hueypark

ping @easwars to follow up

arvindbr8 avatar Jan 24 '23 18:01 arvindbr8

I am interested. Please let me know what the next step is so I can proceed and resolve the issue.

A brief overview of the approach that we would like to take is described here: https://github.com/grpc/grpc-go/pull/5862#issuecomment-1386243942. Once you get more familiar with the changes that would be needed to implement that approach, we can discuss again, and talk in greater depth about the proposed set of changes that you intend to make.

One thing we would like to mention up front is that, if there is no significant improvement in the benchmarks after doing all that work, we might have to abandon it. That is something that we would have to live with.

Please let us know when you have a better understanding of the changes to be made, and we can either discuss it here or you can open a new issue for that.

Thanks.

easwars avatar Jan 24 '23 20:01 easwars

A brief overview of the approach that we would like to take is described here: #5862 (comment). Once you get more familiar with the changes that would be needed to implement that approach, we can discuss again, and talk in greater depth about the proposed set of changes that you intend to make.

One thing we would like to mention up front is that, if there is no significant improvement in the benchmarks after doing all that work, we might have to abandon it. That is something that we would have to live with.

Please let us know when you have a better understanding of the changes to be made, and we can either discuss it here or you can open a new issue for that.

Thanks.

I am currently using a single sync.Pool for this implementation, because I am uncertain whether multiple fixed-size sync.Pools would reduce memory usage across the whole program.

Thankfully, the benchmark results are outstanding.

streaming-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_40ms-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
               Title       Before        After Percentage
            TotalOps          736          735    -0.14%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op      9888.73      7889.85   -20.21%
           Allocs/op        73.52        72.59    -1.36%
             ReqT/op     98133.33     98000.00    -0.14%
            RespT/op     98133.33     98000.00    -0.14%
            50th-Lat  81.595871ms     81.625ms     0.04%
            90th-Lat  82.288644ms  82.302893ms     0.02%
            99th-Lat  82.387358ms  82.401457ms     0.02%
             Avg-Lat  81.592392ms  81.652686ms     0.07%
           GoVersion     go1.19.4     go1.19.4
         GrpcVersion   1.53.0-dev   1.53.0-dev

unconstrained-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_40ms-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
               Title       Before        After Percentage
            TotalOps            0            0      NaN%
             SendOps      2857150      3076809     7.69%
             RecvOps      6389357      5638488   -11.75%
            Bytes/op      6882.22      4605.70   -33.09%
           Allocs/op        21.77        21.18     0.00%
             ReqT/op 380953333.33 410241200.00     7.69%
            RespT/op 851914266.67 751798400.00   -11.75%
            50th-Lat           0s           0s      NaN%
            90th-Lat           0s           0s      NaN%
            99th-Lat           0s           0s      NaN%
             Avg-Lat           0s           0s      NaN%
           GoVersion     go1.19.4     go1.19.4
         GrpcVersion   1.53.0-dev   1.53.0-dev

unconstrained-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_40ms-kbps_10240-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
               Title       Before        After Percentage
            TotalOps            0            0      NaN%
             SendOps        76867        76856    -0.01%
             RecvOps        77063        76732    -0.43%
            Bytes/op      5901.64      3896.63   -33.97%
           Allocs/op        19.90        19.89     0.00%
             ReqT/op  10248933.33  10247466.67    -0.01%
            RespT/op  10275066.67  10230933.33    -0.43%
            50th-Lat           0s           0s      NaN%
            90th-Lat           0s           0s      NaN%
            99th-Lat           0s           0s      NaN%
             Avg-Lat           0s           0s      NaN%
           GoVersion     go1.19.4     go1.19.4
         GrpcVersion   1.53.0-dev   1.53.0-dev

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_10240-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
               Title       Before        After Percentage
            TotalOps        31407        31456     0.16%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     18056.47     17702.29    -1.96%
           Allocs/op       216.16       218.27     0.93%
             ReqT/op   4187600.00   4194133.33     0.16%
            RespT/op   4187600.00   4194133.33     0.16%
            50th-Lat   1.902471ms   1.902914ms     0.02%
            90th-Lat   1.960533ms   1.958071ms    -0.13%
            99th-Lat    2.04899ms   2.012327ms    -1.79%
             Avg-Lat   1.910015ms   1.907077ms    -0.15%
           GoVersion     go1.19.4     go1.19.4
         GrpcVersion   1.53.0-dev   1.53.0-dev

streaming-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_10240-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
               Title       Before        After Percentage
            TotalOps        31522        31559     0.12%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op      9724.26      7720.42   -20.61%
           Allocs/op        61.73        61.27     0.00%
             ReqT/op   4202933.33   4207866.67     0.12%
            RespT/op   4202933.33   4207866.67     0.12%
            50th-Lat   1.897042ms   1.897172ms     0.01%
            90th-Lat   1.953226ms   1.953482ms     0.01%
            99th-Lat   2.016064ms   1.991549ms    -1.22%
             Avg-Lat   1.902967ms   1.900766ms    -0.12%
           GoVersion     go1.19.4     go1.19.4
         GrpcVersion   1.53.0-dev   1.53.0-dev

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_40ms-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
               Title       Before        After Percentage
            TotalOps          736          735    -0.14%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     17994.00     17787.98    -1.15%
           Allocs/op       213.51       215.05     0.94%
             ReqT/op     98133.33     98000.00    -0.14%
            RespT/op     98133.33     98000.00    -0.14%
            50th-Lat  81.557537ms  81.671239ms     0.14%
            90th-Lat  82.308758ms  82.404262ms     0.12%
            99th-Lat  82.543432ms  82.571359ms     0.03%
             Avg-Lat  81.579228ms  81.727436ms     0.18%
           GoVersion     go1.19.4     go1.19.4
         GrpcVersion   1.53.0-dev   1.53.0-dev

unconstrained-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_10240-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
               Title       Before        After Percentage
            TotalOps            0            0      NaN%
             SendOps        52201        52035    -0.32%
             RecvOps        77056        77022    -0.04%
            Bytes/op      6138.42      4144.01   -32.48%
           Allocs/op        21.19        21.30     0.00%
             ReqT/op   6960133.33   6938000.00    -0.32%
            RespT/op  10274133.33  10269600.00    -0.04%
            50th-Lat           0s           0s      NaN%
            90th-Lat           0s           0s      NaN%
            99th-Lat           0s           0s      NaN%
             Avg-Lat           0s           0s      NaN%
           GoVersion     go1.19.4     go1.19.4
         GrpcVersion   1.53.0-dev   1.53.0-dev

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_40ms-kbps_10240-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
               Title       Before        After Percentage
            TotalOps          731          731     0.00%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     18040.90     17777.13    -1.46%
           Allocs/op       217.93       220.37     1.38%
             ReqT/op     97466.67     97466.67     0.00%
            RespT/op     97466.67     97466.67     0.00%
            50th-Lat  82.080816ms   82.17037ms     0.11%
            90th-Lat  82.252643ms  82.298411ms     0.06%
            99th-Lat  82.492231ms  82.585971ms     0.11%
             Avg-Lat  82.089176ms  82.186704ms     0.12%
           GoVersion     go1.19.4     go1.19.4
         GrpcVersion   1.53.0-dev   1.53.0-dev

streaming-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_40ms-kbps_10240-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
               Title       Before        After Percentage
            TotalOps          728          729     0.14%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op      9916.45      7894.83   -20.39%
           Allocs/op        73.47        73.11     0.00%
             ReqT/op     97066.67     97200.00     0.14%
            RespT/op     97066.67     97200.00     0.14%
            50th-Lat  82.254364ms  82.170236ms    -0.10%
            90th-Lat  83.288298ms  83.179603ms    -0.13%
            99th-Lat    83.9771ms  83.946449ms    -0.04%
             Avg-Lat  82.461816ms  82.383182ms    -0.10%
           GoVersion     go1.19.4     go1.19.4
         GrpcVersion   1.53.0-dev   1.53.0-dev

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
               Title       Before        After Percentage
            TotalOps       540539       544492     0.73%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     18044.52     17782.27    -1.45%
           Allocs/op       214.97       218.02     1.86%
             ReqT/op  72071866.67  72598933.33     0.73%
            RespT/op  72071866.67  72598933.33     0.73%
            50th-Lat      102.6µs    104.938µs     2.28%
            90th-Lat    126.202µs    125.717µs    -0.38%
            99th-Lat    279.492µs    256.144µs    -8.35%
             Avg-Lat    110.634µs    109.812µs    -0.74%
           GoVersion     go1.19.4     go1.19.4
         GrpcVersion   1.53.0-dev   1.53.0-dev

streaming-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
               Title       Before        After Percentage
            TotalOps       869136       849326    -2.28%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op      9842.30      7849.17   -20.25%
           Allocs/op        68.75        68.31     0.00%
             ReqT/op 115884800.00 113243466.67    -2.28%
            RespT/op 115884800.00 113243466.67    -2.28%
            50th-Lat     65.393µs     68.259µs     4.38%
            90th-Lat     76.473µs     78.303µs     2.39%
            99th-Lat     144.69µs    117.089µs   -19.08%
             Avg-Lat     68.728µs     70.309µs     2.30%
           GoVersion     go1.19.4     go1.19.4
         GrpcVersion   1.53.0-dev   1.53.0-dev

unconstrained-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-
               Title       Before        After Percentage
            TotalOps            0            0      NaN%
             SendOps     14678884     18847588    28.40%
             RecvOps     13348284     15292041    14.56%
            Bytes/op      5925.54      3877.99   -34.56%
           Allocs/op        20.23        19.77    -4.94%
             ReqT/op 1957184533.33 2513011733.33    28.40%
            RespT/op 1779771200.00 2038938800.00    14.56%
            50th-Lat           0s           0s      NaN%
            90th-Lat           0s           0s      NaN%
            99th-Lat           0s           0s      NaN%
             Avg-Lat           0s           0s      NaN%
           GoVersion     go1.19.4     go1.19.4
         GrpcVersion   1.53.0-dev   1.53.0-dev

commands:

go run benchmark/benchmain/main.go -benchtime=1m \
          -workloads=all -preloader=on \
          -maxConcurrentCalls=1 \
          -reqSizeBytes=1000 -respSizeBytes=1000 \
          -resultFile=beforeResult

-- after checking out the `sync.Pool` commit

go run benchmark/benchmain/main.go -benchtime=1m \
          -workloads=all -preloader=on \
          -maxConcurrentCalls=1 \
          -reqSizeBytes=1000 -respSizeBytes=1000 \
          -resultFile=afterResult

go run benchmark/benchresult/main.go beforeResult afterResult

hueypark avatar Jan 25 '23 11:01 hueypark

Thank you for trying the sync.Pool approach. But we would still not be able to accept the PR without multiple sync.Pools for different-sized messages, because this implementation has the same drawback as the previous one: a single extremely large message at the start of the stream, followed by small messages for the remainder of its lifetime, would still leave the stream holding the big buffer.

One benchmark showing improved results does not mean that a change is appropriate for all use cases.

easwars avatar Jan 30 '23 22:01 easwars

this implementation has the same drawback as the previous one. A single extremely large message at the start of the stream followed by small messages for the remainder of its lifetime would still leave the stream with the big buffer

I don't think this is true now. A GC will recycle anything in the pool, which is what we want, and avoids keeping a big buffer in memory forever.

However, it still uses a single sync.Pool for buffers of any size, so when Get is called it pulls whatever happens to be in the pool, and if that buffer is too small it is thrown away and a larger one is allocated. So while this works, and will look good for some workloads, it might have problems with other workloads. E.g. two concurrent streams, one with large messages and one with small messages, could result in poor usage patterns. Creating multiple pools and retrieving from the appropriate pool might work, but then we need to create a large number of sync.Pools, and I don't know whether that tradeoff is desirable or not. It would probably take some real-world measurements.
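For illustration, here is a rough sketch of the single-pool pattern described above, with hypothetical names (not the PR's code); Get pulls whatever buffer happens to be cached and discards it if it is too small, which is where mixed-size concurrent streams can defeat the pooling.

package recvpool

import "sync"

// Hypothetical single shared pool: buffers of any capacity end up in it.
var recvPool = sync.Pool{New: func() any { return []byte(nil) }}

func getBuffer(n int) []byte {
    buf := recvPool.Get().([]byte)
    if cap(buf) < n {
        buf = make([]byte, n) // pooled buffer too small: drop it, allocate fresh
    }
    return buf[:n]
}

func putBuffer(buf []byte) { recvPool.Put(buf[:0]) }

// With one stream receiving 1MB messages and another receiving 1KB messages,
// the small-message stream keeps checking out (and holding) 1MB buffers while
// the large-message stream keeps discarding 1KB buffers and reallocating.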

Another thing I'd like is a way to leave this off (at least initially) by default, but have some kind of setting to enable it. If it turns out to be mostly beneficial, we would enable it by default and provide a corresponding way to disable it.

dfawley avatar Jan 31 '23 00:01 dfawley

@easwars

But, we would still not be able to accept the PR without multiple sync pools for different sized messages because this implementation has the same drawback as the previous one. A single extremely large message at the start of the stream followed by small messages for the remainder of its lifetime would still leave the stream with the big buffer.

I implemented pools of varying sizes for the parser. cb8827d8caae66a209b3cbf6d4e430f5d69b0a5d

@dfawley

Another thing I'd like is a way to leave this off (at least initially) by default, but have some kind of setting to enable it. If it turns out to be mostly beneficial, we would enable it by default and provide a corresponding way to disable it.

I introduced the "useBytesPoolForParser" option and set its default value to false. bae37e3cae1d8eacd83e09c6077ee72fba78ec1c
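For readers following the thread, here is a hedged sketch of how a parser's receive path might consume the pools of varying sizes mentioned above; the interface, names, and wiring are illustrative and may not match the PR. It reads the standard 5-byte gRPC message prefix (a 1-byte compressed flag plus a 4-byte big-endian length), borrows a buffer of that length from the pool, and expects the caller to return it after the message is decoded.

package recvpath

import (
    "encoding/binary"
    "io"
)

// SharedBufferPool is the pool shape assumed by this sketch.
type SharedBufferPool interface {
    Get(length int) []byte
    Put(buf []byte)
}

// recvMsg reads one length-prefixed gRPC message into a pooled buffer.
func recvMsg(r io.Reader, pool SharedBufferPool) (payloadFormat byte, msg []byte, err error) {
    var header [5]byte // 1-byte compressed flag + 4-byte big-endian length
    if _, err := io.ReadFull(r, header[:]); err != nil {
        return 0, nil, err
    }
    length := binary.BigEndian.Uint32(header[1:])
    buf := pool.Get(int(length)) // instead of make([]byte, length) per message
    if _, err := io.ReadFull(r, buf); err != nil {
        pool.Put(buf)
        return 0, nil, err
    }
    return header[0], buf, nil
    // The caller is responsible for pool.Put(buf) once the message has been
    // unmarshalled, or once it is otherwise done with the bytes.
}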

hueypark avatar Jan 31 '23 16:01 hueypark

@easwars

Also, we should think about a more comprehensive benchmarking plan for these changes. Even though we are gating it behind a dial option, having some good benchmark data can be useful not only for convincing us of the need for these changes, but also as something that we can point our users to when they have a hard time deciding whether they want to use this feature.

In commit c05feed5c94abfc3d94c8eac26af4f1be359ce03, I added support for shared receive buffer pools (nil and simple) to the benchmark. More shared receive buffer options could easily be added to the benchmark in the future.

I also have a benchmark result for a specific case to share.

command:

go run benchmark/benchmain/main.go -benchtime=1m \
          -workloads=unconstrained -preloader=on \
          -maxConcurrentCalls=1 \
          -kbps=0 -latency=0ms \
          -reqSizeBytes=1000 -respSizeBytes=1000

result:

go1.20.1/grpc1.53.0-dev
unconstrained-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sharedRecvBufferPool_nil-:
Bytes/op: 5924.858174205665	Allocs/op: 21.944834484531267	
Histogram (empty)
Number of requests:  16624263	Request throughput:  2.2165684e+09 bit/s
Number of responses: 13076104	Response throughput: 1.7434805333333333e+09 bit/s

go1.20.1/grpc1.53.0-dev
unconstrained-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1000B-respSize_1000B-compressor_off-channelz_false-preloader_true-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sharedRecvBufferPool_simple-:
Bytes/op: 3884.258894307068	Allocs/op: 19.696514863496976	
Histogram (empty)
Number of requests:  19910184	Request throughput:  2.6546912e+09 bit/s
Number of responses: 15540020	Response throughput: 2.0720026666666667e+09 bit/s

hueypark avatar Feb 25 '23 09:02 hueypark

@easwars Would it be possible to consider releasing this in Milestone 1.54? An ETA would be crucial information for me, since this change would greatly benefit my project's performance. I would appreciate it if you could kindly answer this question.

hueypark avatar Mar 15 '23 11:03 hueypark

@easwars Would it be possible to consider releasing this in Milestone 1.54? An ETA would be crucial information for me, since this change would greatly benefit my project's performance. I would appreciate it if you could kindly answer this question.

We are going to be cutting the release branch for 1.54 either today or tomorrow (and we are already late by a week). We only cherrypick important bug fixes into the release branch, after the branch cut and before the actual release. So, I cannot guarantee at this point that we will be able to get this PR into 1.54.

easwars avatar Mar 15 '23 18:03 easwars

I think the changes are mostly moving towards where we want them to be. Thanks for doing this.

Thank you for taking the time to review my work! It's really motivating to know that we're making progress and getting closer to the finish line.

hueypark avatar Mar 15 '23 20:03 hueypark

In addition to the inline comments, there are no tests added in this PR. Let's make sure there's coverage for anything we add.

I have added a test for the shared buffer pool. If you require any additional tests, please feel free to let me know.

hueypark avatar Apr 09 '23 07:04 hueypark

I have added a test for the shared buffer pool. If you require any additional tests, please feel free to let me know.

I think it would be useful to have a test at the gRPC API level. Just something that does a few (10?) RPCs with a shared buffer pool used in the client and server. You could copy something like this test case, and hopefully it won't take long to do as this should be even simpler than that one.
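A rough sketch of the kind of test being suggested, using the bundled health service so it stays self-contained; the commented-out pool options are placeholders for whatever client and server options this PR ends up exposing, and are assumptions rather than real API names:

package pool_test

import (
    "context"
    "net"
    "testing"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    "google.golang.org/grpc/health"
    healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func TestRPCsWithSharedRecvBufferPool(t *testing.T) {
    lis, err := net.Listen("tcp", "localhost:0")
    if err != nil {
        t.Fatalf("net.Listen() failed: %v", err)
    }
    // Placeholder: the server option this PR adds for plugging in a shared
    // receive buffer pool would be passed here.
    s := grpc.NewServer( /* e.g. <server recv-buffer-pool option> */ )
    healthpb.RegisterHealthServer(s, health.NewServer())
    go s.Serve(lis)
    defer s.Stop()

    // Placeholder: the matching dial option would be passed here.
    cc, err := grpc.Dial(lis.Addr().String(),
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        /* e.g. <client recv-buffer-pool option> */
    )
    if err != nil {
        t.Fatalf("grpc.Dial() failed: %v", err)
    }
    defer cc.Close()

    client := healthpb.NewHealthClient(cc)
    for i := 0; i < 10; i++ {
        if _, err := client.Check(context.Background(), &healthpb.HealthCheckRequest{}); err != nil {
            t.Fatalf("Check() RPC #%d failed: %v", i, err)
        }
    }
}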

dfawley avatar Apr 18 '23 21:04 dfawley

I think it would be useful to have a test at the gRPC API level. Just something that does a few (10?) RPCs with a shared buffer pool used in the client and server. You could copy something like this test case, and hopefully it won't take long to do as this should be even simpler than that one.

I have implemented a new test TestRecvBufferPool. 8e8f683ac72b7b74e0bbcee2b6e14fce116dab65

hueypark avatar Apr 19 '23 14:04 hueypark

@dfawley Thank you for your review!

hueypark avatar May 20 '23 06:05 hueypark

Thank you. We can certainly wait for @easwars to return for the final review. His insights would be valuable.

hueypark avatar May 31 '23 12:05 hueypark

Thank you for this PR! I'm testing this patch in our dev environment. The description mentions "streaming RPC", but the code that gets a buffer from the pool is also hit for unary methods, and experiments confirm that. It wasn't clear to me whether this would work for unary methods too. Perhaps the description could clarify that?

Update: Summary of my findings:

  • On the server side, pooling of buffers is supported for streaming methods. For unary methods, received messages are put into a buffer from the pool, but the buffer is never returned to the pool.
  • On the client side, pooling seems to work for messages received by both streaming and unary methods.

pstibrany avatar Jun 08 '23 08:06 pstibrany

Also, it looks like there are some merge conflicts in the benchmark files. I would be OK with moving them to a separate PR as well, if that makes life easier. Thanks.

easwars avatar Jun 27 '23 01:06 easwars

Also, it looks like there are some merge conflicts in the benchmark files. I would be OK with moving them to a separate PR as well, if that makes life easier. Thanks.

I have made a commit to resolve the conflicts. 86d999fa9a50774867c65c4b35e841316a530eff

hueypark avatar Jun 27 '23 12:06 hueypark

Thank you very much for your contribution @hueypark and apologies for the really drawn out review process on this PR.

easwars avatar Jun 27 '23 15:06 easwars