server: Stats handler and InTapHandle called during critical read path
Moving the service handler goroutine forking into the transport (from server.go) would allow other reads to proceed without blocking on the tap/stats handlers.
Anything below line 385, including the stats handler, can easily be moved to another goroutine: https://github.com/grpc/grpc-go/blob/master/internal/transport/http2_server.go#L385
The code between lines 358 and 385 performs some checks and then saves the created stream into the activeStreams map. It holds the transport's mutex while doing so, so moving it into another goroutine risks lock contention.
The InTapHandle code lives between lines 341 and 358 and doesn't use the mutex. If it were acceptable to run it after stream creation, it could also be moved into the new goroutine; however, its documentation says that InTapHandle must run before the stream is created.
This draft shows how the safe part can be moved into its own goroutine: https://github.com/grpc/grpc-go/compare/master...canguler:a?expand=1
I changed my draft code a bit more and ran benchmarks. The new implementation (which forks a goroutine) had ~5% worse QPS and latency. Code here.
Benchmark results:
go run benchmark/benchmain/main.go -benchtime=10s -workloads=unary -compression=off -maxConcurrentCalls=100 -reqSizeBytes=1 -respSizeBytes=1 -networkMode=Local
Without this change:
Unary-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_100-reqSize_1B-respSize_1B-Compressor_off-Preloader_false:
50_Latency: 708.0560 µs 90_Latency: 1323.9940 µs 99_Latency: 1657.3950 µs Avg latency: 849.0950 µs Count: 1177079 9450 Bytes/op 169 Allocs/op
Histogram (unit: µs)
Count: 1177079 Min: 121.9 Max: 28079.0 Avg: 849.10
------------------------------------------------------------
[ 121.905000, 121.906000) 1 0.0% 0.0%
[ 121.906000, 121.911720) 0 0.0% 0.0%
[ 121.911720, 121.950162) 0 0.0% 0.0%
[ 121.950162, 122.208504) 0 0.0% 0.0%
[ 122.208504, 123.944634) 0 0.0% 0.0%
[ 123.944634, 135.611937) 0 0.0% 0.0%
[ 135.611937, 214.019607) 6 0.0% 0.0%
[ 214.019607, 740.941973) 705278 59.9% 59.9% ######
[ 740.941973, 4282.013644) 468485 39.8% 99.7% ####
[ 4282.013644, 28079.046000) 3308 0.3% 100.0%
[ 28079.046000, inf) 1 0.0% 100.0%
Number of requests: 1177079 Request throughput: 941663.2 bit/s
Number of responses: 1177079 Response throughput: 941663.2 bit/s
With this change:
Unary-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_100-reqSize_1B-respSize_1B-Compressor_off-Preloader_false:
50_Latency: 730.0850 µs 90_Latency: 1383.5280 µs 99_Latency: 1969.0170 µs Avg latency: 896.3180 µs Count: 1115219 9460 Bytes/op 169 Allocs/op
Histogram (unit: µs)
Count: 1115219 Min: 272.7 Max: 28344.5 Avg: 896.32
------------------------------------------------------------
[ 272.665000, 272.666000) 1 0.0% 0.0%
[ 272.666000, 272.671723) 0 0.0% 0.0%
[ 272.671723, 272.710203) 0 0.0% 0.0%
[ 272.710203, 272.968918) 0 0.0% 0.0%
[ 272.968918, 274.708349) 0 0.0% 0.0%
[ 274.708349, 286.403149) 0 0.0% 0.0%
[ 286.403149, 365.031371) 21 0.0% 0.0%
[ 365.031371, 893.676332) 827263 74.2% 74.2% #######
[ 893.676332, 4447.940831) 283711 25.4% 99.6% ###
[ 4447.940831, 28344.501000) 4222 0.4% 100.0%
[ 28344.501000, inf) 1 0.0% 100.0%
Number of requests: 1115219 Request throughput: 892175.2 bit/s
Number of responses: 1115219 Response throughput: 892175.2 bit/s
ping @menghanl