grpc-go
long-distance hops - latency doubles when msg exceeds 16KB
Hi,
We have a Go gRPC client (1.39) running in the US and a Go gRPC server (1.39) in APAC, both on AWS. We are opening a gRPC channel and executing unary calls.
- When the request message size is below 16KB, round-trip latency is 240ms.
- When the request message size exceeds 16KB (e.g. 30KB), round-trip latency jumps to 440ms.
- When the request message size exceeds 100KB, round-trip latency jumps to 650ms.
(The response message size was kept below 16KB.)
Also, if we increase the rate of the 16KB+ messages from 1/sec to 2/sec or more, the round-trip latency drops to 240ms. Likewise, if we run a few dummy (tiny payload) calls per second in parallel to the big (30KB) payload calls, on the same connection, the latency of the big payload calls is reduced to 240ms as well.
We captured the traffic with tcpdump: we see the first 16KB frame sent from the client to the server, but the second one is not sent until an ack comes back from the server. In our case the message size is 30KB and the round-trip latency for each frame is ~220ms, so the call takes ~440ms instead of ~220ms.
We set InitialWindowSize and InitialConnWindowSize to 1MB on both server and client - no change. We checked both unary RPCs and streaming RPCs - same result. We set WriteBufferSize and ReadBufferSize to zero (write/read directly to/from the wire) - the 95th-percentile latency remained the same, but the average latency dropped by 100ms; we are not sure why it had this effect.
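For reference, this is roughly how we applied those settings (a sketch only; the address is a placeholder and service registration is omitted):

```go
// Sketch of the settings described above: 1MB windows and zero-sized
// write/read buffers on both client and server. Addresses are placeholders.
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

func dialTuned(target string) (*grpc.ClientConn, error) {
	return grpc.Dial(target,
		grpc.WithInsecure(),
		grpc.WithInitialWindowSize(1<<20),     // 1MB per-stream window
		grpc.WithInitialConnWindowSize(1<<20), // 1MB connection window
		grpc.WithWriteBufferSize(0),           // write directly to the wire
		grpc.WithReadBufferSize(0),            // read directly from the wire
	)
}

func serveTuned(addr string) error {
	srv := grpc.NewServer(
		grpc.InitialWindowSize(1<<20),
		grpc.InitialConnWindowSize(1<<20),
		grpc.WriteBufferSize(0),
		grpc.ReadBufferSize(0),
	)
	// Register the gRPC service on srv here before serving.
	lis, err := net.Listen("tcp", addr)
	if err != nil {
		return err
	}
	return srv.Serve(lis)
}

func main() {
	conn, err := dialTuned("server.example.com:6666")
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
}
```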
Again, in all of the conditions above, if we increase the rate of messages to more than 2 or 3 per second, the latency drops to 240ms.
Looking at the http2debug=2 logs for unary calls, it seems that at the higher rate (more than 2-3 per second) a new RPC somehow reuses a previously opened stream that was created by an earlier RPC.
We ruled out load balancers, as we connected the client & server pods directly and observed that the behaviour remained the same.
Are you familiar with this kind of behaviour? Is it expected? Is there any way to overcome it? Any help would be very much appreciated. Thanks!
Hello, thank you for bringing this issue up. This is pretty vague, so it would be great if you could provide some code that I could pull down and reproduce the issue with and, if it reproduces, isolate it to a problem specifically in our repository. Latency can be affected by many things in the network, such as proxies, etc. We actually have a tool in our codebase to simulate latency and make the connection behave exactly like you want: https://github.com/grpc/grpc-go/blob/master/benchmark/latency/latency.go. Thanks!
Hi. We will extract some code for you to reproduce the issue... Thanks
Hi @zasweq,
We have the needed reproduction for you:
Client & server implementations in Go are located in this public repo - https://github.com/eladshoval/grpc-latency . The README.md contains instructions on how to build and run it (the app is built and packaged as a Docker image).
It should be run from 2 separate hosts (one for the client, one for the server); we ran it on EC2 in AWS - one in the us-east-1 region and one in ap-southeast-1 (make sure the client is allowed in the server host's security rules, along with port 6666).
Reproduction details:
1) Run with a 2-second delay between each call & a 32KB payload (the default value, which can also be configured via the env var MSG_LEN):
docker run -it -e WORK_MODE='client' -e TARGET_HOST='ec2-1-2-3-4.ap-southeast-1.compute.amazonaws.com' -e TARGET_PORT='6666' -e DELAY_BETWEEN_MSGS_MS='2000' pokemon:latest
This yields over 620ms of round-trip latency measured from the client side -
2022/09/19 11:33:51 Greeting: rate of every 2000 ms, duration = 623.310301ms, count = 200, avg = 623588
2) Run with a 200-millisecond delay between each call & a 32KB payload (the default value, which can also be configured via the env var MSG_LEN):
docker run -it -e WORK_MODE='client' -e TARGET_HOST='ec2-1-2-3-4.ap-southeast-1.compute.amazonaws.com' -e TARGET_PORT='6666' -e DELAY_BETWEEN_MSGS_MS='200' pokemon:latest
This yields ~210ms round-trip latency measured from the client side (3x faster) -
2022/09/19 11:34:51 Greeting: rate of every 200 ms, duration = 211.670484ms, count = 200, avg = 243993
3) Run with a 2-second delay between each call & a 12KB payload (one that fits in a single HTTP/2 frame):
docker run -it -e WORK_MODE='client' -e TARGET_HOST='ec2-1-2-3-4.ap-southeast-1.compute.amazonaws.com' -e TARGET_PORT='6666' -e MSG_LEN='12000' pokemon:latest
This also yields ~210ms round-trip latency measured from the client side (3x faster) -
2022/09/20 09:41:41 Greeting: rate of every 2000 ms, duration = 210.770126ms, count = 200, avg = 213674
==> A higher rate and/or a smaller payload yields lower latency.
As @AlonKaftan described above - if you could help us understand this behavior better, we'd appreciate it. Needless to say, we're latency sensitive and want to learn how to use grpc-go to reach the lowest latency in a consistent manner.
For any issue with the code or with running the reproduction - please contact us. Thanks for your assistance!
Hey, thank you so much for the reproduction. This is probably due to the heuristics of the BDP estimation algorithm (which estimates the bandwidth-delay product from the bytes transferred per round trip); changing this behavior would mean switching those heuristics, and from the algorithm's point of view what you're seeing is probably working as intended for your use case. Changing it would require a lot of engineering cycles, and I don't know if the team has the bandwidth, but we'll see in the next few weeks. In the meantime, could you try manually configuring the window size, especially since you seem to know the size of your payloads?
@eladshoval, we're not entirely sure what is happening here, but this is unexpected. Can you experiment with this a little bit more, varying the size of your request payload (also try 16KB and 20KB) and some other gRPC settings?
- https://pkg.go.dev/google.golang.org/grpc#WithWriteBufferSize (try 64KB)
- https://pkg.go.dev/google.golang.org/grpc#WithInitialConnWindowSize and https://pkg.go.dev/google.golang.org/grpc#WithInitialWindowSize (try 128KB for both together)
What is the RTT reported by ping on this connection?
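For reference, a rough sketch of dialing with those suggested values (the target address is a placeholder; adjust credentials to your setup):

```go
// Hedged sketch: 64KB write buffer, 128KB stream and connection windows.
package main

import (
	"log"

	"google.golang.org/grpc"
)

func main() {
	conn, err := grpc.Dial("ec2-1-2-3-4.ap-southeast-1.compute.amazonaws.com:6666",
		grpc.WithInsecure(),                      // plaintext, as in the test app
		grpc.WithWriteBufferSize(64*1024),        // 64KB write buffer
		grpc.WithInitialWindowSize(128*1024),     // 128KB per-stream flow-control window
		grpc.WithInitialConnWindowSize(128*1024), // 128KB connection-level window
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
	// Run the same unary calls as in the reproduction against conn.
}
```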
Hi @dfawley & @zasweq,
I attempted to run with the advised configuration (it can be seen in the GitHub project) - sadly, trying the suggested settings (with various payload sizes) made no change to the outcome from what I have already described, matching what we had already seen in our own attempts to adjust these configurations before opening the ticket.
I made an additional run with more in-depth logging of grpc-go + HTTP/2:
GRPC_GO_LOG_SEVERITY_LEVEL=debug
GODEBUG="http2debug=2"
Findings were detailed in this shared doc - https://docs.google.com/document/d/1m1pxU2OIPPeHIivUx3oePcPJdkHXgCAAHB8N2bpha9c/edit?usp=sharing
In summary - you'll be able to see 2 scenarios/runs, one with a higher rate of calls and one with a lower rate, in which the single-trip time (client->server or server->client) behaves differently: longer latency at the low rate and shorter latency at the high rate.
BTW - Ping is ~209ms (AWS, us-east-1 -> ap-southeast-1)
We highly appreciate your assistance and hope this helps with getting a bit closer to finding the issue.
Hi @dfawley, any findings so far? Thoughts on how to pursue this? Thanks
Sorry; I would love to reproduce this and debug, but I just haven't had the time.
It would really help if you could take this example and make it run in a single binary, with the client and server communicating over a connection wrapped with https://pkg.go.dev/google.golang.org/grpc/benchmark/latency to inject the 200+ms RTT latency you have. If it reproduces in that environment, it will be a lot easier to track down than needing to run in specific AWS zones. It is also possible the problem is caused by external factors in the setup in which you are testing.
This is an example to show how the latency package can be used with gRPC: https://github.com/grpc/grpc-go/blob/26071c24f30ed6b9ba8a512f26e9f266ec8b083e/benchmark/benchmain/main.go#L267
https://github.com/grpc/grpc-go/blob/26071c24f30ed6b9ba8a512f26e9f266ec8b083e/benchmark/benchmain/main.go#L327-L331
(pass lis to grpc.Server.Serve)
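Roughly, a single-binary version could look like the sketch below (service registration and the actual RPCs are placeholders; the Kbps/MTU/latency values are illustrative, not measured):

```go
// Single-process sketch: client and server in one binary, with the
// benchmark/latency package injecting ~200ms RTT (100ms each way).
package main

import (
	"context"
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/benchmark/latency"
)

func main() {
	nw := &latency.Network{Kbps: 0, Latency: 100 * time.Millisecond, MTU: 1500}

	// Server side: wrap the listener so accepted connections see the latency.
	lis, err := net.Listen("tcp", "localhost:0")
	if err != nil {
		log.Fatalf("listen: %v", err)
	}
	lis = nw.Listener(lis)
	srv := grpc.NewServer()
	// Register the test service on srv here.
	go srv.Serve(lis)

	// Client side: wrap the dialer so outgoing connections see the latency.
	dial := nw.ContextDialer((&net.Dialer{}).DialContext)
	conn, err := grpc.Dial(lis.Addr().String(),
		grpc.WithInsecure(),
		grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
			return dial(ctx, "tcp", addr)
		}),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
	// Issue the same unary calls as the reproduction against conn and compare
	// latencies for payloads above and below 16KB.
}
```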
Hi @dfawley, we will try; currently we are busy writing workarounds for this issue, as it affects our production systems.
Hi @dfawley, can you please explain how you feel a reproduction with the library you suggested would help track down the increased latency we see specifically in cross-region communication? We are not sure how this would be better than the test app we supplied, which reproduces the issue. For example, if we cannot reproduce it using your benchmark library in a standalone app (without actual gRPC communication across regions), would that mean, in your view, that everything is OK and there isn't any issue? (The reason for asking is that we're a bit overloaded and want to understand what investing in a further reproduction would get us.) Again, we really appreciate your assistance in getting to the bottom of this.
A big reason to use that latency injection package is that it scopes the problem down to our repository, rather than leaving open the possibility that external factors in the environment in which you are testing (i.e. multiple AWS zones) are responsible. Once scoped down to our repository, it will be easily reproducible on our end, without having to deploy to your exact environment. Thanks.