long-distance hops - latency doubles when msg exceeds 16kb

Open AlonKaftan opened this issue 3 years ago • 7 comments

Hi. We have a Go gRPC client (1.39) running in the US and a Go gRPC server (1.39) in APAC, both on AWS. We open a gRPC channel and execute unary calls. When the request message size is below 16KB, round-trip latency is 240ms. When the request message crosses 16KB (e.g. 30KB), round-trip latency jumps to 440ms. When it crosses 100KB, round-trip latency jumps to 650ms. (The response message was kept below 16KB.)

Also, if we increase the rate of the 16KB+ messages from 1/sec to 2/sec or more, the round-trip latency drops back to 240ms. Likewise, if we run a few dummy (tiny-payload) calls per second in parallel with the big (30KB) payload calls on the same connection, the latency of the big-payload calls is reduced to 240ms as well.

We captured the traffic with tcpdump: the first 16KB frame is sent from the client to the server, but the second one is not sent until an ACK comes back from the server. In our case the message size is 30KB and the round-trip latency for each frame is ~220ms, so the call takes ~440ms instead of ~220ms.

We set InitialWindowSize and InitialConnWindowSize to 1MB on both server and client - no change. We checked both unary RPCs and streaming RPCs - same result. We set WriteBufferSize and ReadBufferSize to zero (write/read directly to/from the wire) - the 95th-percentile latency remained the same, but the average latency dropped by 100ms; we are not sure why it had this effect.
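For reference, a minimal sketch (not the reporter's code) of how these client dial options and the matching server options can be set explicitly, using the values described above; the target address is a placeholder:

```go
// Hypothetical sketch of the configuration described above: 1MB stream and
// connection flow-control windows plus zero-sized write/read buffers, on both sides.
package main

import (
	"log"

	"google.golang.org/grpc"
)

func main() {
	// Client side.
	conn, err := grpc.Dial("apac-server.example.com:6666",
		grpc.WithInsecure(),                   // plaintext, for the sketch only
		grpc.WithInitialWindowSize(1<<20),     // 1MB per-stream flow-control window
		grpc.WithInitialConnWindowSize(1<<20), // 1MB per-connection flow-control window
		grpc.WithWriteBufferSize(0),           // write directly to the wire
		grpc.WithReadBufferSize(0),            // read directly from the wire
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	// Server side: the equivalent options.
	_ = grpc.NewServer(
		grpc.InitialWindowSize(1<<20),
		grpc.InitialConnWindowSize(1<<20),
		grpc.WriteBufferSize(0),
		grpc.ReadBufferSize(0),
	)
}
```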

Again, under all of the conditions above, if we increase the rate of messages to more than 2 or 3 per second, the latency drops to 240ms.

Looking at http2debug=2 logs while executing unary calls, it seems that at the higher rate (more than 2-3 per second) a new RPC somehow uses a stream that was opened by a previous RPC.

We ruled out load balancers: we connected the client and server pods directly and the behaviour remained the same.
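To make the setup concrete, here is a minimal sketch of the kind of client loop described above (not the reporter's actual code; it assumes the standard helloworld Greeter example service, a placeholder address, and a ~30KB string payload):

```go
// Hypothetical client loop: one unary call every 2 seconds with a ~30KB
// request, timing each round trip. Greeter/SayHello is the grpc-go
// helloworld example service, used here only for illustration.
package main

import (
	"context"
	"log"
	"strings"
	"time"

	"google.golang.org/grpc"

	pb "google.golang.org/grpc/examples/helloworld/helloworld"
)

func main() {
	conn, err := grpc.Dial("apac-server.example.com:6666", grpc.WithInsecure())
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
	client := pb.NewGreeterClient(conn)

	payload := strings.Repeat("x", 30*1024) // ~30KB request message

	for {
		start := time.Now()
		if _, err := client.SayHello(context.Background(), &pb.HelloRequest{Name: payload}); err != nil {
			log.Fatalf("SayHello: %v", err)
		}
		log.Printf("round trip took %v", time.Since(start)) // ~440ms at this rate, per the report
		time.Sleep(2 * time.Second)                         // ~1 call every 2s exercises the slow path
	}
}
```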

Are you familiar with this kind of behaviour? Is it expected? Is there any way to overcome it? Any help would be very much appreciated. Thanks!

AlonKaftan avatar Aug 22 '22 07:08 AlonKaftan

Hello, thank you for bringing this issue up. This is pretty vague, so it would be great if you could provide some code that I could pull down to reproduce the issue and, if so, isolate it to a problem specifically in our repository. Latency can be affected by many things in the network, such as proxies. We actually have a tool in our codebase to simulate latency and make the connection behave exactly like yours: https://github.com/grpc/grpc-go/blob/master/benchmark/latency/latency.go. Thanks!

zasweq avatar Aug 30 '22 18:08 zasweq

Hi. We will extract some code for you to reproduce the issue... Thanks

AlonKaftan avatar Sep 08 '22 23:09 AlonKaftan

Hi @zasweq,

We have the needed reproduction for you:

Client and server implementations in Go are located in this public repo - https://github.com/eladshoval/grpc-latency. The README.md contains instructions on how to build and run them (the app is built and packaged into a Docker image).

It should be run from two separate hosts (one for the client, one for the server). We ran it on EC2 in AWS - one instance in us-east-1 and one in ap-southeast-1 (make sure the client is allowed by the server host's security rules and that port 6666 is open).

Reproduction details:

1) Run with a 2-second delay between calls and a 32KB payload (the default value; it can also be configured via the env var MSG_LEN)

docker run -it -e WORK_MODE='client' -e TARGET_HOST='ec2-1-2-3-4.ap-southeast-1.compute.amazonaws.com' -e TARGET_PORT='6666' -e DELAY_BETWEEN_MSGS_MS='2000' pokemon:latest

This yields a round-trip latency of over 620ms measured on the client side:

2022/09/19 11:33:51 Greeting: rate of every 2000 ms, duration = 623.310301ms, count = 200, avg = 623588

2) Run with a 200-millisecond delay between calls and a 32KB payload (the default value; it can also be configured via the env var MSG_LEN)

docker run -it -e WORK_MODE='client' -e TARGET_HOST='ec2-1-2-3-4.ap-southeast-1.compute.amazonaws.com' -e TARGET_PORT='6666' -e DELAY_BETWEEN_MSGS_MS='200' pokemon:latest

This yields a round-trip latency of ~210ms on the client side (about 3x faster):

2022/09/19 11:34:51 Greeting: rate of every 200 ms, duration = 211.670484ms, count = 200, avg = 243993

3) Run with a 2-second delay between calls and a 12KB payload (one that fits in a single HTTP/2 frame)

docker run -it -e WORK_MODE='client' -e TARGET_HOST='ec2-1-2-3-4.ap-southeast-1.compute.amazonaws.com' -e TARGET_PORT='6666' -e MSG_LEN='12000' pokemon:latest

This also yields a round-trip latency of ~210ms on the client side (about 3x faster):

2022/09/20 09:41:41 Greeting: rate of every 2000 ms, duration = 210.770126ms, count = 200, avg = 213674

==> A higher call rate and/or a smaller payload yields lower latency.

As @AlonKaftan described above, we would appreciate any help understanding this behavior better. Needless to say, we are latency-sensitive and want to learn how to use grpc-go to reach the lowest latency in a consistent manner.

If you hit any issue with the code or with running the reproduction, please contact us. Thanks for your assistance!

eladshoval avatar Sep 20 '22 09:09 eladshoval

Hey, thank you so much for the reproduction. This is probably due to the heuristics of the BDP estimation algorithm (it estimates the bandwidth-delay product from bytes sent over time); changing this would mean switching the heuristics, and for your use case the current behavior is probably working as intended given how we think about the algorithm. That would require a lot of engineering cycles, and I don't know if the team has the bandwidth, but we'll see in the next few weeks. In the meantime, could you try manually configuring the window size, especially since you seem to know the size of your payloads?

zasweq avatar Sep 20 '22 17:09 zasweq

@eladshoval, we're not entirely sure what is happening here, but this is unexpected. Can you experiment with this a little bit more, varying the size of your request payload (also try 16KB and 20KB) and some other gRPC settings?

  • https://pkg.go.dev/google.golang.org/grpc#WithWriteBufferSize (try 64KB)
  • https://pkg.go.dev/google.golang.org/grpc#WithInitialConnWindowSize and https://pkg.go.dev/google.golang.org/grpc#WithInitialWindowSize (try 128KB for both together)

What is the RTT reported by ping on this connection?

dfawley avatar Sep 20 '22 23:09 dfawley

Hi @dfawley & @zasweq,

I attempted a run with the advised configuration (it can be seen in the GitHub project). Sadly, the suggestions (with various payload sizes) made no change to the outcome I have already described, matching what we had seen in our own attempts to adjust these configurations before opening the ticket.

I made an additional run with more in-depth logging of grpc-go + http2:

GRPC_GO_LOG_SEVERITY_LEVEL=debug
GODEBUG="http2debug=2"

Findings were detailed in this shared doc - https://docs.google.com/document/d/1m1pxU2OIPPeHIivUx3oePcPJdkHXgCAAHB8N2bpha9c/edit?usp=sharing

In summary, you will see two scenarios/runs, one with a higher call rate and one with a lower call rate, in which a single trip (client->server or server->client) behaves differently: longer latency at the low rate and shorter latency at the high rate.

BTW - Ping is ~209ms (AWS, us-east-1 -> ap-southeast-1)

We highly appreciate your assistance and hope this would help with getting a bit closer to finding the issue.

eladshoval avatar Sep 28 '22 10:09 eladshoval

Hi @dfawley, any findings so far? Any thoughts on how to pursue this? Thanks

AlonKaftan avatar Oct 23 '22 09:10 AlonKaftan

Sorry; I would love to reproduce this and debug, but I just haven't had the time.

It would really help if you could take this example and make it run in a single binary, with the client and server communicating over a connection wrapped with https://pkg.go.dev/google.golang.org/[email protected]/benchmark/latency to inject the 200+ms RTT latency you have. If it reproduces in that environment, it will be a lot easier to track down than needing to run in specific AWS zones. It is also possible the problem is caused by external factors in the setup in which you are testing.

This is an example to show how the latency package can be used with gRPC: https://github.com/grpc/grpc-go/blob/26071c24f30ed6b9ba8a512f26e9f266ec8b083e/benchmark/benchmain/main.go#L267

https://github.com/grpc/grpc-go/blob/26071c24f30ed6b9ba8a512f26e9f266ec8b083e/benchmark/benchmain/main.go#L327-L331 (pass lis to grpc.Server.Serve)
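For anyone picking this up, a minimal sketch of the server half of that suggestion: wrap the real listener with the benchmark/latency package before passing it to grpc.Server.Serve. The ~105ms one-way latency (chosen to approximate the ~210ms ping reported above) and the commented-out service registration are assumptions; the client-side dialer wiring is shown in the benchmain example linked above.

```go
// Hypothetical single-process repro skeleton: wrap the listener with the
// benchmark/latency package so every accepted connection sees the injected
// latency, then serve on the wrapped listener.
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/benchmark/latency"
)

func main() {
	lis, err := net.Listen("tcp", "localhost:0")
	if err != nil {
		log.Fatalf("listen: %v", err)
	}

	// Simulated long-haul link: ~105ms each way gives roughly the ~210ms RTT
	// measured between us-east-1 and ap-southeast-1. Kbps: 0 leaves bandwidth
	// unconstrained.
	nw := &latency.Network{Kbps: 0, Latency: 105 * time.Millisecond, MTU: 1500}

	s := grpc.NewServer()
	// pb.RegisterGreeterServer(s, &greeterServer{}) // register the test service here

	if err := s.Serve(nw.Listener(lis)); err != nil {
		log.Fatalf("serve: %v", err)
	}
}
```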

dfawley avatar Oct 25 '22 16:10 dfawley

Hi @dfawley, we will try; we are currently busy writing workarounds for this issue, as it affects our production systems.

AlonKaftan avatar Nov 07 '22 09:11 AlonKaftan

Hi @dfawley, can you please explain how you feel a reproduction with the library you suggested would help track down the increased latency we see specifically in cross-region communication? We are not sure how this would be better than the supplied test app, which already reproduces the issue. For example, if we cannot reproduce it in a standalone app using your benchmark library (without actual cross-region gRPC communication), would that mean, from your point of view, that everything is OK and there is no issue? (We ask because we are a bit overloaded and want to understand what investing in a further reproduction would get us.) Again, we really appreciate your assistance in getting to the bottom of this.

eladshoval avatar Nov 20 '22 08:11 eladshoval

A big reason to use that latency injection package is that it scopes the problem down to our repository, rather than leaving open the possibility that external factors in the environment in which you are testing (i.e. multiple AWS zones) are responsible. Once scoped down to our repository, it becomes easily reproducible on our end, rather than requiring a deployment to your exact environment. Thanks.

zasweq avatar Nov 28 '22 18:11 zasweq
