
Approximately 50ms latency increase observed for synchronous vs asynchronous calls

Open cprice404 opened this issue 2 years ago • 18 comments

Describe the bug

We have a very low-latency gRPC service that we have written client code for in several different programming languages. When we load test it in most languages, we are able to observe a p99 latency of < 5ms for our basic APIs.

When calling our service using elixir-grpc, we observe the following:

  • When we load test via several concurrent "worker" Tasks, each of which issues many requests in a tight loop, the latencies look similar to what we see in other programming languages.
  • The faster we issue the requests, the more likely they are to have good (low) latencies.
  • However, if we simplify the driver program to issue requests synchronously, one at a time, in a simple loop, then every request seems to incur some unknown overhead that adds about 50ms of latency (see the sketch below).
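
Roughly speaking, the two driver patterns look like this (simplified sketch; ping/1 and channel are stand-ins for our actual API call and gRPC channel):

# Concurrent workers -- latencies look fine
1..8
|> Task.async_stream(fn _ -> for _ <- 1..10_000, do: ping(channel) end, timeout: :infinity)
|> Stream.run()

# Synchronous loop -- every request picks up ~50ms of extra latency
for _ <- 1..10_000 do
  {micros, _resp} = :timer.tc(fn -> ping(channel) end)
  IO.puts("latency: #{div(micros, 1000)}ms")
end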

To Reproduce

I can provide a simple reproducer script if requested; I have one right now that can be run against our production gRPC service but requires a (free) auth token. Alternatively, I could write a small echo-server gRPC service and provide both the client and the server code to reproduce the issue. If someone is able to help look into this, please let me know which you prefer and I will update the issue.

Expected behavior

Calling APIs synchronously in a simple loop should not exhibit worse (higher) client-side latencies than calling the same APIs asynchronously and concurrently.

Logs

N/A

Protos

Can provide these upon request depending on whether you would prefer to see a reproducer against our production gRPC service or a standalone client/server reproducer.

Versions:

  • OS: Amazon Linux 2023, Amazon Linux 2, macOS Ventura
  • Erlang: 25
  • Elixir: 1.14
  • mix.lock (grpc, gun, cowboy, cowlib): Reproduces with both elixir-grpc 0.5.0 and 0.6.0. The current lock file includes gun 2.0.1, cowboy 2.9, and cowlib 2.11.

Additional context

If anyone is able to help look into this, please let me know what additional context would be useful and we will provide it!

cprice404 avatar Jun 29 '23 15:06 cprice404

This strikes me as a mailbox congestion issue, which would make sense for Erlang and Elixir applications: they perform best when the work can be parallelized.

sleipnir avatar Jun 30 '23 14:06 sleipnir

@cprice404 if you do have a script/livebook to replicate the issue, it would be helpful!

polvalente avatar Jul 03 '23 16:07 polvalente

@polvalente thanks, we'll try to put a simple repro together.

cprice404 avatar Jul 03 '23 18:07 cprice404

I have not been able to reproduce this so far with a local insecure/plaintext (non-TLS) gRPC server.

So it may have something to do with the TLS setup. Providing a completely self-contained example including the server then becomes challenging because of certificates and DNS.

So, attached is a tarball that makes gRPC requests to an actual deployed (development) instance of our service. The tarball contains both a Python and an Elixir client example.

To run the Python example, from the root dir of the expanded tarball (requires pipenv):

cd python_client
./initialize.sh
./run-client.sh

You will see the Python gRPC client making a bunch of requests in a loop, with a fairly consistent latency being printed out (for me, about 10-12ms, but this will depend on your network).

Then, the Elixir client (also from the tarball root):

cd elixir_client
mix deps.get
mix run ./client.exs

This does pretty much the same thing as the Python client, but you will notice an extra ~50ms reported in the latency for each request.

I tried to instrument a local copy of elixir-grpc/gun and the extra time appears to happen in gun.ex -> receive_data/2 -> receive_headers/4 -> await/3 -> :gun.await. But the fact that so far I have only seen this with TLS enabled makes me suspicious that either I have my TLS configured wrong, or that there is something causing the TLS handshake to re-occur between requests or something like that.
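
(The instrumentation was nothing fancy, just wrapping the await call in :timer.tc in my local copy of gun.ex; illustrative only, not the exact diff:)

# inside the gun adapter, around the call that appears to absorb the extra time
{micros, result} = :timer.tc(fn -> :gun.await(conn_pid, stream_ref, timeout) end)
Logger.info(":gun.await took #{div(micros, 1000)}ms")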

And again, the other observation is that if I do a bunch of concurrent async requests instead of serial synchronous requests like this, that ~50ms penalty seems to happen much less frequently (but it still happens).

Attachment: standalone_ping_repro.tgz

cprice404 avatar Jul 03 '23 23:07 cprice404

I also tried changing my tls options to:

tls_options = [verify: :verify_peer, depth: 100, cacerts: :certifi.cacerts()]

In case it was some issue with the :tls_certificate_check library, but that didn't help.
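
(For reference, those options are passed into the stub connection roughly like this; the hostname is a placeholder:)

tls_options = [verify: :verify_peer, depth: 100, cacerts: :certifi.cacerts()]

{:ok, channel} =
  GRPC.Stub.connect("my-service.example.com:443",
    cred: GRPC.Credential.new(ssl: tls_options)
  )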

cprice404 avatar Jul 03 '23 23:07 cprice404

I also wrote a simple program that times gun HTTPS requests in a loop, and I don't see this issue there - but that server is HTTP/1.1. Poking around in the gun docs I saw this:

https://ninenines.eu/docs/en/gun/2.0/guide/internals_tls_over_tls/#_stream_wide_tls_over_tls

Which makes me wonder if this is related to HTTP/2 TLS with gun.

cprice404 avatar Jul 04 '23 00:07 cprice404

The difference is even more dramatic locally: the Elixir client takes ~500ms while the Python client takes ~200ms.

polvalente avatar Jul 04 '23 03:07 polvalente

Some things I noticed running locally:

  1. Changing the adapter to Mint doesn't affect the issue much.
  2. Using the macOS Activity Monitor, we can see that the Python client sends roughly as many packets as it receives, while the Elixir client sends about twice as many packets as it receives.

I suspect that somewhere in the code we're sending an extra request that Python is either not sending or merging into another request somehow.

polvalente avatar Jul 04 '23 04:07 polvalente

I suspect that somewhere in the code we're sending an extra request that Python is either not sending or merging into another request somehow.

or possibly sending the TLS certs/handshake packets more often than the Python client, since so far I haven't been able to reproduce this without TLS. Just a guess though.

cprice404 avatar Jul 05 '23 15:07 cprice404

It could be that the TLS connection isn't actually being reused. What makes me wonder, though, is that I saw the same behavior when using the Mint adapter, which makes me suspect the issue is in this library rather than in the underlying HTTP client.

polvalente avatar Jul 06 '23 01:07 polvalente

@polvalente yeah, I also observed the same thing with Mint, but I'm not familiar enough with it to know whether Mint uses a similar strategy for TLS.

However, this morning I set up a simple HTTP/2 app hosted by nginx with SSL and ran my tests using plain HTTP clients rather than gRPC clients. For the Python client I used httpx, and in the Elixir code I just used gun directly. I got the same performance from both the Python client and gun in this case (in other words, neither of the two clients exhibited the extra 50ms overhead that I am observing with elixir-grpc).

So this is one more data point suggesting that the issue lies somewhere in the elixir-grpc library.

Here's the basic script I was running for gun. Based on my attempt to step through the elixir-grpc gun.ex code, I think this is roughly the same life cycle.

open_opts = %{
  protocols: [:http2],
  transport: :tls,
  tls_opts: [
    verify: :verify_peer,
    cacertfile: "/path/to/minica/certs.pem"
  ]
}

{:ok, conn_pid} = :gun.open(String.to_charlist("myEc2Host"), 8000, open_opts)

case :gun.await_up(conn_pid) do
  {:ok, :http2} -> Logger.info("Initialized HTTP2 connection")
  {:error, e} ->
    Logger.error("Failed to initialize HTTP2 connection: #{inspect(e)}")
    raise inspect(e)
end

for _ <- 1..1_000_000 do
  start_time = :os.system_time(:milli_seconds)
  stream_ref = :gun.get(conn_pid, "/")
  # first await returns the response headers, second await returns the body
  headers = :gun.await(conn_pid, stream_ref)
  Logger.info("Headers: #{inspect(headers)}")
  body = :gun.await(conn_pid, stream_ref)
  Logger.info("Body: #{inspect(body)}")
  duration = :os.system_time(:milli_seconds) - start_time
  Logger.info("Finished request; duration: #{duration}")
end

If it's looking like this isn't the HTTP client's issue, does anyone have advice on next steps for debugging what might be going wrong in the elixir-grpc library code?

cprice404 avatar Jul 06 '23 15:07 cprice404

It could be that somehow the TLS connection isn't being actually reused.

At the HTTP/2 stream level that may be possible. At the TCP/connection level I don't think that's the case, because I have been monitoring my runs with lsof and it appears that only one connection gets opened for the duration of the program.

cprice404 avatar Jul 06 '23 15:07 cprice404

Some things I noticed running locally:

  1. Changing the adapter to Mint doesn't affect the issue much.
  2. Using the macOS Activity Monitor, we can see that the Python client sends roughly as many packets as it receives, while the Elixir client sends about twice as many packets as it receives.

I suspect that somewhere in the code we're sending an extra request that Python is either not sending or merging into another request somehow.

We are seeing 3x higher latency in Elixir vs C# when talking to a local insecure gRPC server, also using Mint, and with exactly the same symptom of Elixir being more chatty: https://github.com/NFIBrokerage/spear/issues/92

(maybe a red herring though)

lessless avatar Oct 09 '23 20:10 lessless

Sorry for the long delay. If changing the underlying client doesn't seem to affect overall performance, there is likely some protocol implementation difference in the layers above. If anyone wants to look into this, I'm happy to accept any help :)

polvalente avatar Nov 27 '23 02:11 polvalente

I apparently have the same issue. I'm not able to reproduce it in a local environment, only in production.

v0idpwn avatar Aug 01 '24 23:08 v0idpwn

Not sure if the cause is the same for you (it seems likely, since you mentioned you're using Amazon Linux), @cprice404, but here the issue was Amazon's load balancer.

The interaction between Nagle's algorithm and TCP delayed ACKs was adding a bit over 40ms to requests; I confirmed this with a TCP dump. Skipping the LB altogether solves the issue.
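
(For anyone who wants to rule out Nagle on the client side: gun passes tcp_opts through to gen_tcp, so something like this should disable it. Untested sketch; the hostname and TLS options are placeholders:)

open_opts = %{
  protocols: [:http2],
  transport: :tls,
  # nodelay: true disables Nagle's algorithm on the client socket
  tcp_opts: [nodelay: true],
  tls_opts: [verify: :verify_peer, cacerts: :certifi.cacerts()]
}

{:ok, conn_pid} = :gun.open(String.to_charlist("my-service.example.com"), 443, open_opts)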

v0idpwn avatar Aug 06 '24 12:08 v0idpwn

@v0idpwn thanks for the info! Some permutations of the testing I mentioned above did indeed involve an AWS NLB. However, I don't observe the same perf issues with gRPC clients in other programming languages, even when talking to the same server, so I still believe that there is/was something going awry in the elixir-grpc code path.

I haven't tried to repro this since last year though, so I'm not sure whether any recent updates to elixir-grpc or the underlying HTTP libs may have improved the situation.

cprice404 avatar Aug 06 '24 17:08 cprice404

That's interesting. In my experience there's a significant difference in the frames sent by this library versus other gRPC clients, but I didn't try them with the load balancer. Will try soon and update here. Maybe it's just a matter of sending one or two additional frames and it will work with the LB.

v0idpwn avatar Aug 06 '24 19:08 v0idpwn