grpc-java
Netty is significantly slower than OkHttp
What version of gRPC-Java are you using?
Tried this with all of:
- 1.24.1
- 1.26.0
- 1.27.0
What is your environment?
Both:
- running locally on MacOSX 10.14.4
- running in Kubernetes - Ubuntu 18 image on Ubuntu 18 workers
JDK: Zulu, java 8
What did you expect to see?
The Netty and OkHttp clients should have roughly similar performance.
What did you see instead?
OkHttp is about 5x faster than Netty.
Steps to reproduce the bug
See methodology here: https://github.com/grpc/grpc-java/issues/6685
In those experiments, a server-side streaming RPC is set up where Netty is benchmarked at receiving 40k messages/sec while OkHttp receives 200k messages/sec (with low latency).
Operating System: Debian GNU/Linux rodete
Kernel: Linux 5.2.17-1rodete3-amd64
Architecture: x86-64
# JMH version: 1.21
# VM version: JDK 1.8.0_181-google-v7, OpenJDK 64-Bit Server VM, 25.181-b01
# VM invoker: /usr/local/buildtools/java/jdk8-google-v7-64/jre/bin/java
# VM options: -server -Xms2g -Xmx2g
# Warmup: 10 iterations, 10 s each
# Measurement: 10 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: io.grpc.benchmarks.TransportBenchmark.streamingCallsMessageThroughput
# Parameters: (direct = true, transport = NETTY)
# Run progress: 2.94% complete, ETA 01:50:13
# Fork: 1 of 1
# Warmup Iteration 1: 94226.157 ops/s
# Warmup Iteration 2: 105084.676 ops/s
# Warmup Iteration 3: 102976.096 ops/s
# Warmup Iteration 4: 103634.059 ops/s
# Warmup Iteration 5: 101606.140 ops/s
# Warmup Iteration 6: 105621.753 ops/s
# Warmup Iteration 7: 101878.792 ops/s
# Warmup Iteration 8: 106758.546 ops/s
# Warmup Iteration 9: 99153.446 ops/s
# Warmup Iteration 10: 98462.254 ops/s
Iteration 1: 101904.359 ops/s
Iteration 2: 106616.808 ops/s
Iteration 3: 103847.913 ops/s
Iteration 4: 110363.031 ops/s
Iteration 5: 98187.265 ops/s
Iteration 6: 101939.123 ops/s
Iteration 7: 96408.948 ops/s
Iteration 8: 104320.740 ops/s
Iteration 9: 97434.821 ops/s
Iteration 10: 104429.864 ops/s
# JMH version: 1.21
# VM version: JDK 1.8.0_181-google-v7, OpenJDK 64-Bit Server VM, 25.181-b01
# VM invoker: /usr/local/buildtools/java/jdk8-google-v7-64/jre/bin/java
# VM options: -server -Xms2g -Xmx2g
# Warmup: 10 iterations, 10 s each
# Measurement: 10 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: io.grpc.benchmarks.TransportBenchmark.streamingCallsMessageThroughput
# Parameters: (direct = true, transport = OKHTTP)
# Run progress: 5.88% complete, ETA 01:47:01
# Fork: 1 of 1
# Warmup Iteration 1: 620507.669 ops/s
# Warmup Iteration 2: 677703.279 ops/s
# Warmup Iteration 3: 688452.626 ops/s
# Warmup Iteration 4: 679208.368 ops/s
# Warmup Iteration 5: 670879.479 ops/s
# Warmup Iteration 6: 695535.095 ops/s
# Warmup Iteration 7: 702267.289 ops/s
# Warmup Iteration 8: 705580.656 ops/s
# Warmup Iteration 9: 705918.318 ops/s
# Warmup Iteration 10: 703988.248 ops/s
Iteration 1: 705826.559 ops/s
Iteration 2: 691693.882 ops/s
Iteration 3: 685170.345 ops/s
Iteration 4: 679199.517 ops/s
Iteration 5: 682138.586 ops/s
Iteration 6: 682236.770 ops/s
Iteration 7: 677605.768 ops/s
Iteration 8: 677646.631 ops/s
Iteration 9: 684184.550 ops/s
// The last iteration encountered an error.
Benchmark (direct) (transport) Mode Cnt Score Error Units
TransportBenchmark.streamingCallsMessageThroughput true INPROCESS thrpt 10 4529942.662 ± 21189.484 ops/s
TransportBenchmark.streamingCallsMessageThroughput true NETTY thrpt 10 102545.287 ± 6563.721 ops/s
TransportBenchmark.streamingCallsMessageThroughput true OKHTTP thrpt 10 684212.977 ± 13378.609 ops/s
TransportBenchmark.streamingCallsMessageThroughput false INPROCESS thrpt 10 118260.952 ± 1285.393 ops/s
TransportBenchmark.streamingCallsMessageThroughput false NETTY thrpt 10 101511.581 ± 4171.041 ops/s
TransportBenchmark.streamingCallsMessageThroughput false OKHTTP thrpt 10 694386.589 ± 13061.924 ops/s
Running our transport benchmark, we did observe Netty being significantly slower than OkHttp in streaming RPCs (roughly 6-7x slower in the results above).
We have not investigated the real cause of the big performance gap between Netty and OkHttp. @ejona86 suspected that OkHttp does a good job of combining buffers while Netty produces a list of small fragments, but we do not know for sure. There may be some optimization for Netty to be done in the future.
We tried a workaround that improves Netty's performance somewhat. You could do manual flow control (like this example), requesting 5-10 response messages at a time instead of one (change requestStream.request(1)). This lets the server send more messages at a time and helps saturate the pipe. We tried this trick with request(5) and got roughly a 4x performance improvement on Netty.
To clarify a bit: if this were the write path, I could believe the buffer combining. But this changes based only on the client, so this is the read path. So I believe this is because OkHttp uses locks to access buffered data, whereas Netty has to communicate with the network thread, which adds latency. A while back I made an optimization that would help with this, but I only enabled it server-side, so we don't know if it helps. (There was also a much older PR that did the same thing, but it was considered too complicated. We've known about this for a while, but nobody has really been complaining...)
could do manual flow control (like this example), requesting 5-10 response messages at a time instead of one (change requestStream.request(1)).
Actually, just do request(5) once, and then from then on do request(1) when you complete processing a message. This adds a 4 message "buffer" between the two threads. We saw a bit more performance improvement with request(10), but it was also diminishing returns.
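A minimal sketch of that pattern (request(5) once up front, then request(1) after each processed message) using the public manual flow-control hook, ClientResponseObserver; the Request/Response types and process() are placeholders, not anything defined in this issue:

import io.grpc.stub.ClientCallStreamObserver;
import io.grpc.stub.ClientResponseObserver;

class PipeliningResponseObserver implements ClientResponseObserver<Request, Response> {
  private ClientCallStreamObserver<Request> requestStream;

  @Override
  public void beforeStart(ClientCallStreamObserver<Request> requestStream) {
    this.requestStream = requestStream;
    requestStream.disableAutoInboundFlowControl(); // we will request messages ourselves
    requestStream.request(5);                      // prime a 5-message pipeline
  }

  @Override
  public void onNext(Response value) {
    process(value);           // placeholder for application logic
    requestStream.request(1); // keeps roughly a 4-message buffer between the two threads
  }

  @Override
  public void onError(Throwable t) { /* handle the error */ }

  @Override
  public void onCompleted() { /* handle completion */ }

  private void process(Response value) { /* application-specific handling */ }
}

Passing an instance of this observer to an async server-streaming stub call picks up the deeper pipeline without any server-side change.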
I was also able to reproduce the result. With the request(5) hack, Netty is as fast as OkHttp; with request(10), Netty is about 14% faster.
request(5)
Benchmark (direct) (transport) Mode Cnt Score Error Units
TransportBenchmark.streamingCallsMessageThroughput false NETTY thrpt 10 1266983.028 ± 25327.357 ops/s
TransportBenchmark.streamingCallsMessageThroughput false OKHTTP thrpt 10 1240116.098 ± 24549.262 ops/s
request(10)
Benchmark (direct) (transport) Mode Cnt Score Error Units
TransportBenchmark.streamingCallsMessageThroughput false NETTY thrpt 10 1298600.337 ± 9435.010 ops/s
TransportBenchmark.streamingCallsMessageThroughput false OKHTTP thrpt 10 1132868.537 ± 16746.638 ops/s
OS: 5.6.10-arch1-1; CPU: Intel(R) Core(TM) i7-8650U (Kaby Lake); JRE: JDK 1.8.0_242, OpenJDK 64-Bit Server VM, 25.242-b08
diff
diff --git a/stub/src/main/java/io/grpc/stub/ClientCalls.java b/stub/src/main/java/io/grpc/stub/ClientCalls.java
index aa507bbd0..315fd37f7 100644
--- a/stub/src/main/java/io/grpc/stub/ClientCalls.java
+++ b/stub/src/main/java/io/grpc/stub/ClientCalls.java
@@ -560,7 +560,7 @@ public final class ClientCalls {
// TODO(ejona86): determine how to allow ClientCall.cancel() in case of application error.
private static final class BlockingResponseStream<T> implements Iterator<T> {
// Due to flow control, only needs to hold up to 2 items: 1 for value, 1 for close.
- private final BlockingQueue<Object> buffer = new ArrayBlockingQueue<>(2);
+ private final BlockingQueue<Object> buffer = new ArrayBlockingQueue<>(6);
private final StartableListener<T> listener = new QueuingListener();
private final ClientCall<?, T> call;
/** May be null. */
@@ -680,7 +680,7 @@ public final class ClientCalls {
@Override
void onStart() {
- call.request(1);
+ call.request(5);
}
}
}
baseline
Benchmark (direct) (transport) Mode Cnt Score Error Units
TransportBenchmark.streamingCallsMessageThroughput false NETTY thrpt 10 310526.552 ± 3915.171 ops/s
TransportBenchmark.streamingCallsMessageThroughput false OKHTTP thrpt 10 735601.009 ± 19505.660 ops/s
with #5820
Benchmark (direct) (transport) Mode Cnt Score Error Units
TransportBenchmark.streamingCallsMessageThroughput false NETTY thrpt 10 324254.989 ± 3776.471 ops/s
TransportBenchmark.streamingCallsMessageThroughput false OKHTTP thrpt 10 826365.430 ± 37439.794 ops/s
I found a bug in #5820. OkHttp with an executor is still faster, but it looks much better than before (note: Netty has significantly higher byte throughput). With the fix, the results look like the following:
# JMH version: 1.21
# VM version: JDK 1.8.0_232, OpenJDK 64-Bit Server VM, 25.232-b09
# VM invoker: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
# VM options: -server -Xms2g -Xmx2g
# Warmup: 10 iterations, 10 s each
# Measurement: 10 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 10 threads, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: io.grpc.benchmarks.TransportBenchmark.streamingCallsMessageThroughput
# Parameters: (direct = true, transport = NETTY)
Result "io.grpc.benchmarks.TransportBenchmark.streamingCallsMessageThroughput":
898012.915 ±(99.9%) 9824.640 ops/s [Average]
(min, avg, max) = (890221.601, 898012.915, 909516.080), stdev = 6498.391
CI (99.9%): [888188.275, 907837.555] (assumes normal distribution)
# Parameters: (direct = true, transport = OKHTTP)
Result "io.grpc.benchmarks.TransportBenchmark.streamingCallsMessageThroughput":
896438.047 ±(99.9%) 18255.955 ops/s [Average]
(min, avg, max) = (880250.988, 896438.047, 916538.135), stdev = 12075.184
CI (99.9%): [878182.092, 914694.001] (assumes normal distribution)
# Parameters: (direct = false, transport = NETTY)
Result "io.grpc.benchmarks.TransportBenchmark.streamingCallsMessageThroughput":
997336.147 ±(99.9%) 17702.637 ops/s [Average]
(min, avg, max) = (971035.236, 997336.147, 1010493.962), stdev = 11709.198
CI (99.9%): [979633.510, 1015038.784] (assumes normal distribution)
# Parameters: (direct = false, transport = OKHTTP)
Result "io.grpc.benchmarks.TransportBenchmark.streamingCallsMessageThroughput":
1202938.828 ±(99.9%) 26055.783 ops/s [Average]
(min, avg, max) = (1177429.510, 1202938.828, 1223316.251), stdev = 17234.287
CI (99.9%): [1176883.045, 1228994.611] (assumes normal distribution)
Hi, I noticed exactly the same issue when I compared REST (Jetty) vs gRPC (Netty), and gRPC was 3-5 times slower: https://github.com/AlexeyPirogov/RestVsGrpc. @ejona86 kindly pointed me to this thread and advised checking gRPC with OkHttp.
And gRPC with OkHttp is faster than REST (which makes sense).
Benchmark results with the updated #5820. The results seem positive, with no noticeable regression.
Before: https://pastebin.com/kNLRBFvC After: https://pastebin.com/ZV7B9Ufg
With #5820, async streaming went from 334,472 msg/s (Netty) and 802,237 (OkHttp) to 901,221 (Netty) and 1,123,479 (OkHttp). So you can see that I made Netty faster than OkHttp, except that I made a "mistake" and made OkHttp faster as well :smile:. Netty is now 80% of the OkHttp speed, which is a bit disappointing still, but probably "good enough" for now.
Blocking performance, however, is at 704,464 for Netty and 1,298,180 for OkHttp, which means Netty is at 55% of OkHttp's speed, but 75% of its own speed before my change. OkHttp just generally is faster with blocking than async. Fixing blocking would mean tearing down separation of concerns to avoid this thread hop, which seems invasive and would require even more thread-safety work.
The lower async Netty performance could be caused by several things (I have no idea if these are issues, just possible theories for investigation), for example higher CPU usage in Netty itself, or the extra work for returning WINDOW_UPDATE flow control in Netty (enqueuing a Runnable to another thread for each message vs a lock in OkHttp).
These message rates are much higher than most users should need, and a very small amount of batching (e.g., using a repeated field with 5 entries vs 5 separate grpc messages) probably provides much better results for the user, even with OkHttp. Sending zero-byte or few-byte messages is just typically fairly unnecessary.
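As a concrete illustration of that batching suggestion, here is a hedged server-side sketch; it assumes a hypothetical proto like message ItemBatch { repeated Item items = 1; }, which is not something from this issue:

import java.util.Iterator;
import io.grpc.stub.StreamObserver;

final class BatchingSender {
  private static final int BATCH_SIZE = 5;

  // Groups items into one gRPC message with a repeated field instead of one message per item.
  static void streamBatched(Iterator<Item> source, StreamObserver<ItemBatch> responseObserver) {
    ItemBatch.Builder batch = ItemBatch.newBuilder();
    while (source.hasNext()) {
      batch.addItems(source.next());
      if (batch.getItemsCount() == BATCH_SIZE) {
        responseObserver.onNext(batch.build()); // one wire message carries 5 items
        batch = ItemBatch.newBuilder();
      }
    }
    if (batch.getItemsCount() > 0) {
      responseObserver.onNext(batch.build());   // flush the final partial batch
    }
    responseObserver.onCompleted();
  }
}

The trade-off is a little extra latency per batch in exchange for far fewer per-message flow-control and framing operations.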
So I'm going to close this issue. If users still feel driven to use OkHttp instead, then I'd please request they file another issue and tell us what sort of thing they are doing (e.g., async vs blocking, why so many messages) and we can try and dive deeper.
Oh, and I wanted to point out this is all optimizing receiving. Sending could see some gains via corking. But corking has kept being not quite important enough to figure out the precise API we'd expose, as there is a wide design space available and many opinions.
If you are using the async stub API, you can workaround this issue in v1.30 and earlier, by:
// If client is receiving many small messages (This will not work for the blocking iterator stub)
import io.grpc.CallOptions;
import io.grpc.Channel;
import io.grpc.ClientCall;
import io.grpc.ClientInterceptor;
import io.grpc.ForwardingClientCall.SimpleForwardingClientCall;
import io.grpc.Metadata;
import io.grpc.MethodDescriptor;

public class MorePipeliningClientInterceptor implements ClientInterceptor {
  @Override
  public <ReqT, RespT> ClientCall<ReqT, RespT> interceptCall(MethodDescriptor<ReqT, RespT> method,
      CallOptions callOptions, Channel next) {
    return new SimpleForwardingClientCall<ReqT, RespT>(next.newCall(method, callOptions)) {
      @Override
      public void start(Listener<RespT> responseListener, Metadata headers) {
        super.start(responseListener, headers);
        // Request extra messages up front so the transport's pipeline stays full.
        super.request(5);
      }
    };
  }
}
stub = stub.withInterceptors(new MorePipeliningClientInterceptor());
...
// If server is receiving many small messages
public StreamObserver<Request> someMethod(StreamObserver<Response> observer) {
  ServerCallStreamObserver<Response> serverCallObserver = (ServerCallStreamObserver<Response>) observer;
  serverCallObserver.request(5);
  ...
}
The changes that fixed this issue were reverted, fixed, and then reverted again. See https://github.com/grpc/grpc-java/issues/7168
Should this be re-opened?
As I mentioned at https://github.com/grpc/grpc-java/issues/7168#issuecomment-754818059 , MigratingDeframer seems to have a threading issue. Unclear what it is at the moment, but it will need some serious investigation.
With #5820, async streaming went from 334,472 msg/s (Netty) and 802,237 (OkHttp) to 901,221 (Netty) and 1,123,479 (OkHttp)...
...
Blocking performance however is at 704,464 for Netty and 1,298,180 OkHttp...
@ejona86 thanks for providing specific numbers. Was flow control enabled in this case or disabled? If enabled, I'll try to disable it.
I found your posts describing how to disable flow control for async stubs (using disableAutoRequestWithInitial(1)). Is it possible to disable flow control for blocking stubs (assuming blocking stubs use flow control at all)?
Was flow control enabled in this case or disabled?
Enabled. It is always enabled.
Is it possible to disable flow control for blocking stubs (assuming blocking stubs use flow control at all)?
To be clear, the MorePipeliningClientInterceptor doesn't disable flow control; it requests messages earlier to hide the latency the current grpc-netty code introduces. Unfortunately the trick doesn't work with blocking because of the blocking stub's buffer size. Blocking does a neat trick with ThreadlessExecutor, but that means it would notice the additional messages and could deadlock. It would be possible to work around that with some additional hackery (replacing callOptions.withExecutor() and then some nonsense to schedule fake Runnables in the original Executor), but that would be too far down the rabbit hole.
Thank you @ejona86 .
Is it possible to completely disable Flow Control without going too deep?
I compared the performance of batched server streaming (streaming groups of objects, i.e. 'stream repeated X') vs normal streaming ('stream X'). The batched version is significantly faster (which makes sense if flow control is enabled). I'm OK with using the batched version; it's just that it could be a little confusing for other developers (they may wonder whether the groups mean anything).
That is why I wanted to check whether it is possible to get the same performance without batches but with flow control disabled.
Is it possible to completely disable Flow Control without going too deep?
As I said, flow control is always enabled. And the problem I mentioned before with the blocking stub would also apply if you disabled flow control.
I'd suggest using the async stub with MorePipeliningClientInterceptor. The iterator-based blocking stub is convenient, but it isn't without its problems (no easy way to kill the RPC, trivial to leak the RPC, exceptions during iteration). If you just use the async API and put the input from the resulting callbacks on a BlockingQueue, you can get similar behavior without too much boilerplate, as in the sketch below. I wish you didn't need to, but it isn't that bad of an option.
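A minimal sketch of that approach, assuming placeholder SomeServiceGrpc.SomeServiceStub, Request, and Response types (none of these names come from this thread):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import io.grpc.stub.StreamObserver;

final class BlockingStyleConsumer {
  private static final Object COMPLETED = new Object();  // end-of-stream sentinel

  void consume(SomeServiceGrpc.SomeServiceStub asyncStub, Request request)
      throws InterruptedException {
    BlockingQueue<Object> queue = new LinkedBlockingQueue<>();

    asyncStub
        .withInterceptors(new MorePipeliningClientInterceptor())
        .someServerStreamingMethod(request, new StreamObserver<Response>() {
          @Override public void onNext(Response value) { queue.add(value); }
          @Override public void onError(Throwable t) { queue.add(t); }
          @Override public void onCompleted() { queue.add(COMPLETED); }
        });

    Object next;
    while ((next = queue.take()) != COMPLETED) {
      if (next instanceof Throwable) {
        throw new RuntimeException((Throwable) next);  // surface RPC failure to the caller
      }
      process((Response) next);  // placeholder for application logic
    }
  }

  private void process(Response value) {
    // application-specific handling
  }
}

Note that the LinkedBlockingQueue is unbounded, so unlike the blocking stub this does not push back on the server; if the consumer can fall far behind, consider bounding the queue and switching to manual flow control.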
If you are using the async stub API, you can work around this issue in v1.30 and earlier with the MorePipeliningClientInterceptor client-side workaround and the serverCallObserver.request(5) server-side workaround quoted above.
With MorePipeliningClientInterceptor, will messages arrive out of order? The executor submits 5 message tasks; will those tasks execute concurrently?
StreamObserver is not thread-safe, so gRPC will not call onNext() while a previous onNext() is still in progress. Messages will be delivered in order.
Any update on this?
The trouble seen by migrating deframer is suspected to be related to issues with RejectedExecutionException in ThreadlessExecutor. We keep trying to enable RejectedExecutionException to find bugs, and we keep finding and fixing bugs. The most recent attempt was reverted in https://github.com/grpc/grpc-java/pull/10574 .
The trouble seen by migrating deframer is suspected to be related to issues with RejectedExecutionException in ThreadlessExecutor. We keep trying to enable RejectedExecutionException to find bugs, and we keep finding and fixing bugs. The most recent attempt was reverted in #10574 .
any "reliable" repro? I can take a peek.
@natl-set, are you impacted by the worse performance? If so, there are workarounds available for both client and server: https://github.com/grpc/grpc-java/issues/6696#issuecomment-647560419