grpc-java
Netty is significantly slower than OkHttp
What version of gRPC-Java are you using?
Tried this with all of:
- 1.24.1
- 1.26.0
- 1.27.0
What is your environment?
Both:
- running locally on MacOSX 10.14.4
- running in Kubernetes - Ubuntu 18 image on Ubuntu 18 workers
JDK: Zulu, java 8
What did you expect to see?
The Netty and OkHttp clients should have roughly similar performance.
What did you see instead?
OkHttp is about 5x faster than Netty.
Steps to reproduce the bug
See methodology here: https://github.com/grpc/grpc-java/issues/6685
In those experiments, a server-side streaming RPC is set up where Netty is benchmarked at receiving 40k messages/sec while OkHttp receives 200k messages/sec (with low latency).
Operating System: Debian GNU/Linux rodete
Kernel: Linux 5.2.17-1rodete3-amd64
Architecture: x86-64
# JMH version: 1.21
# VM version: JDK 1.8.0_181-google-v7, OpenJDK 64-Bit Server VM, 25.181-b01
# VM invoker: /usr/local/buildtools/java/jdk8-google-v7-64/jre/bin/java
# VM options: -server -Xms2g -Xmx2g
# Warmup: 10 iterations, 10 s each
# Measurement: 10 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: io.grpc.benchmarks.TransportBenchmark.streamingCallsMessageThroughput
# Parameters: (direct = true, transport = NETTY)
# Run progress: 2.94% complete, ETA 01:50:13
# Fork: 1 of 1
# Warmup Iteration 1: 94226.157 ops/s
# Warmup Iteration 2: 105084.676 ops/s
# Warmup Iteration 3: 102976.096 ops/s
# Warmup Iteration 4: 103634.059 ops/s
# Warmup Iteration 5: 101606.140 ops/s
# Warmup Iteration 6: 105621.753 ops/s
# Warmup Iteration 7: 101878.792 ops/s
# Warmup Iteration 8: 106758.546 ops/s
# Warmup Iteration 9: 99153.446 ops/s
# Warmup Iteration 10: 98462.254 ops/s
Iteration 1: 101904.359 ops/s
Iteration 2: 106616.808 ops/s
Iteration 3: 103847.913 ops/s
Iteration 4: 110363.031 ops/s
Iteration 5: 98187.265 ops/s
Iteration 6: 101939.123 ops/s
Iteration 7: 96408.948 ops/s
Iteration 8: 104320.740 ops/s
Iteration 9: 97434.821 ops/s
Iteration 10: 104429.864 ops/s
# JMH version: 1.21
# VM version: JDK 1.8.0_181-google-v7, OpenJDK 64-Bit Server VM, 25.181-b01
# VM invoker: /usr/local/buildtools/java/jdk8-google-v7-64/jre/bin/java
# VM options: -server -Xms2g -Xmx2g
# Warmup: 10 iterations, 10 s each
# Measurement: 10 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: io.grpc.benchmarks.TransportBenchmark.streamingCallsMessageThroughput
# Parameters: (direct = true, transport = OKHTTP)
# Run progress: 5.88% complete, ETA 01:47:01
# Fork: 1 of 1
# Warmup Iteration 1: 620507.669 ops/s
# Warmup Iteration 2: 677703.279 ops/s
# Warmup Iteration 3: 688452.626 ops/s
# Warmup Iteration 4: 679208.368 ops/s
# Warmup Iteration 5: 670879.479 ops/s
# Warmup Iteration 6: 695535.095 ops/s
# Warmup Iteration 7: 702267.289 ops/s
# Warmup Iteration 8: 705580.656 ops/s
# Warmup Iteration 9: 705918.318 ops/s
# Warmup Iteration 10: 703988.248 ops/s
Iteration 1: 705826.559 ops/s
Iteration 2: 691693.882 ops/s
Iteration 3: 685170.345 ops/s
Iteration 4: 679199.517 ops/s
Iteration 5: 682138.586 ops/s
Iteration 6: 682236.770 ops/s
Iteration 7: 677605.768 ops/s
Iteration 8: 677646.631 ops/s
Iteration 9: 684184.550 ops/s
// The last iteration encountered an error.
Benchmark (direct) (transport) Mode Cnt Score Error Units
TransportBenchmark.streamingCallsMessageThroughput true INPROCESS thrpt 10 4529942.662 ± 21189.484 ops/s
TransportBenchmark.streamingCallsMessageThroughput true NETTY thrpt 10 102545.287 ± 6563.721 ops/s
TransportBenchmark.streamingCallsMessageThroughput true OKHTTP thrpt 10 684212.977 ± 13378.609 ops/s
TransportBenchmark.streamingCallsMessageThroughput false INPROCESS thrpt 10 118260.952 ± 1285.393 ops/s
TransportBenchmark.streamingCallsMessageThroughput false NETTY thrpt 10 101511.581 ± 4171.041 ops/s
TransportBenchmark.streamingCallsMessageThroughput false OKHTTP thrpt 10 694386.589 ± 13061.924 ops/s
Running our transport benchmark, we did observe Netty being significantly slower than OkHttp in streaming RPCs (roughly 6-7x slower in the results above).
We have not investigated the real cause of the big performance gap between Netty and OkHttp. @ejona86 suspected that OkHttp does a good job of combining buffers while Netty produces a list of small fragments, but we do not know for sure. There may be some optimization for Netty to be done in the future.
We tried a workaround that improves Netty's performance somewhat. You could do manual flow control (like this example), requesting 5-10 response messages at a time instead of one (change requestStream.request(1)). This lets the server send more messages at a time and helps saturate the pipe. We tried this trick with request(5) and got roughly a 4x performance improvement on Netty.
To clarify a bit: if this were the write path, I could believe the buffer combining. But this changes based only on the client, so this is the read path. So I believe this is because OkHttp uses locks to access buffered data, whereas Netty has to communicate with the network thread, which adds latency. A while back I made an optimization that would help with this, but I only enabled it server-side, so we don't know if it helps. (There was also a much older PR that did the same thing, but it was considered too complicated. We've known about this for a while, but nobody has really been complaining...)
could do manual flow control (like this example), requesting 5-10 response messages at a time instead of one (change requestStream.request(1)).
Actually, just do request(5) once, and then from then on do request(1) when you complete processing a message. This adds a 4 message "buffer" between the two threads. We saw a bit more performance improvement with request(10), but it was also diminishing returns.
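A minimal sketch of that pattern (request(5) once up front, then request(1) after each processed message) using the public manual flow-control hook, ClientResponseObserver; the Request/Response types and process() are placeholders, not anything defined in this issue:

import io.grpc.stub.ClientCallStreamObserver;
import io.grpc.stub.ClientResponseObserver;

class PipeliningResponseObserver implements ClientResponseObserver<Request, Response> {
  private ClientCallStreamObserver<Request> requestStream;

  @Override
  public void beforeStart(ClientCallStreamObserver<Request> requestStream) {
    this.requestStream = requestStream;
    requestStream.disableAutoInboundFlowControl(); // we will request messages ourselves
    requestStream.request(5);                      // prime a 5-message pipeline
  }

  @Override
  public void onNext(Response value) {
    process(value);           // placeholder for application logic
    requestStream.request(1); // keeps roughly a 4-message buffer between the two threads
  }

  @Override
  public void onError(Throwable t) { /* handle the error */ }

  @Override
  public void onCompleted() { /* handle completion */ }

  private void process(Response value) { /* application-specific handling */ }
}

Passing an instance of this observer to an async server-streaming stub call picks up the deeper pipeline without any server-side change.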
I was also able to reproduce the result. With the request(5) hack, Netty is as fast as OkHttp; with request(10), Netty is about 14% faster.
request(5)
Benchmark (direct) (transport) Mode Cnt Score Error Units
TransportBenchmark.streamingCallsMessageThroughput false NETTY thrpt 10 1266983.028 ± 25327.357 ops/s
TransportBenchmark.streamingCallsMessageThroughput false OKHTTP thrpt 10 1240116.098 ± 24549.262 ops/s
request(10)
Benchmark (direct) (transport) Mode Cnt Score Error Units
TransportBenchmark.streamingCallsMessageThroughput false NETTY thrpt 10 1298600.337 ± 9435.010 ops/s
TransportBenchmark.streamingCallsMessageThroughput false OKHTTP thrpt 10 1132868.537 ± 16746.638 ops/s
OS: 5.6.10-arch1-1; CPU: Intel(R) Core(TM) i7-8650U (Kaby Lake); JRE: JDK 1.8.0_242, OpenJDK 64-Bit Server VM, 25.242-b08
diff
diff --git a/stub/src/main/java/io/grpc/stub/ClientCalls.java b/stub/src/main/java/io/grpc/stub/ClientCalls.java
index aa507bbd0..315fd37f7 100644
--- a/stub/src/main/java/io/grpc/stub/ClientCalls.java
+++ b/stub/src/main/java/io/grpc/stub/ClientCalls.java
@@ -560,7 +560,7 @@ public final class ClientCalls {
// TODO(ejona86): determine how to allow ClientCall.cancel() in case of application error.
private static final class BlockingResponseStream<T> implements Iterator<T> {
// Due to flow control, only needs to hold up to 2 items: 1 for value, 1 for close.
- private final BlockingQueue<Object> buffer = new ArrayBlockingQueue<>(2);
+ private final BlockingQueue<Object> buffer = new ArrayBlockingQueue<>(6);
private final StartableListener<T> listener = new QueuingListener();
private final ClientCall<?, T> call;
/** May be null. */
@@ -680,7 +680,7 @@ public final class ClientCalls {
@Override
void onStart() {
- call.request(1);
+ call.request(5);
}
}
}
baseline
Benchmark (direct) (transport) Mode Cnt Score Error Units
TransportBenchmark.streamingCallsMessageThroughput false NETTY thrpt 10 310526.552 ± 3915.171 ops/s
TransportBenchmark.streamingCallsMessageThroughput false OKHTTP thrpt 10 735601.009 ± 19505.660 ops/s
with #5820
Benchmark (direct) (transport) Mode Cnt Score Error Units
TransportBenchmark.streamingCallsMessageThroughput false NETTY thrpt 10 324254.989 ± 3776.471 ops/s
TransportBenchmark.streamingCallsMessageThroughput false OKHTTP thrpt 10 826365.430 ± 37439.794 ops/s
I found a bug in #5820. OkHttp with an executor is still faster, but it looks much better than before (note: Netty has significantly higher byte throughput). With the fix, the results look like the following:
# JMH version: 1.21
# VM version: JDK 1.8.0_232, OpenJDK 64-Bit Server VM, 25.232-b09
# VM invoker: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
# VM options: -server -Xms2g -Xmx2g
# Warmup: 10 iterations, 10 s each
# Measurement: 10 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 10 threads, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: io.grpc.benchmarks.TransportBenchmark.streamingCallsMessageThroughput
# Parameters: (direct = true, transport = NETTY)
Result "io.grpc.benchmarks.TransportBenchmark.streamingCallsMessageThroughput":
898012.915 ±(99.9%) 9824.640 ops/s [Average]
(min, avg, max) = (890221.601, 898012.915, 909516.080), stdev = 6498.391
CI (99.9%): [888188.275, 907837.555] (assumes normal distribution)
# Parameters: (direct = true, transport = OKHTTP)
Result "io.grpc.benchmarks.TransportBenchmark.streamingCallsMessageThroughput":
896438.047 ±(99.9%) 18255.955 ops/s [Average]
(min, avg, max) = (880250.988, 896438.047, 916538.135), stdev = 12075.184
CI (99.9%): [878182.092, 914694.001] (assumes normal distribution)
# Parameters: (direct = false, transport = NETTY)
Result "io.grpc.benchmarks.TransportBenchmark.streamingCallsMessageThroughput":
997336.147 ±(99.9%) 17702.637 ops/s [Average]
(min, avg, max) = (971035.236, 997336.147, 1010493.962), stdev = 11709.198
CI (99.9%): [979633.510, 1015038.784] (assumes normal distribution)
# Parameters: (direct = false, transport = OKHTTP)
Result "io.grpc.benchmarks.TransportBenchmark.streamingCallsMessageThroughput":
1202938.828 ±(99.9%) 26055.783 ops/s [Average]
(min, avg, max) = (1177429.510, 1202938.828, 1223316.251), stdev = 17234.287
CI (99.9%): [1176883.045, 1228994.611] (assumes normal distribution)
Hi, I noticed exactly the same issue when I compared REST (Jetty) vs gRPC (Netty), and gRPC was 3-5 times slower: https://github.com/AlexeyPirogov/RestVsGrpc. @ejona86 kindly pointed me to this thread and advised checking gRPC with OkHttp.
And gRPC with OkHttp is faster than REST (which makes sense).
Benchmark results with the updated #5820. The results seem positive, with no noticeable regression.
Before: https://pastebin.com/kNLRBFvC After: https://pastebin.com/ZV7B9Ufg
With #5820, async streaming went from 334,472 msg/s (Netty) and 802,237 (OkHttp) to 901,221 (Netty) and 1,123,479 (OkHttp). So you can see that I made Netty faster than OkHttp, except that I made a "mistake" and made OkHttp faster as well :smile:. Netty is now 80% of the OkHttp speed, which is a bit disappointing still, but probably "good enough" for now.
Blocking performance, however, is at 704,464 for Netty and 1,298,180 for OkHttp, which means Netty is at 55% of OkHttp's speed, but 75% of its own speed before my change. OkHttp just generally is faster with blocking than async. Fixing blocking would mean tearing down separation of concerns to avoid this thread hop, which seems invasive and would require even more thread-safety work.
The lower async Netty performance could be caused by several things (I have no idea if these are issues, just possible theories for investigation), for example higher CPU usage in Netty itself, or the extra work for returning WINDOW_UPDATE flow control in Netty (enqueuing a Runnable to another thread for each message vs a lock in OkHttp).
These message rates are much higher than most users should need, and a very small amount of batching (e.g., using a repeated field with 5 entries vs 5 separate grpc messages) probably provides much better results for the user, even with OkHttp. Sending zero-byte or few-byte messages is just typically fairly unnecessary.
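As a concrete illustration of that batching suggestion, here is a hedged server-side sketch; it assumes a hypothetical proto like message ItemBatch { repeated Item items = 1; }, which is not something from this issue:

import java.util.Iterator;
import io.grpc.stub.StreamObserver;

final class BatchingSender {
  private static final int BATCH_SIZE = 5;

  // Groups items into one gRPC message with a repeated field instead of one message per item.
  static void streamBatched(Iterator<Item> source, StreamObserver<ItemBatch> responseObserver) {
    ItemBatch.Builder batch = ItemBatch.newBuilder();
    while (source.hasNext()) {
      batch.addItems(source.next());
      if (batch.getItemsCount() == BATCH_SIZE) {
        responseObserver.onNext(batch.build()); // one wire message carries 5 items
        batch = ItemBatch.newBuilder();
      }
    }
    if (batch.getItemsCount() > 0) {
      responseObserver.onNext(batch.build());   // flush the final partial batch
    }
    responseObserver.onCompleted();
  }
}

The trade-off is a little extra latency per batch in exchange for far fewer per-message flow-control and framing operations.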
So I'm going to close this issue. If users still feel driven to use OkHttp instead, then I'd please request they file another issue and tell us what sort of thing they are doing (e.g., async vs blocking, why so many messages) and we can try and dive deeper.
Oh, and I wanted to point out this is all optimizing receiving. Sending could see some gains via corking. But corking has kept being not quite important enough to figure out the precise API we'd expose, as there is a wide design space available and many opinions.
If you are using the async stub API, you can workaround this issue in v1.30 and earlier, by:
// If client is receiving many small messages (This will not work for the blocking iterator stub)
import io.grpc.CallOptions;
import io.grpc.Channel;
import io.grpc.ClientCall;
import io.grpc.ClientInterceptor;
import io.grpc.ForwardingClientCall.SimpleForwardingClientCall;
import io.grpc.Metadata;
import io.grpc.MethodDescriptor;

public class MorePipeliningClientInterceptor implements ClientInterceptor {
  @Override
  public <ReqT, RespT> ClientCall<ReqT, RespT> interceptCall(MethodDescriptor<ReqT, RespT> method,
      CallOptions callOptions, Channel next) {
    return new SimpleForwardingClientCall<ReqT, RespT>(next.newCall(method, callOptions)) {
      @Override
      public void start(Listener<RespT> responseListener, Metadata headers) {
        super.start(responseListener, headers);
        // Request extra messages up front so the transport's pipeline stays full.
        super.request(5);
      }
    };
  }
}
stub = stub.withInterceptors(new MorePipeliningClientInterceptor());
...
// If server is receiving many small messages
public StreamObserver<Request> someMethod(StreamObserver<Response> observer) {
  ServerCallStreamObserver<Response> serverCallObserver = (ServerCallStreamObserver<Response>) observer;
  serverCallObserver.request(5);
  ...
}
The changes that fixed this issue were reverted, fixed, and then reverted again. See https://github.com/grpc/grpc-java/issues/7168
Should this be re-opened?
As I mentioned at https://github.com/grpc/grpc-java/issues/7168#issuecomment-754818059 , MigratingDeframer seems to have a threading issue. Unclear what it is at the moment, but it will need some serious investigation.
With #5820, async streaming went from 334,472 msg/s (Netty) and 802,237 (OkHttp) to 901,221 (Netty) and 1,123,479 (OkHttp)...
...
Blocking performance however is at 704,464 for Netty and 1,298,180 OkHttp...
@ejona86 thanks for providing specific numbers. Was flow control enabled in this case or disabled? If enabled, I'll try to disable it.
I found your posts describing how to disable flow control for async stubs (using disableAutoRequestWithInitial(1)). Is it possible to disable flow control for blocking stubs (assuming blocking stubs use flow control at all)?
Was flow control enabled in this case or disabled?
Enabled. It is always enabled.
Is it possible to disable flow control for blocking stubs (assuming blocking stubs use flow control at all)?
To be clear, the MorePipeliningClientInterceptor doesn't disable flow control; it requests messages earlier to hide the latency the current grpc-netty code introduces. Unfortunately the trick doesn't work with blocking because of the blocking stub's buffer size. Blocking does a neat trick with ThreadlessExecutor, but that means it would notice the additional messages and could deadlock. It would be possible to work around that with some additional hackery (replacing callOptions.withExecutor() and then some nonsense to schedule fake Runnables in the original Executor), but that would be too far down the rabbit hole.
Thank you @ejona86 .
Is it possible to completely disable Flow Control without going too deep?
I compared the performance of batched server streaming (streaming groups of objects, i.e. 'stream repeated X') vs normal streaming ('stream X'). The batched version is significantly faster (which makes sense if flow control is enabled). I'm OK with using the batched version; it's just that it could be a little confusing for other developers (they may wonder whether the groups mean anything).
That is why I wanted to check whether it is possible to get the same performance without batches but with flow control disabled.
Is it possible to completely disable Flow Control without going too deep?
As I said, flow control is always enabled. And the problem I mentioned before with the blocking stub would also apply if you disabled flow control.
I'd suggest using the async stub with MorePipeliningClientInterceptor. The iterator-based blocking stub is convenient, but it isn't without its problems (no easy way to kill the RPC, trivial to leak the RPC, exceptions during iteration). If you just use the async API and put the input from the resulting callbacks on a BlockingQueue, you can get similar behavior without too much boilerplate, as in the sketch below. I wish you didn't need to, but it isn't that bad of an option.
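A minimal sketch of that approach, assuming placeholder SomeServiceGrpc.SomeServiceStub, Request, and Response types (none of these names come from this thread):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import io.grpc.stub.StreamObserver;

final class BlockingStyleConsumer {
  private static final Object COMPLETED = new Object();  // end-of-stream sentinel

  void consume(SomeServiceGrpc.SomeServiceStub asyncStub, Request request)
      throws InterruptedException {
    BlockingQueue<Object> queue = new LinkedBlockingQueue<>();

    asyncStub
        .withInterceptors(new MorePipeliningClientInterceptor())
        .someServerStreamingMethod(request, new StreamObserver<Response>() {
          @Override public void onNext(Response value) { queue.add(value); }
          @Override public void onError(Throwable t) { queue.add(t); }
          @Override public void onCompleted() { queue.add(COMPLETED); }
        });

    Object next;
    while ((next = queue.take()) != COMPLETED) {
      if (next instanceof Throwable) {
        throw new RuntimeException((Throwable) next);  // surface RPC failure to the caller
      }
      process((Response) next);  // placeholder for application logic
    }
  }

  private void process(Response value) {
    // application-specific handling
  }
}

Note that the LinkedBlockingQueue is unbounded, so unlike the blocking stub this does not push back on the server; if the consumer can fall far behind, consider bounding the queue and switching to manual flow control.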
If you are using the async stub API, you can work around this issue in v1.30 and earlier with the MorePipeliningClientInterceptor client-side workaround and the serverCallObserver.request(5) server-side workaround quoted above.
With MorePipeliningClientInterceptor, will messages arrive out of order? The executor submits 5 message tasks; will those tasks execute concurrently?
StreamObserver is not thread-safe, so gRPC will not call onNext() while a previous onNext() is still in progress. Messages will be delivered in order.
Any update on this?
The trouble seen by migrating deframer is suspected to be related to issues with RejectedExecutionException in ThreadlessExecutor. We keep trying to enable RejectedExecutionException to find bugs, and we keep finding and fixing bugs. The most recent attempt was reverted in https://github.com/grpc/grpc-java/pull/10574 .
The trouble seen by migrating deframer is suspected to be related to issues with RejectedExecutionException in ThreadlessExecutor. We keep trying to enable RejectedExecutionException to find bugs, and we keep finding and fixing bugs. The most recent attempt was reverted in #10574 .
any "reliable" repro? I can take a peek.
@natl-set, are you impacted by the worse performance? If so, there are workarounds available for both client and server: https://github.com/grpc/grpc-java/issues/6696#issuecomment-647560419