grpc-java
channel warm up per thread
Hi, all. I'm doing an experiment with gRPC and found that the first use of gRPC in a thread takes much more time than the subsequent ones doing the same thing. I tested it on a 2-core laptop and on 24-core machines, and the same phenomenon occurred on all of them. So I'm wondering whether I'm using gRPC wrong or whether something in grpc-java could be improved to avoid this. Below are my code and part of the report.
import java.text.SimpleDateFormat;
import java.util.Date;
import me.gdbc.mocker.MockServer;
import me.proto.query.QueryProtocol.QueryRequest;
import me.proto.query.QueryProtocol.QueryResponse;
import static me.proto.query.QueryServiceInterfaceGrpc.METHOD_QUERY;
import io.grpc.CallOptions;
import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
public class WarmUp {
  public static SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
  public static QueryRequest request = QueryRequest.newBuilder().setQueryString("").build();

  public static class QueryThread extends Thread {
    private String name;
    private ManagedChannel client;

    public QueryThread(String name, ManagedChannel client) {
      this.name = name;
      this.client = client;
    }

    public void run() {
      for (int i = 0; i < 300; i++) {
        long start = System.currentTimeMillis();
        try {
          //ClientCall<QueryRequest, QueryResponse> call = client.newCall(METHOD_QUERY, CallOptions.DEFAULT);
          QueryResponse ret = io.grpc.stub.ClientCalls.blockingUnaryCall(client, METHOD_QUERY, CallOptions.DEFAULT, request);
          //QueryResponse ret = io.grpc.stub.ClientCalls.blockingUnaryCall(call, request);
          long duration = System.currentTimeMillis() - start;
          String show = "" + ret + "\n" + df.format(new Date()) + "\n" + "cost " + duration + "ms in thread " + name;
          System.out.println(show);
        } catch (Exception e) {
          e.printStackTrace();
        }
      }
    }
  }

  public static void main(String[] args) {
    int port = 9012;
    MockServer server = new MockServer(port);
    try {
      server.start();
      ManagedChannel channel = NettyChannelBuilder.forAddress("127.0.0.1", port).usePlaintext(true).build();
      io.grpc.stub.ClientCalls.blockingUnaryCall(channel, METHOD_QUERY, CallOptions.DEFAULT, request);
      for (int i = 0; i < 4; i++) {
        QueryThread th = new QueryThread(String.valueOf(i), channel);
        th.start();
      }
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
May 01, 2016 2:28:09 PM me.gdbc.mocker.MockServer start
INFO: Server started, listenning on 9012
May 01, 2016 2:28:09 PM io.grpc.internal.ManagedChannelImpl <init>
INFO: [ManagedChannelImpl@326de728] Created with target 127.0.0.1:9012
result {
props {
intVal: 1
}
}
2016-05-01 14:28:10
cost 76ms in thread 0
result {
props {
intVal: 1
}
}
2016-05-01 14:28:10
cost 67ms in thread 1
result {
props {
intVal: 1
}
}
2016-05-01 14:28:10
cost 68ms in thread 3
result {
props {
intVal: 1
}
}
2016-05-01 14:28:10
cost 66ms in thread 2
result {
props {
intVal: 1
}
}
2016-05-01 14:28:10
cost 8ms in thread 0
result {
props {
intVal: 1
}
}
2016-05-01 14:28:10
cost 8ms in thread 1
result {
props {
intVal: 1
}
}
2016-05-01 14:28:10
cost 10ms in thread 3
result {
props {
intVal: 1
}
}
2016-05-01 14:28:10
cost 10ms in thread 2
result {
props {
intVal: 1
}
}
2016-05-01 14:28:10
cost 7ms in thread 1
result {
props {
intVal: 1
}
}
2016-05-01 14:28:10
cost 7ms in thread 0
result {
props {
intVal: 1
}
}
2016-05-01 14:28:10
cost 8ms in thread 3
result {
props {
intVal: 1
}
}
2016-05-01 14:28:10
cost 10ms in thread 2
result {
props {
intVal: 1
}
}
2016-05-01 14:28:10
cost 7ms in thread 1
result {
props {
intVal: 1
}
}
2016-05-01 14:28:10
cost 7ms in thread 0
result {
props {
intVal: 1
}
}
2016-05-01 14:28:10
cost 7ms in thread 3
result {
props {
intVal: 1
}
}
2016-05-01 14:28:10
cost 4ms in thread 1
result {
props {
intVal: 1
}
}
2016-05-01 14:28:10
cost 6ms in thread 0
result {
props {
intVal: 1
}
}
2016-05-01 14:28:10
cost 7ms in thread 3
result {
props {
intVal: 1
}
}
2016-05-01 14:28:10
cost 7ms in thread 1
result {
props {
intVal: 1
}
}
2016-05-01 14:28:10
cost 7ms in thread 0
result {
props {
intVal: 1
}
}
2016-05-01 14:28:10
cost 6ms in thread 3
result {
props {
intVal: 1
}
}
2016-05-01 14:28:10
cost 6ms in thread 1
result {
props {
intVal: 1
}
}
2016-05-01 14:28:10
cost 6ms in thread 0
result {
props {
intVal: 1
}
}
2016-05-01 14:28:10
cost 6ms in thread 3
result {
props {
intVal: 1
}
}
A few things:
- You should measure this kind of time with System.nanoTime().
- The JVM takes a large number of iterations before it will JIT-compile code; this can be as many as 10K iterations, which it doesn't look like your test reaches.
- Please run the benchmark code (in benchmarks/...) to get an idea of what a reasonable latency is.
Hi Carl. This is not meant as a benchmark; I'm not concerned about absolute latency right now. What bothers me is the huge difference in latency between the first gRPC call and the following ones in the same thread. Usually the first one takes about 100ms and the subsequent ones take only about 5ms to do the same thing (the difference is clear even without measuring in nanoseconds). And it doesn't seem to be a JIT problem, since the test never reaches that many iterations.
Could it be because the actual connection is only made when the first call comes through, and subsequent calls don't pay the connection setup penalty?
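If that is the cause, one way to take connection setup out of the first RPC's timing is to ask the channel to connect eagerly and wait until it reports READY before starting the worker threads. A rough sketch, assuming a grpc-java version that provides ManagedChannel.getState(boolean) and notifyWhenStateChanged (newer than the 2016 release used above); awaitReady is a hypothetical helper, not a gRPC API:

import io.grpc.ConnectivityState;
import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class EagerConnect {
  // Hypothetical helper: block until the channel reports READY or the timeout expires,
  // so the first real RPC does not pay the TCP/HTTP2 connection setup cost.
  static void awaitReady(ManagedChannel channel, long timeoutSeconds) throws InterruptedException {
    CountDownLatch ready = new CountDownLatch(1);
    watchState(channel, ready);
    ready.await(timeoutSeconds, TimeUnit.SECONDS);
  }

  static void watchState(ManagedChannel channel, CountDownLatch ready) {
    // getState(true) asks an idle channel to start connecting.
    ConnectivityState state = channel.getState(true);
    if (state == ConnectivityState.READY) {
      ready.countDown();
      return;
    }
    // Re-check whenever the connectivity state changes.
    channel.notifyWhenStateChanged(state, () -> watchState(channel, ready));
  }

  public static void main(String[] args) throws Exception {
    ManagedChannel channel =
        NettyChannelBuilder.forAddress("127.0.0.1", 50051).usePlaintext(true).build();
    awaitReady(channel, 5);
    // ... start the timing threads here; connection setup has already happened ...
  }
}

This only removes the connection handshake from the measurement; as the comments below show, it does not by itself explain the extra cost of the first call on each thread.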
@makdharma, that seems very likely.
@LeeYeeze, the point is that you aren't really comparing "first RPC on a thread"; what you're showing is closer to "gRPC is slow to start". Since each thread is started at the same time, I'd expect them each to have similar performance. To show that it is creating a new thread that is slow, you'd need to create new threads throughout the duration of the test.
@makdharma @ejona86 I invoke a blockingUnaryCall before I start all the threads, so the connection is already set up. I also tested with the threads started at different points throughout the test; the results are almost the same, except that without contention the first call costs less time, though it is still slow. My laptop is not at hand right now, so I will paste my report here later.
@LeeYeeze, I've reproduced what you see. I'm on Linux. Swapping my CPU governor from ondemand to performance didn't change behavior much.
I modified your code, hitting the hello-world-server:
import io.grpc.CallOptions;
import io.grpc.ManagedChannel;
import io.grpc.examples.helloworld.GreeterGrpc;
import io.grpc.examples.helloworld.HelloReply;
import io.grpc.examples.helloworld.HelloRequest;
import io.grpc.netty.NettyChannelBuilder;
import io.grpc.stub.ClientCalls;
import java.util.Date;
import java.util.concurrent.TimeUnit;
public class ThreadWarmUp {
  private static final HelloRequest request = HelloRequest.newBuilder().setName("world").build();

  public static class QueryThread extends Thread {
    private String name;
    private ManagedChannel channel;

    public QueryThread(String name, ManagedChannel channel) {
      this.name = name;
      this.channel = channel;
    }

    public void run() {
      final int type = 0;
      for (int i = 0; i < 3; i++) {
        long start = System.nanoTime();
        try {
          if (type == 0) {
            ClientCalls.blockingUnaryCall(
                channel, GreeterGrpc.METHOD_SAY_HELLO, CallOptions.DEFAULT, request);
          } else if (type == 1) {
            ClientCalls.futureUnaryCall(
                channel.newCall(GreeterGrpc.METHOD_SAY_HELLO, CallOptions.DEFAULT), request).get();
          }
          long duration = TimeUnit.NANOSECONDS.toMicros(System.nanoTime() - start);
          String show = "thread " + name + "\tcost " + duration + "µs";
          System.out.println(show);
        } catch (Exception e) {
          e.printStackTrace();
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    ManagedChannel channel = NettyChannelBuilder.forAddress("127.0.0.1", 50051)
        .usePlaintext(true).build();
    GreeterGrpc.newBlockingStub(channel).sayHello(request);
    for (int i = 0; i < 100; i++) {
      QueryThread th = new QueryThread(String.valueOf(i), channel);
      th.start();
      Thread.sleep(100);
    }
  }
}
It results in things like:
thread 0 cost 7718µs
thread 0 cost 1816µs
thread 0 cost 1743µs
thread 1 cost 7718µs
thread 1 cost 1850µs
thread 1 cost 1852µs
thread 2 cost 7914µs
thread 2 cost 1879µs
thread 2 cost 1729µs
thread 3 cost 8246µs
thread 3 cost 2064µs
thread 3 cost 1875µs
...
thread 96 cost 1507µs
thread 96 cost 1127µs
thread 96 cost 1066µs
thread 97 cost 1815µs
thread 97 cost 1169µs
thread 97 cost 1038µs
thread 98 cost 1712µs
thread 98 cost 960µs
thread 98 cost 902µs
thread 99 cost 1703µs
thread 99 cost 1105µs
thread 99 cost 1234µs
@ejona86 I found that the main time consumer in this per-thread warm-up is the allocate method in NettyWritableBufferAllocator. Specifically, it's the call to allocator.buffer() inside that method that is responsible for the delay.
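If the allocator is indeed where the time goes, the per-thread cost should be reproducible with the allocator alone, without gRPC in the picture. A minimal sketch, assuming Netty's pooled allocator (which the Netty transport uses by default) and an arbitrary 4 KiB buffer size:

import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

public class AllocatorWarmUp {
  public static void main(String[] args) throws Exception {
    PooledByteBufAllocator allocator = PooledByteBufAllocator.DEFAULT;
    for (int t = 0; t < 4; t++) {
      Thread thread = new Thread(() -> {
        for (int i = 0; i < 3; i++) {
          long start = System.nanoTime();
          // The first buffer() call on a thread initializes that thread's
          // thread-local cache / arena bookkeeping, which is where the extra
          // time is suspected to go; later calls hit the already warm pool.
          ByteBuf buf = allocator.buffer(4096);
          long micros = (System.nanoTime() - start) / 1000;
          buf.release();
          System.out.println(Thread.currentThread().getName() + " alloc took " + micros + "µs");
        }
      }, "alloc-" + t);
      thread.start();
      thread.join();
    }
  }
}

If the first allocation on each fresh thread is consistently slower than the rest, that points at the allocator rather than at gRPC's call path.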
We hit the same issue in a similar project using TensorFlow Serving with Java gRPC clients. Is there any way to fix it? @ejona86
Please let us know about any fix or workaround for this issue. @ejona86
I wonder if the latest grpc-java release has fixed this problem?
Any updates on this? @ejona86
Curious if this is similar to https://github.com/grpc/grpc-java/issues/2971
Any update on this? @ejona86
I'm also experiencing a very similar issue. When using a gRPC channel and an async stub to send data, the first few onNext(data) calls take a lot of time (~15ms), but afterwards each one takes much less (~2 to 3 ms).
I'd like all of them to take as little time as possible. Do you have any suggestions?
I also tried to "warm up" the channel by sending a dummy message before the real work, but it doesn't seem to help. Am I doing it incorrectly?
Thank you
I also try to "warmup" the channel by sending a dummy message before the real work, but it doesn't seem to help. Am I doing it incorrectly?
@zzxgzgz I also do same thing, but it's working . Try to construct a large dummy enough message for warmup. cause grpc need netty allocate buffer memory, the dummy message should big than real message.
I also try to "warmup" the channel by sending a dummy message before the real work, but it doesn't seem to help. Am I doing it incorrectly?
@zzxgzgz I also do same thing, but it's working . Try to construct a large dummy enough message for warmup. cause grpc need netty allocate buffer memory, the dummy message should big than real message.
@inkinworld thank you for sharing. Let me try that.
Any reference implementation for constructing a large dummy warm-up message?
Make a message with a bytes field, and then just stick a byte array of the size you want in it. The encoded protobuf will be a few bytes larger than that byte array.
Any reference implementation for constructing a large dummy warm-up message?
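Not aware of a canonical reference implementation, but here is a minimal sketch of the suggestion above, reusing the GreeterGrpc/HelloRequest stubs from the earlier example in this thread. Padding the string name field (instead of a dedicated bytes field) and the 64 KiB size are assumptions chosen for illustration:

import io.grpc.ManagedChannel;
import io.grpc.examples.helloworld.GreeterGrpc;
import io.grpc.examples.helloworld.HelloRequest;
import io.grpc.netty.NettyChannelBuilder;
import java.util.Arrays;

public class ChannelWarmer {
  /** Sends one oversized dummy request so buffers get allocated before real traffic. */
  static void warmUp(ManagedChannel channel, int payloadBytes) {
    // A string field encodes on the wire much like a bytes field; padding it to
    // payloadBytes makes the dummy request at least as large as the real ones that follow.
    char[] padding = new char[payloadBytes];
    Arrays.fill(padding, 'x');
    HelloRequest dummy = HelloRequest.newBuilder().setName(new String(padding)).build();
    GreeterGrpc.newBlockingStub(channel).sayHello(dummy);
  }

  public static void main(String[] args) {
    ManagedChannel channel = NettyChannelBuilder.forAddress("127.0.0.1", 50051)
        .usePlaintext(true).build();
    warmUp(channel, 64 * 1024);  // 64 KiB dummy payload, picked arbitrarily
    // ... real RPCs ...
  }
}

Note that, per the discussion above, the buffer caches are per thread, so a warm-up call only warms the thread that makes it.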