
channel warm up per thread

Open LeeYeeze opened this issue 8 years ago • 18 comments

Hi, all. I'm experimenting with gRPC and have found that the first gRPC call made from a thread costs much more time than subsequent calls doing the same thing. I tested this on a 2-core laptop and on 24-core machines, and the same phenomenon occurred on all of them. So I'm wondering whether I'm using gRPC wrong or whether something in grpc-java could be improved to avoid this situation. Below are my code and part of the report.

import java.text.SimpleDateFormat;
import java.util.Date;

import me.gdbc.mocker.MockServer;
import me.proto.query.QueryProtocol.QueryRequest;
import me.proto.query.QueryProtocol.QueryResponse;

import me.proto.query.QueryServiceInterfaceGrpc.METHOD_QUERY;

import io.grpc.CallOptions;
import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;

public class WarmUp {

    public static SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    public static QueryRequest request = QueryRequest.newBuilder().setQueryString("").build();

    public static class QueryThread extends Thread {

        private String name;
        private ManagedChannel client;

        public QueryThread(String name, ManagedChannel client) {
            this.name = name;
            this.client = client;
        }

        public void run() {
            for (int i = 0; i < 300; i++) {
                long start = System.currentTimeMillis();
                try {
                    //ClientCall<QueryRequest, QueryResponse> call = client.newCall(METHOD_QUERY, CallOptions.DEFAULT);
                    QueryResponse ret = io.grpc.stub.ClientCalls.blockingUnaryCall(client, METHOD_QUERY, CallOptions.DEFAULT, request);
                    //QueryResponse ret = io.grpc.stub.ClientCalls.blockingUnaryCall(call, request);
                    long duration = System.currentTimeMillis() - start;
                    String show = "" + ret + "\n" + df.format(new Date()) + "\n" + "cost " + duration + "ms in thread " + name;
                    System.out.println(show);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }

    public static void main(String[] args) {
        int port = 9012;
        MockServer server = new MockServer(port);
        try {
            server.start();
            ManagedChannel channel = NettyChannelBuilder.forAddress("127.0.0.1", port).usePlaintext(true).build();
            io.grpc.stub.ClientCalls.blockingUnaryCall(channel, METHOD_QUERY, CallOptions.DEFAULT, request);
            for (int i = 0; i < 4; i++) {
                QueryThread th = new QueryThread(String.valueOf(i), channel);
                th.start();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
May 01, 2016 2:28:09 PM me.gdbc.mocker.MockServer start
INFO: Server started, listenning on 9012
May 01, 2016 2:28:09 PM io.grpc.internal.ManagedChannelImpl <init>
INFO: [ManagedChannelImpl@326de728] Created with target 127.0.0.1:9012
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 76ms in thread 0
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 67ms in thread 1
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 68ms in thread 3
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 66ms in thread 2
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 8ms in thread 0
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 8ms in thread 1
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 10ms in thread 3
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 10ms in thread 2
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 7ms in thread 1
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 7ms in thread 0
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 8ms in thread 3
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 10ms in thread 2
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 7ms in thread 1
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 7ms in thread 0
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 7ms in thread 3
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 4ms in thread 1
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 6ms in thread 0
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 7ms in thread 3
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 7ms in thread 1
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 7ms in thread 0
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 6ms in thread 3
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 6ms in thread 1
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 6ms in thread 0
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 6ms in thread 3
result {
  props {
    intVal: 1
  }
}

LeeYeeze avatar May 01 '16 06:05 LeeYeeze

A few things:

  • You should measure this kind of time with System.nanoTime() (see the sketch below).
  • The JVM needs a large number of iterations before it JIT-compiles code; this can be as many as 10K iterations, which it doesn't look like your test reaches.
  • Please run the benchmark code (in benchmarks/...) to get an idea of what a reasonable latency is.
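
For reference, a minimal sketch of that measurement style (the stub call mentioned in the comment is just a placeholder, not part of this project):

import java.util.concurrent.TimeUnit;

public final class Timing {
    // Returns the elapsed wall-clock time of a single call in microseconds.
    public static long timeMicros(Runnable call) {
        long start = System.nanoTime();
        call.run();  // e.g. () -> stub.sayHello(request)
        return TimeUnit.NANOSECONDS.toMicros(System.nanoTime() - start);
    }
}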

carl-mastrangelo avatar May 02 '16 15:05 carl-mastrangelo

Hi Carl. This is not a benchmark test, and I'm not concerned with what a reasonable latency is right now. What bothers me is the huge difference in latency between the first gRPC call and the following ones in the same thread. Usually the first one takes about 100 ms and the subsequent ones only take about 5 ms to do the same thing (the difference is clear even without measuring in nanoseconds). And it doesn't seem to be a JIT problem, since the test doesn't run enough iterations to hit it at all.

LeeYeeze avatar May 02 '16 16:05 LeeYeeze

Could it be because the actual connection is only made when the first call comes through, and subsequent calls don't pay the connection setup penalty?
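
If lazy connection setup were the cause, one way to rule it out is to establish the connection before measuring, either with a throwaway RPC or (in more recent grpc-java releases) by asking the channel to connect eagerly. A minimal sketch, assuming a grpc-java version that supports getState:

import io.grpc.ConnectivityState;
import io.grpc.ManagedChannel;

public final class ConnectionWarmUp {
    // Ask the channel to start connecting now rather than on the first RPC,
    // then poll until it reports READY. A single warm-up RPC achieves the same thing.
    public static void warmUp(ManagedChannel channel) throws InterruptedException {
        ConnectivityState state = channel.getState(true);  // true = request a connection
        while (state != ConnectivityState.READY) {
            Thread.sleep(10);
            state = channel.getState(false);
        }
    }
}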


makdharma avatar May 02 '16 16:05 makdharma

@makdharma, that seems very likely.

@LeeYeeze, the point is that you aren't really isolating "the first RPC on a thread"; what you're measuring is closer to "gRPC is slow to start". Since all the threads are started at the same time, I'd expect them to have similar performance. To show that it is creating a new thread that is slow, you'd need to create new threads throughout the duration of the test.

ejona86 avatar May 02 '16 16:05 ejona86

@makdharma @ejona86 I invoke a blockingUnaryCall before I start all the threads, so the connection is already set up. I also tested with threads started at separate times throughout the duration of the test; the results are almost the same, except that without contention the first call costs less time, though it is still slow. My laptop is not at hand right now, so I will paste the report here later.

LeeYeeze avatar May 02 '16 16:05 LeeYeeze

@LeeYeeze, I've reproduced what you see. I'm on Linux. Swapping my CPU governor from ondemand to performance didn't change behavior much.

I modified your code, hitting the hello-world-server:

import io.grpc.CallOptions;
import io.grpc.ManagedChannel;
import io.grpc.examples.helloworld.GreeterGrpc;
import io.grpc.examples.helloworld.HelloReply;
import io.grpc.examples.helloworld.HelloRequest;
import io.grpc.netty.NettyChannelBuilder;
import io.grpc.stub.ClientCalls;

import java.util.Date;
import java.util.concurrent.TimeUnit;

public class ThreadWarmUp {
  private static final HelloRequest request = HelloRequest.newBuilder().setName("world").build();

  public static class QueryThread extends Thread {
    private String name;
    private ManagedChannel channel;

    public QueryThread(String name, ManagedChannel channel) {
      this.name = name;
      this.channel = channel;
    }

    public void run() {
      final int type = 0;
      for (int i = 0 ; i < 3; i++) {
        long start = System.nanoTime();
        try {
          if (type == 0) {
            ClientCalls.blockingUnaryCall(
                channel, GreeterGrpc.METHOD_SAY_HELLO, CallOptions.DEFAULT, request);
          } else if (type == 1) {
            ClientCalls.futureUnaryCall(
                channel.newCall(GreeterGrpc.METHOD_SAY_HELLO, CallOptions.DEFAULT), request).get();
          }
          long duration = TimeUnit.NANOSECONDS.toMicros(System.nanoTime() - start);
          String show  = "thread " + name + "\tcost " + duration + "µs";
          System.out.println(show);
        } catch (Exception e) {
          e.printStackTrace();
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    ManagedChannel channel = NettyChannelBuilder.forAddress("127.0.0.1", 50051)
        .usePlaintext(true).build();
    GreeterGrpc.newBlockingStub(channel).sayHello(request);
    for (int i = 0; i < 100; i++) {
      QueryThread th = new QueryThread(String.valueOf(i), channel);
      th.start();
      Thread.sleep(100);
    }
  }
}

It results in things like:

thread 0    cost 7718µs
thread 0    cost 1816µs
thread 0    cost 1743µs
thread 1    cost 7718µs
thread 1    cost 1850µs
thread 1    cost 1852µs
thread 2    cost 7914µs
thread 2    cost 1879µs
thread 2    cost 1729µs
thread 3    cost 8246µs
thread 3    cost 2064µs
thread 3    cost 1875µs
...
thread 96   cost 1507µs
thread 96   cost 1127µs
thread 96   cost 1066µs
thread 97   cost 1815µs
thread 97   cost 1169µs
thread 97   cost 1038µs
thread 98   cost 1712µs
thread 98   cost 960µs
thread 98   cost 902µs
thread 99   cost 1703µs
thread 99   cost 1105µs
thread 99   cost 1234µs

ejona86 avatar May 02 '16 18:05 ejona86

@ejona86 I've found that the main time consumer in this per-thread warm-up is the allocate method in NettyWritableBufferAllocator. Specifically, it's the call to allocator.buffer() inside that method that is responsible for the delay.
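
A plausible explanation is that Netty's pooled allocator keeps per-thread caches and arena bindings that are initialized lazily on a thread's first allocation, so the first buffer allocation on each new thread pays that setup cost. A standalone sketch that isolates just the allocator (not gRPC itself):

import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

public final class AllocatorWarmUp {
    public static void main(String[] args) throws Exception {
        for (int t = 0; t < 4; t++) {
            Thread thread = new Thread(() -> {
                for (int i = 0; i < 3; i++) {
                    long start = System.nanoTime();
                    // The first allocation on a new thread initializes its thread-local cache.
                    ByteBuf buf = PooledByteBufAllocator.DEFAULT.buffer(4096);
                    long micros = (System.nanoTime() - start) / 1000;
                    System.out.println(Thread.currentThread().getName()
                        + " allocation " + i + " took " + micros + "µs");
                    buf.release();
                }
            });
            thread.start();
            thread.join();
        }
    }
}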

LeeYeeze avatar May 04 '16 13:05 LeeYeeze

We hit the same problem in our project using TensorFlow Serving with Java gRPC clients. Is there any way to fix this? @ejona86

tobegit3hub avatar Apr 28 '17 01:04 tobegit3hub

Please let us know of any fix or workaround for this issue. @ejona86

ferrari6666 avatar Oct 04 '19 09:10 ferrari6666

I wonder if the latest grpc-java framework has fixed this problem?

zxiang179 avatar Oct 16 '20 05:10 zxiang179

Any updates on this? @ejona86

Curious if this is similar to https://github.com/grpc/grpc-java/issues/2971

gary-lo avatar Feb 20 '21 01:02 gary-lo

Any update on this? @ejona86

I'm also experiencing a very similar issue. When using a gRPC channel and an async stub to send data, the first few onNext(data) calls take a lot of time (~15 ms), but afterwards they take much less time (~2 to 3 ms).

I'd like to make all of them take as little time as possible. Do you have any suggestions?

I also tried to "warm up" the channel by sending a dummy message before the real work, but it doesn't seem to help. Am I doing it incorrectly?

Thank you

zzxgzgz avatar Nov 03 '21 23:11 zzxgzgz

I also tried to "warm up" the channel by sending a dummy message before the real work, but it doesn't seem to help. Am I doing it incorrectly?

@zzxgzgz I do the same thing, and it works for me. Try constructing a large enough dummy message for the warm-up: since gRPC needs Netty to allocate buffer memory, the dummy message should be bigger than the real messages.

inkinworld avatar Feb 24 '22 06:02 inkinworld

I also tried to "warm up" the channel by sending a dummy message before the real work, but it doesn't seem to help. Am I doing it incorrectly?

@zzxgzgz I do the same thing, and it works for me. Try constructing a large enough dummy message for the warm-up: since gRPC needs Netty to allocate buffer memory, the dummy message should be bigger than the real messages.

@inkinworld thank you for sharing. Let me try that.

zzxgzgz avatar Feb 24 '22 17:02 zzxgzgz

Any reference implementation for constructing a large dummy warm-up message?

josephykwang avatar Mar 29 '22 19:03 josephykwang

Make a message with a bytes field, and then just stick in a byte array of the size you want. The encoded protobuf will be a few bytes larger than that byte array.
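
For example, a minimal sketch, assuming a hypothetical WarmupRequest message declared with a "bytes payload = 1;" field and a matching stub method (the names here are placeholders, not part of any published API):

import com.google.protobuf.ByteString;

public final class WarmupMessages {
    // Build and send a dummy request whose encoded size is a bit larger than the
    // real messages, so Netty's buffer pools are sized for the real traffic.
    public static void sendWarmup(MyServiceGrpc.MyServiceBlockingStub stub, int payloadSize) {
        WarmupRequest dummy = WarmupRequest.newBuilder()
            .setPayload(ByteString.copyFrom(new byte[payloadSize]))
            .build();
        stub.warmUp(dummy);  // one throwaway call before real traffic
    }
}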

ejona86 avatar Mar 29 '22 21:03 ejona86

Any reference implementation for constructing a large dummy warm-up message?

313hemant313 avatar Jul 13 '23 09:07 313hemant313