
blockingUnaryCall with withDeadlineAfter: RPC request hangs forever

Open jianliu opened this issue 6 months ago • 1 comments

What version of gRPC-Java are you using?

1.50.0, using grpc-all (shaded)

What is your environment?

Linux CentOS 7, JDK 8

What did you expect to see?

The call to blockingUnaryCall with withDeadlineAfter should not block forever.

What did you see instead?

A request with a 30-second deadline hangs forever; the calling thread never returns.

The connection code:

ManagedChannelBuilder<?> builder = ManagedChannelBuilder.forAddress(connID.getHost(), connID.getPort())
        .usePlaintext()
        .maxInboundMessageSize(GrpcUtil.MAX_INBOUND_MESSAGE_SIZE)
        .keepAliveTime(2, TimeUnit.MINUTES)
        .keepAliveWithoutCalls(true);

ManagedChannel channel = builder.build();

The code will automatically disconnect and reconnect every 10 minutes.
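
The reconnect code itself was not posted in the issue, so purely as a hypothetical sketch (the AtomicReference holder, the executor, and the exact swap logic are assumptions for illustration), such a periodic reconnect often looks roughly like this:

// Hypothetical sketch only; the reporter's actual reconnect logic was not shared.
AtomicReference<ManagedChannel> channelRef = new AtomicReference<>(builder.build());
ScheduledExecutorService reconnector = Executors.newSingleThreadScheduledExecutor();
reconnector.scheduleAtFixedRate(() -> {
    // Build a fresh channel, swap it in, then shut the old one down; RPCs already in
    // flight on the old channel keep running until they complete or fail.
    ManagedChannel old = channelRef.getAndSet(builder.build());
    old.shutdown();
}, 10, 10, TimeUnit.MINUTES);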

The request code:

PolarisGRPCGrpc.PolarisGRPCBlockingStub stub = PolarisGRPCGrpc.newBlockingStub(connection.getChannel())
        .withDeadlineAfter(30000, TimeUnit.MILLISECONDS);
GrpcUtil.attachRequestHeader(stub, GrpcUtil.nextHeartbeatReqId());
ResponseProto.Response heartbeatResponse = stub.heartbeat(buildHeartbeatRequest(req));
GrpcUtil.checkResponse(heartbeatResponse);

The stack trace of the hung thread:

"polaris-async-register-1" #463 prio=5 os_prio=0 cpu=107472.49ms elapsed=613744.43s tid=0x00007fd3818d8000 nid=0x31c waiting on condition  [0x00007fd38519f000]
  java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@…/Native Method)
	- parking to wait for  <0x00000006b241b5b0> (a shade.polaris.io.grpc.stub.ClientCalls$ThreadlessExecutor)
	at java.util.concurrent.locks.LockSupport.park(java.base@…/LockSupport.java:194)
	at shade.polaris.io.grpc.stub.ClientCalls$ThreadlessExecutor.waitAndDrain(ClientCalls.java:748)
	at shade.polaris.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:157)
	at com.tencent.polaris.specification.api.v1.service.manage.PolarisGRPCGrpc$PolarisGRPCBlockingStub.heartbeat(PolarisGRPCGrpc.java:432)
	at com.tencent.polaris.plugins.connector.grpc.GrpcConnector.heartbeat(GrpcConnector.java:547)
	at com.tencent.polaris.discovery.client.api.DefaultProviderAPI.heartbeat(DefaultProviderAPI.java:164)
	at com.tencent.polaris.discovery.client.api.DefaultProviderAPI$$Lambda$1295/0x0000000801249840.doHeartbeat(Unknown Source)
	at com.tencent.polaris.discovery.client.flow.RegisterFlow.doRunHeartbeat(RegisterFlow.java:80)
	at com.tencent.polaris.discovery.client.flow.RegisterFlow.lambda$registerInstance$0(RegisterFlow.java:66)
	at com.tencent.polaris.discovery.client.flow.RegisterFlow$$Lambda$1298/0x000000080124fc40.run(Unknown Source)
	at java.util.concurrent.Executors$RunnableAdapter.call(java.base@…/Executors.java:515)
	at java.util.concurrent.FutureTask.runAndReset(java.base@…/FutureTask.java:305)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(java.base@…/ScheduledThreadPoolExecutor.java:305)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@…/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@…/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(java.base@…/Thread.java:829)

PS: I looked at https://github.com/grpc/grpc-java/issues/10838 and https://github.com/grpc/grpc-java/issues/10336. Those issues involved retry configuration, but I have not made any retry-related configuration.

Steps to reproduce the bug

I tried disconnecting the network, inducing packet loss, etc., but could not reproduce the problem.

jianliu avatar Jun 26 '25 08:06 jianliu

Looking forward to your reply. As I noted, is there any guarantee at the JVM level (such as responseFuture.get(timeout, unit)) to reduce the risk of hangs under extreme network conditions? While searching the issue tracker I found several similar reports.
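
For illustration, a hedged sketch of the kind of caller-side guarantee being asked about here (this is not an official recommendation; it reuses connection, req, buildHeartbeatRequest, and GrpcUtil from the snippets above and assumes the generated PolarisGRPCFutureStub):

PolarisGRPCGrpc.PolarisGRPCFutureStub futureStub = PolarisGRPCGrpc.newFutureStub(connection.getChannel())
        .withDeadlineAfter(30, TimeUnit.SECONDS);
ListenableFuture<ResponseProto.Response> future = futureStub.heartbeat(buildHeartbeatRequest(req));
try {
    // Wait slightly longer than the RPC deadline; even if the underlying bug prevents
    // onClose() from ever being delivered, this get() still returns control to the caller.
    ResponseProto.Response heartbeatResponse = future.get(35, TimeUnit.SECONDS);
    GrpcUtil.checkResponse(heartbeatResponse);
} catch (TimeoutException e) {
    future.cancel(true); // best-effort cancellation of the RPC
    throw new IllegalStateException("heartbeat did not complete within the client-side timeout", e);
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
    future.cancel(true);
    throw new IllegalStateException("interrupted while waiting for heartbeat", e);
} catch (ExecutionException e) {
    throw new IllegalStateException("heartbeat RPC failed", e);
}

Unlike blockingUnaryCall, the future stub does not rely on the calling thread draining a ThreadlessExecutor, so a bounded get() cannot park forever.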

The code that hangs is in io.grpc.stub.ClientCalls:

public static <ReqT, RespT> RespT blockingUnaryCall(
    Channel channel, MethodDescriptor<ReqT, RespT> method, CallOptions callOptions, ReqT req) {
  ThreadlessExecutor executor = new ThreadlessExecutor();
  boolean interrupt = false;
  ClientCall<ReqT, RespT> call = channel.newCall(method,
      callOptions.withOption(ClientCalls.STUB_TYPE_OPTION, StubType.BLOCKING)
          .withExecutor(executor));
  try {
    ListenableFuture<RespT> responseFuture = futureUnaryCall(call, req);
    // My note: I think the problem lies here. To ensure the blocking call cannot
    // block the calling thread forever, could we first check whether a deadline is
    // set and, if so, use responseFuture.get(timeout, unit) instead?
    while (!responseFuture.isDone()) {
      try {
        executor.waitAndDrain();
      } catch (InterruptedException e) {
        interrupt = true;
        call.cancel("Thread interrupted", e);
        // Now wait for onClose() to be called, so interceptors can clean up
      }
    }
    executor.shutdown();
    return getUnchecked(responseFuture);
  } catch (RuntimeException e) {
    // Something very bad happened. All bets are off; it may be dangerous to wait for onClose().
    throw cancelThrow(call, e);
  } catch (Error e) {
    // Something very bad happened. All bets are off; it may be dangerous to wait for onClose().
    throw cancelThrow(call, e);
  } finally {
    if (interrupt) {
      Thread.currentThread().interrupt();
    }
  }
}

jianliu avatar Jun 26 '25 08:06 jianliu

We recently had an issue where ClientCalls.blockingUnaryCall would get stuck waiting when the message framer threw an exception because the message size exceeded the maximum outbound message size set on the stub. Can you check whether that is the case when the hang occurs?
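
A hedged sketch of the kind of check being suggested here (the explicit 4 MiB cap and the logging are illustrative; withMaxOutboundMessageSize on the stub and protobuf's getSerializedSize() are the relevant public APIs, and connection, req, and buildHeartbeatRequest come from the snippets earlier in the issue):

// Illustrative only: make the outbound limit explicit and log the serialized request size.
PolarisGRPCGrpc.PolarisGRPCBlockingStub sizedStub = PolarisGRPCGrpc.newBlockingStub(connection.getChannel())
        .withMaxOutboundMessageSize(4 * 1024 * 1024);  // hypothetical 4 MiB cap, so the limit is visible
int requestBytes = buildHeartbeatRequest(req).getSerializedSize();  // protobuf messages report their serialized size
System.out.println("heartbeat request size = " + requestBytes + " bytes");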

kannanjgithub avatar Jun 30 '25 07:06 kannanjgithub

Thank you very much for your reply. The scenario where I use blockingUnaryCall is sending microservice heartbeats, and I am very sure the registration data is less than 5 KB. There are 10 nodes in total; the problem occurs on one of them, and only after thousands of calls.

jianliu avatar Jun 30 '25 12:06 jianliu

What you suggested in the comments above while (!responseFuture.isDone()) may indeed be the fix for the case where the runnable for the pending call dies abruptly. Deadline handling already exists in the call implementation, so there must be some bug causing the runnable executing the pending call to die abruptly. In the issue I linked, the cause was known to be the message framer exception; in your case it is not clear what is causing it. Is the problem happening only when you use withDeadlineAfter, and not otherwise?

kannanjgithub avatar Jul 02 '25 10:07 kannanjgithub

I noticed your version of gRPC-Java is old. Can you try with the latest released version and see if it still is an issue there?

kannanjgithub avatar Jul 03 '25 04:07 kannanjgithub

> What you suggested in the comments above while (!responseFuture.isDone()) may indeed be the fix for the case where the runnable for the pending call dies abruptly. Deadline handling already exists in the call implementation, so there must be some bug causing the runnable executing the pending call to die abruptly. In the issue I linked, the cause was known to be the message framer exception; in your case it is not clear what is causing it. Is the problem happening only when you use withDeadlineAfter, and not otherwise?

In fact, I originally called the method without withDeadlineAfter. After the stuck-thread problem occurred, I added withDeadlineAfter and the keepalive mechanism. The problem still happens, but overall the probability seems to have become smaller.

jianliu avatar Jul 03 '25 06:07 jianliu

> I noticed your version of gRPC-Java is old. Can you try with the latest released version and see if it still is an issue there?

It is a bit difficult to upgrade the SDK version. We developed the register-center component on top of gRPC and provide it to the business side as an SDK, so we cannot make major changes to the SDK version. Our production environment has tens of thousands of nodes, and only 2-3 of them have problems.

jianliu avatar Jul 03 '25 07:07 jianliu

I don't know if this is precisely the same problem I am seeing, but the symptoms are equivalent.

I'm running 1.73.0. The hangs are not limited to unary or blocking calls (async client-stream, future unary, and blocking server-stream are all affected). I've traced all of the open async-stub client-streaming requests and can see that:

  • They all have deadlines set, and the deadlines are in the future when the call is made.
  • The streams are waiting for ready before calling onNext; the registered onReady handler is never called, and the stream is never observably ready.
  • Adding a ClientStreamTracer through the channel's build procedure and matching streams to the hung requests shows that they received streamCreated but no other signals, including outboundHeaders, which is the next signal I see on all successful requests in other contexts.

Edit: I should also mention that I can reproduce this reliably. It occurs in Bazel (hand-upgraded to 1.73.0), and I would be happy to try anything to get around this.

I am trying to trace the path from the point where streamCreated is called to the point where outboundHeaders should have been invoked, to determine what could be stuck without the deadline ever being applied. It is extremely slow going through Channel -> Transport -> Stream -> Handler (I think that's the order), and I will be adding further callbacks to the StreamTracer to determine where the sticking point is.
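
For reference, this is roughly how a tracer like the one described above can be attached through the channel's build procedure. ClientStreamTracer, CallOptions.withStreamTracerFactory, and ManagedChannelBuilder.intercept are public gRPC APIs; the println logging and the host/port placeholders are illustrative:

// Log the lifecycle signals of every client stream created on the channel.
ClientStreamTracer.Factory tracerFactory = new ClientStreamTracer.Factory() {
    @Override
    public ClientStreamTracer newClientStreamTracer(ClientStreamTracer.StreamInfo info, Metadata headers) {
        return new ClientStreamTracer() {
            @Override
            public void streamCreated(Attributes transportAttrs, Metadata metadata) {
                System.out.println("streamCreated");
            }

            @Override
            public void outboundHeaders() {
                System.out.println("outboundHeaders");
            }

            @Override
            public void streamClosed(Status status) {
                System.out.println("streamClosed: " + status);
            }
        };
    }
};

// Attach the factory to every call made on the channel via an interceptor.
ClientInterceptor tracingInterceptor = new ClientInterceptor() {
    @Override
    public <ReqT, RespT> ClientCall<ReqT, RespT> interceptCall(
            MethodDescriptor<ReqT, RespT> method, CallOptions callOptions, Channel next) {
        return next.newCall(method, callOptions.withStreamTracerFactory(tracerFactory));
    }
};

ManagedChannel tracedChannel = ManagedChannelBuilder.forAddress(host, port)
        .usePlaintext()
        .intercept(tracingInterceptor)
        .build();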

werkt avatar Jul 08 '25 17:07 werkt

My problem, at least, stems from reaching this point with a null http2Stream in every case that is hung: https://github.com/grpc/grpc-java/blob/master/netty/src/main/java/io/grpc/netty/NettyClientHandler.java#L742-L745, and never receiving any further notification on the stream. I am pretty sure this implies that the RST_STREAM that Netty is indicated to have sent (or to be sending) does not exist.

werkt avatar Jul 09 '25 19:07 werkt

Regarding https://github.com/grpc/grpc-java/blob/master/netty/src/main/java/io/grpc/netty/NettyClientHandler.java#L726-L727: do you mean that as long as the stream has not ended, it will hang forever after RST_STREAM occurs?

jianliu avatar Jul 14 '25 08:07 jianliu

> Regarding https://github.com/grpc/grpc-java/blob/master/netty/src/main/java/io/grpc/netty/NettyClientHandler.java#L726-L727: do you mean that as long as the stream has not ended, it will hang forever after RST_STREAM occurs?

That's a characterization that I can't assert happened - I don't know that RST_STREAM occurred, because I have difficulty seeing what happens at the netty layers on the channel/stream. All I can prove here through observation is:

  • A grpc stub call was initiated to reach registration of this channelFuture callback.
  • This channelFuture callback was invoked.
  • Requesting the streamId from the connection() returns null.
  • No further work is done (client outbound headers are not sent) because the http2stream is null.
  • Specifying that the promise is successful results in the grpc call never receiving an onReady, response, completed, or error callback.

This is indicated in my comment in the PR. This callback relies on something else in netty to cancel the call in the null stream case, and it is not doing so.

werkt avatar Jul 14 '25 11:07 werkt

@werkt made a great discovery in #12207, so now we check whether y'all were seeing the same thing.

@jianliu, when you saw this, was the connection to a server across the internet, or over a slow/unreliable connection? If you are seeing this between two computers in a single datacenter, then it is almost certainly different from what is being fixed in #12207. I see @jianliu is suffering from a failing heartbeat, which I assume is a pretty small message, so we'd really need some sort of network slowness/glitch at just the wrong time to trigger the same bug.

ejona86 avatar Jul 15 '25 18:07 ejona86