
Throwing an exception from some StreamTracer methods hangs the RPC

Open · groakley opened this issue Aug 27 '20 · 10 comments

What version of gRPC-Java are you using?

Head

What is your environment?

Android

What did you expect to see?

If a client-implemented ClientStreamTracer method throws an exception, the RPC should fail.

What did you see instead?

The RPC never completes (blocks forever or the future never completes, depending on stub type).

Steps to reproduce the bug

I was able to reproduce this bug with both the InProcessChannel and Cronet transport implementations.

From grpc's GrpcServerRuleTest:


@Test
public void testTracer() throws Exception {
  TestServiceImpl testService = new TestServiceImpl();
  
  grpcServerRule1.getServiceRegistry().addService(testService);

  SimpleServiceGrpc.SimpleServiceBlockingStub stub =
      SimpleServiceGrpc.newBlockingStub(grpcServerRule1.getChannel());

  SimpleRequest request = SimpleRequest.getDefaultInstance();

  // With the stream tracer below throwing from streamClosed(), this blocking call never returns.
  SimpleResponse resp =
      stub.withInterceptors(
              new ClientInterceptor() {
                @Override
                public <ReqT, RespT> ClientCall<ReqT, RespT> interceptCall(
                    MethodDescriptor<ReqT, RespT> method, CallOptions callOptions, Channel next) {
                  return next.newCall(
                      method,
                      callOptions.withStreamTracerFactory(
                          new ClientStreamTracer.Factory() {
                            @Override
                            public ClientStreamTracer newClientStreamTracer(
                                ClientStreamTracer.StreamInfo info, Metadata headers) {
                              return new ClientStreamTracer() {
                                @Override
                                public void streamClosed(Status status) {
                                  throw new RuntimeException();
                                }
                              };
                            }
                          }));
                }
              })
          .unaryRpc(request);
  assertThat(resp).isEqualTo(SimpleResponse.getDefaultInstance());
}
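
The hang reported above shows up with a future stub as well; the sketch below (assuming the same throwing-tracer interceptor, held in a hypothetical throwingTracerInterceptor field) observes it with a bounded wait instead of blocking forever:

// Rough sketch, in the same test class as above: observing the hang through a future stub.
// "throwingTracerInterceptor" is a placeholder for the interceptor built in the test above.
@Test
public void testTracer_futureStubNeverCompletes() throws Exception {
  grpcServerRule1.getServiceRegistry().addService(new TestServiceImpl());

  SimpleServiceGrpc.SimpleServiceFutureStub futureStub =
      SimpleServiceGrpc.newFutureStub(grpcServerRule1.getChannel())
          .withInterceptors(throwingTracerInterceptor);

  ListenableFuture<SimpleResponse> future =
      futureStub.unaryRpc(SimpleRequest.getDefaultInstance());
  try {
    future.get(5, TimeUnit.SECONDS);
    fail("expected the RPC to hang with the current behavior");
  } catch (TimeoutException expected) {
    // With the bug present, the future never completes, so the bounded wait times out.
  }
}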

groakley · Aug 27 '20 21:08

Spoke with @groakley and he mentioned that @wanyingd1996 might be interested in taking this on. But apparently I can't directly assign them... (previous contributor in https://github.com/grpc/grpc-java/pull/7242)

ericgribkoff · Aug 31 '20 22:08

I reproduced the bug using InProcessChannel, and the trace shows the problem is inside the notifyClientClose method in InProcessTransport.java. The streamClosed method is invoked via clientStream.statsTraceCtx.streamClosed(clientStatus);, and if that call throws, notifyClientClose fails and the client is never notified that the call is closed. The same problem can be reproduced by throwing a RuntimeException from any method called by notifyClientClose. Unless there is another way to notify the client, it seems the only way to avoid a hanging RPC in this situation is to catch exceptions in these StreamTracer-related methods rather than let them propagate.
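
For illustration, one way an application can follow that "catch and do not throw" rule today is to wrap its own tracer so the callbacks never propagate exceptions into the transport. A rough sketch (the wrapper and its delegate are hypothetical, and only streamClosed is shown; a real wrapper would forward every callback):

import io.grpc.ClientStreamTracer;
import io.grpc.Status;
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch only: shields the transport from exceptions thrown by a user-supplied tracer.
final class NonThrowingClientStreamTracer extends ClientStreamTracer {
  private static final Logger logger =
      Logger.getLogger(NonThrowingClientStreamTracer.class.getName());

  private final ClientStreamTracer delegate;

  NonThrowingClientStreamTracer(ClientStreamTracer delegate) {
    this.delegate = delegate;
  }

  @Override
  public void streamClosed(Status status) {
    try {
      delegate.streamClosed(status);
    } catch (RuntimeException e) {
      // Log and swallow instead of letting the exception hang the RPC.
      logger.log(Level.WARNING, "StreamTracer.streamClosed threw", e);
    }
  }
}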

wanyingd1996 · Sep 03 '20 17:09

At a minimum, we'd want to make sure that any client-implemented method that throws doesn't hang the stub or prevent resources from being released.

Looking at InProcessTransport, one way to achieve that might be to modify

private void notifyClientClose(Status status, Metadata trailers) {
  Status clientStatus = cleanStatus(status, includeCauseWithStatus);
  synchronized (this) {
    if (closed) {
      return;
    }
    if (clientReceiveQueue.isEmpty()) {
      closed = true;
      clientStream.statsTraceCtx.clientInboundTrailers(trailers);
      clientStream.statsTraceCtx.streamClosed(clientStatus);
      clientStreamListener.closed(clientStatus, trailers);
    } else {
      clientNotifyStatus = clientStatus;
      clientNotifyTrailers = trailers;
    }
  }
  streamClosed();
}

to be

private void notifyClientClose(Status status, Metadata trailers) {
  Status clientStatus = cleanStatus(status, includeCauseWithStatus);
  synchronized (this) {
    if (closed) {
      return;
    }
    if (clientReceiveQueue.isEmpty()) {
      closed = true;
      // Intermediate locals keep the status captured by each lambda effectively final.
      Status afterTrailers = runUpdatingStatusOnFailure(
          clientStatus, () -> clientStream.statsTraceCtx.clientInboundTrailers(trailers));
      Status afterStreamClosed = runUpdatingStatusOnFailure(
          afterTrailers, () -> clientStream.statsTraceCtx.streamClosed(afterTrailers));
      runUpdatingStatusOnFailure(
          afterStreamClosed, () -> clientStreamListener.closed(afterStreamClosed, trailers));
    } else {
      clientNotifyStatus = clientStatus;
      clientNotifyTrailers = trailers;
    }
  }

  streamClosed();
}

private Status runUpdatingStatusOnFailure(Status status, Runnable command) {
  try {
    command.run();
  } catch (Throwable t) {
    return cleanStatus(Status.fromThrowable(t), includeCauseWithStatus);
  }
  return status;
}
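
With a change along these lines, the repro above should fail instead of hanging. A rough sketch of what the blocking-stub caller would then see (throwingTracerInterceptor is a placeholder for the interceptor from the repro; the UNKNOWN code assumes Status.fromThrowable maps a bare RuntimeException to UNKNOWN, which cleanStatus preserves):

// Sketch of expected post-fix behavior for the repro test above (not verified output).
try {
  stub.withInterceptors(throwingTracerInterceptor).unaryRpc(request);
  fail("expected the RPC to fail once the tracer throws");
} catch (StatusRuntimeException e) {
  assertThat(e.getStatus().getCode()).isEqualTo(Status.Code.UNKNOWN);
}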

groakley · Sep 03 '20 23:09

@ericgribkoff, do you think we should go with the fix suggested by @groakley?

wanyingd1996 · Sep 08 '20 17:09

> At a minimum, we'd want to make sure that any client-implemented method that throws doesn't hang the stub or prevent resources from being released.

Yes. We also will want to update the javadoc for StreamTracer to say that implementations must not throw.
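
A rough sketch of what such a javadoc note might say (illustrative wording only, not the final text):

/**
 * (existing StreamTracer class javadoc)
 *
 * <p>Implementations must not throw from these callbacks. An exception escaping any
 * StreamTracer method may leave the RPC, and potentially its transport, in a broken state.
 */
public abstract class StreamTracer {
  // existing methods unchanged
}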

Regarding the runUpdatingStatusOnFailure approach: that would fix the hanging, but we can and should be more stringent about dealing with an exception when one is thrown. At the least we should shut down the transport if any StreamTracer throws (moving the entire channel into a panic state might be even better, but could be tricky to actually accomplish).

@wanyingd1996 Wrapping the calls to statsTraceCtx methods in try/catch blocks should work (I suspect bad things can happen if a stream tracer throws even outside of the notifyClientClose call). There are 18 calls to statsTraceCtx methods in InProcessTransport, plus more in AbstractClientStream and a few in Netty/OkHttp-specific code, so that approach might be somewhat unwieldy. Catching any exception somewhere higher up in the call stack (whether it comes from a stream tracer or from a bug in grpc code itself) and responding by shutting down the transport or panicking the channel with a clear error message could be a cleaner approach, but I haven't looked into the details of where/how this could best be done.
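
As a very rough illustration of the "catch once, higher up" idea (not actual grpc-java code; it assumes a shutdownNow(Status)-style teardown is reachable from that point in the transport):

// Hypothetical wrapper around a stream-close path: any exception escaping a StreamTracer
// (or a grpc bug) is converted into a transport teardown with a clear status instead of a hang.
private void notifyClientCloseSafely(Status status, Metadata trailers) {
  try {
    notifyClientClose(status, trailers);
  } catch (Throwable t) {
    shutdownNow(Status.INTERNAL
        .withDescription("Exception thrown while closing stream; shutting down transport")
        .withCause(t));
  }
}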

ericgribkoff · Sep 09 '20 04:09

@ericgribkoff I think catching the exception in StatsTraceContext instead of in notifyClientClose would be cleaner, but it seems I can't shut down the transport from StatsTraceContext. I am fairly new to grpc; do you have any suggestions?

wanyingd1996 · Sep 09 '20 19:09

@wanyingd1996 Are you still interested in working on this issue?

larry-safran · Jun 22 '23 17:06

> @wanyingd1996 Are you still interested in working on this issue?

Hi Larry, I'm not working on grpc anymore and I don't have the bandwidth for this bug now. Feel free to take it. Thanks!

wanyingd1996 · Jun 22 '23 17:06

Started analysis, going through the repro steps.

SreeramdasLavanya · Oct 03 '24 13:10

Here is my understanding of the issue. The suggestion is that the transport should be shut down when any of the StreamTracer methods throws an exception, to prevent the subchannel from hanging. Since StatsTraceContext wraps the calls to the StreamTracers, the exception handling can live in that class.

StatsTraceContext is used on both the client side and the server side, for the respective transports, and it calls the stream tracer methods to collect trace information about that transport. Only ServerTransport has a shutdown method, not ClientTransport, so shutting down the transport only applies when a server-side stream tracer throws an exception (InProcessTransport being the exception, as noted below). For other transports, what should StatsTraceContext do after catching an exception from a client StreamTracer?

When creating the StatsTraceContext on the server side from a transport implementation, we could pass ServerTransport.this as an argument and store it as a field in StatsTraceContext, so that shutdown can be called on it when any of the server-side StreamTracer methods throws an exception.

The example code given in this issue uses InProcessTransport and client StreamTracers. Since client and server share the same process, the server close sequence itself calls the client stream tracers:

streamClosed:170, StatsTraceContext (io.grpc.internal)
lambda$notifyClientClose$6:616, InProcessTransport$InProcessStream$InProcessServerStream (io.grpc.inprocess)
run:-1, InProcessTransport$InProcessStream$InProcessServerStream$$Lambda/0x00007f631f163248 (io.grpc.inprocess)
runUpdatingStatusOnFailure:633, InProcessTransport$InProcessStream$InProcessServerStream (io.grpc.inprocess)
notifyClientClose:615, InProcessTransport$InProcessStream$InProcessServerStream (io.grpc.inprocess)
close:601, InProcessTransport$InProcessStream$InProcessServerStream (io.grpc.inprocess)
closeInternal:227, ServerCallImpl (io.grpc.internal)
close:213, ServerCallImpl (io.grpc.internal)
onCompleted:395, ServerCalls$ServerCallStreamObserverImpl (io.grpc.stub)
unaryRpc:255, GrpcServerRuleTest$TestServiceImpl (io.grpc.testing)
invoke:410, SimpleServiceGrpc$MethodHandlers (io.grpc.testing.protobuf)
onHalfClose:182, ServerCalls$UnaryServerCallHandler$UnaryServerCallListener (io.grpc.stub)
halfClosed:356, ServerCallImpl$ServerStreamListenerImpl (io.grpc.internal)
runInContext:861, ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed (io.grpc.internal)

Only for InProcessTransport would we also shut down the transport when a client StreamTracer method throws; for that case we could store the server transport in the StatsTraceContext created for the client side as well.
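
A rough sketch of that idea (hypothetical names throughout; the real StatsTraceContext constructor and fields differ):

import io.grpc.Status;
import io.grpc.StreamTracer;
import javax.annotation.Nullable;

// Hypothetical illustration: the context holds an optional shutdown hook supplied by the
// transport that created it, and invokes the hook when a tracer callback throws.
final class StatsTraceContextSketch {
  interface TransportShutdownHook {
    void shutdownNow(Status reason);  // e.g. backed by ServerTransport.shutdownNow(Status)
  }

  private final StreamTracer[] tracers;
  @Nullable private final TransportShutdownHook shutdownHook;

  StatsTraceContextSketch(StreamTracer[] tracers, @Nullable TransportShutdownHook shutdownHook) {
    this.tracers = tracers;
    this.shutdownHook = shutdownHook;
  }

  void streamClosed(Status status) {
    for (StreamTracer tracer : tracers) {
      try {
        tracer.streamClosed(status);
      } catch (RuntimeException e) {
        if (shutdownHook != null) {
          // Server side (and InProcessTransport): tear the transport down with a clear reason.
          shutdownHook.shutdownNow(Status.INTERNAL
              .withDescription("StreamTracer.streamClosed threw")
              .withCause(e));
        }
        // Open question from the thread: what should a pure client-side transport do here?
      }
    }
  }
}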

@ericgribkoff please confirm.

kannanjgithub · Oct 16 '24 09:10