
gRPC Client Implementation

Open spericas opened this issue 1 year ago • 6 comments

Description

New gRPC Client API and implementation for SE. Issue #7060.

spericas avatar Feb 26 '24 18:02 spericas

After plugging the client into Coherence I am seeing a lot of severe-level log messages due to timeouts. We use a bidirectional channel for Coherence events: the client makes the bi-di request, which creates the bi-di channel; the client can then send messages up that channel to the server, and the server can periodically send event messages back down the channel. There appears to be a 100ms timeout in the code that constantly logs the error. But even though the error is logged a lot, everything appears to work, as the tests for events all pass.
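
For context, the bi-di usage pattern described here looks roughly like the sketch below (EventServiceGrpc, EventRequest and EventResponse are illustrative stand-ins for generated classes, not Coherence's actual API):

    // Minimal sketch of the bidirectional pattern, using io.grpc.stub.StreamObserver
    // and an async stub generated from a hypothetical EventService proto.
    EventServiceGrpc.EventServiceStub stub = EventServiceGrpc.newStub(channel);

    // Opening the call creates the bi-di channel; the observer passed in receives
    // event messages that the server pushes down at arbitrary times.
    StreamObserver<EventRequest> requests = stub.events(new StreamObserver<EventResponse>() {
        @Override
        public void onNext(EventResponse event) {
            // handle an event pushed by the server
        }

        @Override
        public void onError(Throwable t) {
            // the stream failed
        }

        @Override
        public void onCompleted() {
            // the server closed the stream
        }
    });

    // The client sends messages up the same channel whenever it needs to.
    requests.onNext(EventRequest.newBuilder().build());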

The code is in GrpcClientCall at line 152:

                    try {
                        frameData = clientStream().readOne(WAIT_TIME_MILLIS_DURATION);
                    } catch (StreamTimeoutException e) {
                        socket().log(LOGGER, ERROR, "[Reading thread] read timeout");
                        continue;
                    }

thegridman avatar Apr 22 '24 15:04 thegridman

That timeout was too aggressive and has now been set to 2 seconds. We shall make this configurable eventually.

spericas avatar Apr 23 '24 18:04 spericas

Will there be an option to configure it to never time out? We use a bi-directional channel to receive events related to a cache lifecycle on the client. There may be zero events from the server for the entire lifetime of the client application.

thegridman avatar Apr 24 '24 09:04 thegridman

After running the latest build through our gRPC client tests we have the following issues.

  • Still seeing constant timeout messages (which you mentioned above).
  • Large Protobuf messages do not work; it appears they cannot be read on the server.
  • The fix for streams that do not send a response and just close the observer seems to work now. What is still not working is the case where the server throws an exception; the error seems to just get lost and the call never ends on the client. To test this I added a method that specifically throws an exception, and it fails.

For the latest issues I updated the GrpcStubTest to add some reproducers. I have attached a patch file, exported from IntelliJ, which you should be able to apply to reproduce the issues. All three additional tests just hang because the client never completes the request. gRPC_Tests.patch
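
For reference, the "server throws an exception" case has roughly the shape below (FailServiceGrpc and its message types are illustrative stand-ins, not the contents of the attached patch):

    // Hypothetical reproducer: a handler that throws instead of calling observer.onError(...)
    static class FailService extends FailServiceGrpc.FailServiceImplBase {
        @Override
        public void fail(FailRequest request, StreamObserver<FailResponse> observer) {
            throw new IllegalStateException("deliberate failure");
        }
    }

    @Test
    void serverExceptionShouldCompleteTheCall() {
        // channel is assumed to target the Helidon gRPC server running FailService
        FailServiceGrpc.FailServiceBlockingStub stub = FailServiceGrpc.newBlockingStub(channel);

        // Expected: a StatusRuntimeException is raised promptly.
        // Observed with the Helidon client: the call never completes and the test hangs.
        Assertions.assertThrows(StatusRuntimeException.class,
                                () -> stub.fail(FailRequest.getDefaultInstance()));
    }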

thegridman avatar Apr 24 '24 09:04 thegridman

I have been running our gRPC client tests using a build from this PR. The last time I ran them, a few weeks ago, they were OK. Now I have a failure in our fail-over tests.

In this scenario the client has a bidirectional stream open to the server when the server is killed. When running the tests with Netty, the client's StreamObserver onError method is called with a StatusRuntimeException with a Status of Unavailable. When using Helidon this does not happen; instead Helidon appears to try to reconnect to the server the next time a message is sent up the stream by the client. This does not work in our case because the server is gone and we need to fail over to a new server on another port. Helidon does not allow us to configure multiple server addresses, so the constant reconnect attempts fail until the client eventually gets a timeout exception.

We need the client's StreamObserver onError to be called, because that stream is effectively dead. In our case we rely on state on the server that was initialised by previous messages; we cannot have a bidirectional stream transparently fail over behind our backs.

In our client, once the stream receives an error our client code can deal with reconnection: we look up the new set of server endpoints and have to create a new Helidon client, as the previous client only knows the single address we gave it. Obviously if our server was behind a load balancer that would be OK, but in a lot of cases there is no LB. With ManagedChannel in grpc-java there are various ways to pass multiple addresses or to do address lookup whenever the channel is about to connect, but Helidon does not have this. We have been able to work around this in our Helidon client code, but it relies on bidirectional (and presumably any other "stream" type) calls being properly terminated if the server disconnects.
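
To make the expectation concrete, this is roughly what the client-side handling relies on (a sketch only; the observer and reconnect logic shown are illustrative, not our actual Coherence code):

    // When the server dies, the transport should surface the failure through onError,
    // typically as a StatusRuntimeException carrying Status.UNAVAILABLE (as Netty does).
    StreamObserver<EventResponse> responses = new StreamObserver<>() {
        @Override
        public void onNext(EventResponse event) {
            // normal event handling
        }

        @Override
        public void onError(Throwable t) {
            Status status = Status.fromThrowable(t);
            if (status.getCode() == Status.Code.UNAVAILABLE) {
                // the stream is dead: look up the new endpoints and build a new client,
                // because the old client only knows the address of the dead server
            }
        }

        @Override
        public void onCompleted() {
            // the server closed the stream cleanly
        }
    };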

thegridman avatar May 17 '24 11:05 thegridman

Further to the above, I copied one of the bidirectional tests in GrpcTest and modified it to stop the WebServer between two calls that send a message. At no time does my StreamObserver receive an error, even when sending more messages; they presumably fail and I see a stack trace in the log. I could fix this by changing GrpcClientCall line 128, which currently looks like this:

            } catch (Throwable e) {
                socket().log(LOGGER, ERROR, e.getMessage(), e);
            }
            socket().log(LOGGER, DEBUG, "[Writing thread] exiting");

so that it calls the response listener like this:

            } catch (Throwable e) {
                socket().log(LOGGER, ERROR, e.getMessage(), e);
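                // surface the failure to the client's StreamObserver (onError) via the response listener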
                responseListener().onClose(Status.UNKNOWN.withDescription(e.getMessage()).withCause(e), EMPTY_METADATA);
            }
            socket().log(LOGGER, DEBUG, "[Writing thread] exiting");

At least now my client StreamObserver onError is called when the send fails. But I still have not worked out how to call it when the socket closes.
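
For completeness, the modified test has roughly the following shape (the EchoService names, stub and webServer fields are stand-ins for the actual GrpcTest code):

    // Hypothetical outline of the modified bidirectional test described above.
    @Test
    void bidiStreamShouldReportErrorWhenServerStops() throws Exception {
        CompletableFuture<Throwable> failure = new CompletableFuture<>();

        StreamObserver<EchoRequest> requests = stub.echo(new StreamObserver<EchoResponse>() {
            @Override
            public void onNext(EchoResponse value) {
            }

            @Override
            public void onError(Throwable t) {
                failure.complete(t);    // this is the callback that is never invoked today
            }

            @Override
            public void onCompleted() {
            }
        });

        requests.onNext(EchoRequest.newBuilder().setMessage("one").build());
        webServer.stop();               // stop the server while the stream is open
        requests.onNext(EchoRequest.newBuilder().setMessage("two").build());

        // Expected: onError fires shortly after the failed send (or the socket close).
        Throwable t = failure.get(10, TimeUnit.SECONDS);
        Assertions.assertNotNull(t);
    }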

thegridman avatar May 17 '24 12:05 thegridman

After the latest changes I have run our 5,500 gRPC client tests using the Helidon client from this PR and they all pass (including the fail-over tests).

thegridman avatar May 22 '24 08:05 thegridman