
onError callback not triggered by grpc after channel idle timeout expires

Open lmcdasi opened this issue 1 year ago • 7 comments

I'm encountering an issue where the onError callback is not triggered after the 3rd retry, and I do not know what the cause is.

When onError is executed, the code attempts to restart the stream. That seems to work twice, but then onError is no longer triggered, and I do not understand why.
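
The restart handling is roughly the following (a minimal sketch, not the actual ConnectorSessionHandler code; ConnectorServiceGrpc and ConnectorMessage are the generated proto classes, and both directions are assumed to use ConnectorMessage):

import io.grpc.stub.StreamObserver;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch only: restart the bidi stream from onError after a 10s delay.
class SessionObserver implements StreamObserver<ConnectorMessage> {
    private final ConnectorServiceGrpc.ConnectorServiceStub stub;
    private final ScheduledExecutorService scheduler;
    private final ConnectorMessage startSessionRequest; // pre-built start_session_request

    SessionObserver(ConnectorServiceGrpc.ConnectorServiceStub stub,
                    ScheduledExecutorService scheduler,
                    ConnectorMessage startSessionRequest) {
        this.stub = stub;
        this.scheduler = scheduler;
        this.startSessionRequest = startSessionRequest;
    }

    @Override public void onNext(ConnectorMessage message) {
        // handle server messages
    }

    @Override public void onError(Throwable t) {
        // Schedule a restart on a brand-new stream 10s after the previous one failed.
        scheduler.schedule(this::restartSession, 10, TimeUnit.SECONDS);
    }

    @Override public void onCompleted() {
        scheduler.schedule(this::restartSession, 10, TimeUnit.SECONDS);
    }

    private void restartSession() {
        // New RPC on the same channel with a fresh response observer.
        StreamObserver<ConnectorMessage> requests =
                stub.startSession(new SessionObserver(stub, scheduler, startSessionRequest));
        requests.onNext(startSessionRequest);
    }
}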

2022-07-09T14:32:48.015Z <> {session=,trace=,span=,client=,thread=grpc-thread--12} DEBUG: [c.n.m.c.ConnectorSessionHandler] _Processing onError_
2022-07-09T14:32:48.015Z <> {session=,trace=,span=,client=,thread=grpc-thread--12} DEBUG: [c.n.m.c.ConnectorSessionHandler] Cancelling startSessionVerificationScheduledFuture
2022-07-09T14:32:48.015Z <> {session=,trace=,span=,client=,thread=grpc-thread--12} INFO: [c.n.m.c.ConnectorSessionHandler] _Restarting session in 10s_
2022-07-09T14:32:58.016Z <> {session=,trace=,span=,client=,thread=grpc-thread--12} INFO: [c.n.m.c.ConnectorSessionHandler] _Restarting session with restart count 2, max restart count 5_
2022-07-09T14:32:58.016Z <> {session=,trace=,span=,client=,thread=grpc-thread--12} INFO: [c.n.m.c.c.ConnectorServiceClient] _Starting session 249b936d-e01c-413b-aa38-dd951ee70794_
2022-07-09T14:32:58.016Z <> {session=,trace=,span=,client=,thread=grpc-thread--12} DEBUG: [c.n.m.c.ConnectorSessionHandler] startSessionVerificationScheduledFuture scheduled in 10000ms
2022-07-09T14:32:58.017Z <> {session=,trace=,span=,client=,thread=grpc-thread--12} INFO: [c.n.m.c.ConnectorSessionHandler] ====> Sending ConnectorMessage start_session_request {_  connector_session_id: "249b936d-e01c-413b-aa38-dd951ee70794"_}_
2022-07-09T14:33:01.576Z <> {session=,trace=,span=,client=,thread=grpc-thread--0} ERROR: [c.n.m.c.c.ConnectorManagerStreamObserver] Exception during session with status Status{code=UNAVAILABLE, description=Network closed for unknown reason, cause=null}.
io.grpc.StatusRuntimeException: UNAVAILABLE: Network closed for unknown reason
        at io.grpc.Status.asRuntimeException(Status.java:535)
        at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:478)
        at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
        at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
        at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
        at net.devh.boot.grpc.client.metric.MetricCollectingClientCallListener.onClose(MetricCollectingClientCallListener.java:59)
        at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
        at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
        at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
        at io.opentracing.contrib.grpc.TracingClientInterceptor$1$2.onClose(TracingClientInterceptor.java:228)
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562)
        at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
2022-07-09T14:33:01.576Z <> {session=,trace=,span=,client=,thread=grpc-thread--0} DEBUG: [c.n.m.c.ConnectorSessionHandler] _Processing onError_
2022-07-09T14:33:01.577Z <> {session=,trace=,span=,client=,thread=grpc-thread--0} DEBUG: [c.n.m.c.ConnectorSessionHandler] Cancelling startSessionVerificationScheduledFuture
2022-07-09T14:33:01.577Z <> {session=,trace=,span=,client=,thread=grpc-thread--0} INFO: [c.n.m.c.ConnectorSessionHandler] _Restarting session in 10s_
2022-07-09T14:33:11.578Z <> {session=,trace=,span=,client=,thread=grpc-thread--0} INFO: [c.n.m.c.ConnectorSessionHandler] _Restarting session with restart count 3, max restart count 5_
2022-07-09T14:33:11.578Z <> {session=,trace=,span=,client=,thread=grpc-thread--0} INFO: [c.n.m.c.c.ConnectorServiceClient] _Starting session 249b936d-e01c-413b-aa38-dd951ee70794_
2022-07-09T14:33:11.579Z <> {session=,trace=,span=,client=,thread=grpc-thread--0} DEBUG: [c.n.m.c.ConnectorSessionHandler] startSessionVerificationScheduledFuture scheduled in 10000ms
2022-07-09T14:33:11.579Z <> {session=,trace=,span=,client=,thread=grpc-thread--0} INFO: [c.n.m.c.ConnectorSessionHandler] ====> Sending ConnectorMessage start_session_request {_  connector_session_id: "249b936d-e01c-413b-aa38-dd951ee70794"_}_
2022-07-09T14:33:14.361Z <> {session=,trace=,span=,client=,thread=grpc-thread--7} DEBUG: [i.g.n.shaded.io.grpc.netty.Utils] Using custom allocator: forceHeapBuffer=false, defaultPreferDirect=true
2022-07-09T14:33:16.399Z <> {session=,trace=,span=,client=,thread=grpc-default-worker-ELG-1-6} DEBUG: [i.g.n.s.i.g.n.NettyClientHandler] [id: 0xcd20a2c6, L:/172.19.0.2:59784 - R:192.168.1.86/192.168.1.86:8081] OUTBOUND SETTINGS: ack=false settings={ENABLE_PUSH=0, MAX_CONCURRENT_STREAMS=0, INITIAL_WINDOW_SIZE=1048576, MAX_HEADER_LIST_SIZE=8192}
2022-07-09T14:33:16.401Z <> {session=,trace=,span=,client=,thread=grpc-default-worker-ELG-1-6} DEBUG: [i.g.n.s.i.g.n.NettyClientHandler] [id: 0xcd20a2c6, L:/172.19.0.2:59784 - R:192.168.1.86/192.168.1.86:8081] OUTBOUND WINDOW_UPDATE: streamId=0 windowSizeIncrement=983041
2022-07-09T14:33:19.349Z <> {session=,trace=,span=,client=,thread=grpc-default-worker-ELG-1-7} DEBUG: [i.g.n.s.i.g.n.NettyClientHandler] _Network channel being closed by the application_.
2022-07-09T14:33:19.350Z <> {session=,trace=,span=,client=,thread=grpc-default-worker-ELG-1-7} DEBUG: [i.g.n.s.i.g.n.NettyClientHandler] [id: 0x2b4c87ba, L:/172.19.0.2:59778 - R:192.168.1.86/192.168.1.86:8081] OUTBOUND GO_AWAY: lastStreamId=0 errorCode=0 length=0 bytes=
2022-07-09T14:33:19.353Z <> {session=,trace=,span=,client=,thread=grpc-default-worker-ELG-1-7} DEBUG: [i.g.n.s.i.g.n.NettyClientHandler] Network channel is closed

onError is no longer triggered. Instead I have to rely on a separate future task that verifies whether the session has been restored.
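
That verification task is essentially a scheduled future that forces a restart if the session is not confirmed in time. A minimal sketch, assuming a sessionEstablished flag and a restart callback (names are illustrative, not the real code):

import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the fallback watchdog: check 10s after sending start_session_request
// and force a restart if the session was never confirmed.
class SessionWatchdog {
    private final ScheduledExecutorService scheduler;
    private final AtomicBoolean sessionEstablished;
    private final Runnable restartSession;

    SessionWatchdog(ScheduledExecutorService scheduler,
                    AtomicBoolean sessionEstablished,
                    Runnable restartSession) {
        this.scheduler = scheduler;
        this.sessionEstablished = sessionEstablished;
        this.restartSession = restartSession;
    }

    ScheduledFuture<?> scheduleStartSessionVerification() {
        return scheduler.schedule(() -> {
            if (!sessionEstablished.get()) {
                restartSession.run();  // onError never came, so restart from here instead
            }
        }, 10_000, TimeUnit.MILLISECONDS);
    }
}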

2022-07-09T14:33:21.579Z <> {session=,trace=,span=,client=,thread=start-session-status-0} INFO: [c.n.m.c.ConnectorSessionHandler] Restarting session in 10s
2022-07-09T14:33:31.579Z <> {session=,trace=,span=,client=,thread=start-session-status-0} INFO: [c.n.m.c.ConnectorSessionHandler] Restarting session with restart count 4, max restart count 5

From this retry onwards, onError is no longer triggered. If I restore the connection in the meantime, the client is able to pick up again.

I do not understand why onError stops working.

lmcdasi (Jul 09 '22)

I have changed my code to handle the stream re-creation a bit differently, and I also added keepalive.

Again, at the same 3rd iteration, this time the keepalive reacts and triggers onError.

While I understand the importance of keepalive, the way this is handled seems inconsistent to me.

Why did I not have to rely on keepalive for 2 iterations, but suddenly on the 3rd I do?
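
For reference, the keepalive is configured on the channel builder roughly like this (a sketch; the address and timing values are only examples, not my actual configuration):

import io.grpc.ManagedChannel;
import io.grpc.netty.shaded.io.grpc.netty.NettyChannelBuilder;
import java.util.concurrent.TimeUnit;

class ChannelFactory {
    // Enable client-side HTTP/2 keepalive so a dead connection fails fast
    // instead of the stream silently hanging.
    static ManagedChannel buildChannelWithKeepalive() {
        return NettyChannelBuilder.forAddress("192.168.1.86", 8081)
                .keepAliveTime(30, TimeUnit.SECONDS)     // send a PING after 30s with no reads
                .keepAliveTimeout(10, TimeUnit.SECONDS)  // fail the transport if the PING ack never arrives
                .keepAliveWithoutCalls(true)             // keep pinging even when no RPC is active
                .usePlaintext()
                .build();
    }
}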

2022-07-09T16:12:19.600Z <> {session=,trace=,span=,client=,thread=restart-session-0} INFO: [c.n.m.c.ConnectorSessionHandler] Restarting session in 10s
2022-07-09T16:12:29.600Z <> {session=,trace=,span=,client=,thread=restart-session-0} INFO: [c.n.m.c.ConnectorSessionHandler] Restarting session with restart count 3, max restart count 5
2022-07-09T16:12:29.601Z <> {session=,trace=,span=,client=,thread=restart-session-0} INFO: [c.n.m.c.c.ConnectorServiceClient] Starting session c1f2beb2-8f1d-41f7-80bf-b4b9c9a467df
2022-07-09T16:12:29.601Z <> {session=,trace=,span=,client=,thread=restart-session-0} DEBUG: [c.n.m.c.ConnectorSessionHandler] startSessionVerificationScheduledFuture scheduled in 10000ms
2022-07-09T16:12:29.601Z <> {session=,trace=,span=,client=,thread=restart-session-0} INFO: [c.n.m.c.ConnectorSessionHandler] ====> Sending ConnectorMessage start_session_request {_  connector_session_id: "c1f2beb2-8f1d-41f7-80bf-b4b9c9a467df"_}_
2022-07-09T16:12:31.348Z <> {session=,trace=,span=,client=,thread=grpc-thread--12} DEBUG: [i.g.n.shaded.io.grpc.netty.Utils] Using custom allocator: forceHeapBuffer=false, defaultPreferDirect=true
2022-07-09T16:12:31.868Z <> {session=,trace=,span=,client=,thread=grpc-default-worker-ELG-1-1} DEBUG: [i.g.n.s.i.g.n.NettyClientHandler] [id: 0xcb7b9f2d, L:/172.19.0.2:60772 - R:192.168.1.86/192.168.1.86:8081] OUTBOUND PING: ack=false bytes=1111
2022-07-09T16:12:33.398Z <> {session=,trace=,span=,client=,thread=grpc-default-worker-ELG-1-6} DEBUG: [i.g.n.s.i.g.n.NettyClientHandler] [id: 0x5a2676e8, L:/172.19.0.2:60780 - R:192.168.1.86/192.168.1.86:8081] OUTBOUND SETTINGS: ack=false settings={ENABLE_PUSH=0, MAX_CONCURRENT_STREAMS=0, INITIAL_WINDOW_SIZE=1048576, MAX_HEADER_LIST_SIZE=8192}
2022-07-09T16:12:33.398Z <> {session=,trace=,span=,client=,thread=grpc-default-worker-ELG-1-6} DEBUG: [i.g.n.s.i.g.n.NettyClientHandler] [id: 0x5a2676e8, L:/172.19.0.2:60780 - R:192.168.1.86/192.168.1.86:8081] OUTBOUND WINDOW_UPDATE: streamId=0 windowSizeIncrement=983041
2022-07-09T16:12:33.983Z <> {session=,trace=,span=,client=,thread=command-status-0} WARN: [c.n.m.c.ConnectorStatusHandler] No StatusCmd received within 40 - session restart triggered
2022-07-09T16:12:33.983Z <> {session=,trace=,span=,client=,thread=command-status-0} WARN: [c.n.m.c.ConnectorSessionHandler] RestartSession in progress from onError/onCompleted. Abort session restart
2022-07-09T16:12:35.798Z <> {session=,trace=,span=,client=,thread=grpc-default-worker-ELG-1-1} DEBUG: [i.g.n.s.i.g.n.NettyClientHandler] Network channel being closed by the application.
2022-07-09T16:12:35.798Z <> {session=,trace=,span=,client=,thread=grpc-default-worker-ELG-1-1} DEBUG: [i.g.n.s.i.g.n.NettyClientHandler] [id: 0xcb7b9f2d, L:/172.19.0.2:60772 - R:192.168.1.86/192.168.1.86:8081] OUTBOUND GO_AWAY: lastStreamId=0 errorCode=0 length=0 bytes=
2022-07-09T16:12:35.799Z <> {session=,trace=,span=,client=,thread=grpc-default-worker-ELG-1-1} DEBUG: [i.g.n.s.i.g.n.NettyClientHandler] Network channel is closed
2022-07-09T16:13:01.348Z <> {session=,trace=,span=,client=,thread=grpc-default-worker-ELG-1-6} DEBUG: [i.g.n.s.i.g.n.NettyClientHandler] [id: 0x5a2676e8, L:/172.19.0.2:60780 - R:192.168.1.86/192.168.1.86:8081] OUTBOUND PING: ack=false bytes=1111
2022-07-09T16:13:16.350Z <> {session=,trace=,span=,client=,thread=grpc-default-worker-ELG-1-6} DEBUG: [i.g.n.shaded.io.grpc.netty.Utils] Using custom allocator: forceHeapBuffer=false, defaultPreferDirect=true
2022-07-09T16:13:16.350Z <> {session=,trace=,span=,client=,thread=grpc-default-worker-ELG-1-6} DEBUG: [i.g.n.s.i.g.n.NettyClientHandler] Network channel being closed by the application.
2022-07-09T16:13:16.351Z <> {session=,trace=,span=,client=,thread=grpc-default-worker-ELG-1-6} DEBUG: [i.g.n.s.i.g.n.NettyClientHandler] [id: 0x5a2676e8, L:/172.19.0.2:60780 - R:192.168.1.86/192.168.1.86:8081] OUTBOUND GO_AWAY: lastStreamId=0 errorCode=0 length=0 bytes=
2022-07-09T16:13:16.351Z <> {session=,trace=,span=,client=,thread=grpc-default-worker-ELG-1-6} DEBUG: [i.g.n.s.i.g.n.NettyClientHandler] Network channel is closed
2022-07-09T16:13:18.383Z <> {session=,trace=,span=,client=,thread=grpc-thread--2} ERROR: [c.n.m.c.c.ConnectorManagerStreamObserver] Exception during session with status Status{code=UNAVAILABLE, description=Keepalive failed. The connection is likely gone, cause=null}.
io.grpc.StatusRuntimeException: UNAVAILABLE: Keepalive failed. The connection is likely gone
        at io.grpc.Status.asRuntimeException(Status.java:535)
        at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:478)
        at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
        at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
        at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
        at net.devh.boot.grpc.client.metric.MetricCollectingClientCallListener.onClose(MetricCollectingClientCallListener.java:59)
        at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
        at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
        at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
        at io.opentracing.contrib.grpc.TracingClientInterceptor$1$2.onClose(TracingClientInterceptor.java:228)
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562)
        at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
2022-07-09T16:13:18.383Z <> {session=,trace=,span=,client=,thread=grpc-thread--2} DEBUG: [c.n.m.c.ConnectorSessionHandler] Processing onError

lmcdasi (Jul 09 '22)

I think the error occurs because the gRPC idle timeout value is 30s, which is close to 3 * ~10s retries.

If I increase that value to 60s (> 5 * 10s), then the onError callback is triggered every time.

Is that the expected behavior from gRPC? Is there a document explaining the correct setup that I missed?

lmcdasi (Jul 10 '22)

Shouldn't this call:

ClientCalls.asyncBidiStreamingCall(this.getChannel().newCall(ConnectorServiceGrpc.getStartSessionMethod(), this.getCallOptions()), var1);

reset the channel idle timeout when the connection has been lost and not yet re-established?

lmcdasi (Jul 11 '22)

You talk about "two calls to onError but missing a third." What errors do you see for those two calls? I see one log has a "Network closed for unknown reason", but that's just one error. What was the second one?

If you are seeing frequent "Network closed for unknown reason," and now you see keepalive helping, then you are fighting your network. A lower keepalive time may prevent the failures.

> I think the error occurs because the gRPC idle timeout value is 30s, which is close to 3 * ~10s retries.

I'm not sure what setting you are talking about. gRPC idle timeout defaults to 30 minutes, and it is based on how long no RPCs have existed on the channel. If keepalive is causing the third RPC to fail, then there is a connection and the channel is not idle.

> If I increase that value to 60s (> 5 * 10s), then the onError callback is triggered every time.

This really doesn't sound like gRPC idle timeout. What API in particular did you call?

ejona86 (Jul 14 '22)

Thank you for your answer.

It is a weird issue that I do not understand. I have not set any keepalives yet. I run all three 'parts' on the same VM: client, envoy, and the backend, collocated (not a real prod env, but it's easy to test). I stop envoy to simulate a connection loss.

The gRPC client detects the error and triggers the onError callback. In the error callback the code starts a new RPC with a new stream observer. That works 3 times, until the channel idle timeout expires (a coincidence, but increasing it to 60s 'solves' the issue). After that it gets stuck and never completes the retries, nor does it shut down.

The weird part is that if I move the client to a different VM, then the onError callback is always executed and I do not have this issue. Or at least I have not been able to reproduce it so far.

The main difference is that on a single VM I do not get the "Connection refused" error ("finishConnect(..) failed: Connection refused: /10.1.193.182:8081").

In the non-collocated logs there is an extra trace that is missing from the collocated logs:

2022-07-14T17:53:32.609Z <> {session=,trace=,span=,client=,thread=grpc-thread--2} ERROR: [c.n.m.c.c.ConnectorManagerStreamObserver] Exception during session with status Status{code=UNAVAILABLE, description=io exception, cause=io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /10.1.193.182:8081
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
        at io.grpc.netty.shaded.io.netty.channel.unix.Errors.newConnectException0(Errors.java:155)
        at io.grpc.netty.shaded.io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:128)
        at io.grpc.netty.shaded.io.netty.channel.unix.Socket.finishConnect(Socket.java:320)
        at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710)
        at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:687)
        at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567)
        at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:470)
        at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
        at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
        at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:829)}.
io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
        at io.grpc.Status.asRuntimeException(Status.java:535)
        at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:478)
        at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
        at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
        at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
        at net.devh.boot.grpc.client.metric.MetricCollectingClientCallListener.onClose(MetricCollectingClientCallListener.java:59)
        at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
        at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
        at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
        at io.opentracing.contrib.grpc.TracingClientInterceptor$1$2.onClose(TracingClientInterceptor.java:228)
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562)
        at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /10.1.193.182:8081
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
        at io.grpc.netty.shaded.io.netty.channel.unix.Errors.newConnectException0(Errors.java:155)
        at io.grpc.netty.shaded.io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:128)
        at io.grpc.netty.shaded.io.netty.channel.unix.Socket.finishConnect(Socket.java:320)
        at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710)
        at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:687)
        at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567)
        at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:470)
        at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
        at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
        at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:829)

lmcdasi (Jul 14 '22)

> I run all three 'parts' on the same VM

Uh.

> The weird part is that if I move the client to a different VM, then the onError callback is always executed

Uhhhhh... I agree that makes little sense. Very, very strange.

> If I increase that value to 60s (> 5 * 10s), then the onError callback is triggered every time.

What API did you call to change that?

ejona86 (Jul 14 '22)

I used the Spring Boot gRPC project, which exposes a GrpcChannelConfigurer with which I can set the channel idle timeout.

public GrpcChannelConfigurer channelConfigurer() {
        return (channelBuilder, name) -> {
            LOGGER.info("Configuring grpc channel {}", name);
            if (channelIdleTimeoutSeconds > 0) {
                LOGGER.info("Setting grpc channel {} idle timeout to {} seconds", name, channelIdleTimeoutSeconds);
                channelBuilder.idleTimeout(channelIdleTimeoutSeconds, TimeUnit.SECONDS);
            }
        };
}

lmcdasi (Jul 14 '22)

@lmcdasi, as ejona pointed out, you were shrinking the value from 30 minutes to 60 seconds rather than increasing it, unless something else is changing the default idleTimeout.
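
To illustrate, relative to the 30-minute default (the values below are only examples):

channelBuilder.idleTimeout(60, TimeUnit.SECONDS);   // goes idle sooner than the 30-minute default, i.e. shrinks it
channelBuilder.idleTimeout(45, TimeUnit.MINUTES);   // a value above 30 minutes would actually increase it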

Did you ever get resolution to this or is it still an issue?

larry-safran (Jan 24 '23)

We don't see anything more we can do on this, and nobody else has reported this problem.

If anything more comes up, either reply on this bug or open a new one.

larry-safran (Feb 06 '23)