
Understanding effective connection attempt timeout with heartbeat and idle settings

Open scwhittle opened this issue 2 weeks ago • 2 comments

I am trying to reduce latency in detecting failed channel connections and reattempting connections sooner.

I am using the Cloud Spanner client library code, which uses the Google API gax library code to connect to the Spanner service. There was a period where RPCs sent on a newly created channel all failed after their deadline with errors like: DEADLINE_EXCEEDED: CallOptions deadline exceeded after 29.996032979s. Name resolution delay 0.002361332 seconds. [closed=[], open=[[buffered_nanos=29996563395, waiting_for_connection]]]. The waiting_for_connection status indicates that the connection had not yet been established and that RPCs were instead buffered waiting for the connection until they failed due to the deadline. This state lasted for 20 minutes, until the process was restarted.

The following options are used when creating the channel:

keepAliveTime=2m
keepAliveTimeout=default
keepAliveWithoutCalls=false
idleTimeout=default (30m)
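
For reference, a minimal sketch of how these options map onto grpc-java's public channel-builder API. The target and wrapper class are illustrative only; the real channel is constructed by the gax/Spanner client libraries.

  import io.grpc.ManagedChannel;
  import io.grpc.ManagedChannelBuilder;
  import java.util.concurrent.TimeUnit;

  public class ChannelConfigSketch {
    public static ManagedChannel build() {
      // Hypothetical target; the real channel is created by the gax/Spanner client libraries.
      return ManagedChannelBuilder.forTarget("spanner.googleapis.com:443")
          .keepAliveTime(2, TimeUnit.MINUTES)    // keepAliveTime=2m
          .keepAliveWithoutCalls(false)          // the default: ping only while RPCs are active
          .idleTimeout(30, TimeUnit.MINUTES)     // matches the 30m default
          .build();                              // keepAliveTimeout is left at its default
    }
  }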

I see that keepAliveWithoutCalls being disabled means that keepalives are not sent when the transport is started (see KeepAliveManager). This suggests that keepAliveTime won't help with detecting slow/stuck connection attempts. It then seems it may take up to the idleTimeout of 30 minutes for the channel to move from CONNECTING to IDLE, detect the stuck CONNECTING state, and reattempt connecting on the next call.

  • Are there other mechanisms that would detect and exit a long CONNECTING state sooner with this configuration? I see that Netty has ChannelOption.CONNECT_TIMEOUT_MILLIS, but this doesn't appear to be set by gRPC itself (or gax); a sketch of setting it directly follows after this list.
  • Though setting keepAliveWithoutCalls=true may reduce the detection time to the keep-alive time, it also changes behavior when the channel is not connecting. As a user, since there are attempted calls pending, it seems like keep-alives should be sent in this case. Is this a bug that should be addressed?
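
To illustrate the first question above, a hedged sketch of how the Netty-level connect timeout could be set when an application builds the channel directly. This is illustrative only, since gax normally constructs the channel, and the 10-second value is an arbitrary example.

  import io.grpc.ManagedChannel;
  import io.grpc.netty.NettyChannelBuilder;
  import io.netty.channel.ChannelOption;

  public class ConnectTimeoutSketch {
    public static ManagedChannel build() {
      // Bounds only TCP connection establishment, not TLS/ALTS or the HTTP/2 SETTINGS exchange.
      return NettyChannelBuilder.forTarget("spanner.googleapis.com:443")
          .withOption(ChannelOption.CONNECT_TIMEOUT_MILLIS, 10_000)  // arbitrary 10s example
          .build();
    }
  }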

scwhittle commented Dec 11 '25 09:12

Hi,

To give you the short answer first: This is not a bug. The behavior you are observing (stuck in CONNECTING with waiting_for_connection) happens before a transport exists, so gRPC KeepAlives (which are HTTP/2 frames) cannot physically be sent yet.

To solve the "stuck" connecting state, you need to tune the Transport Layer (TCP) Connect Timeout, as you rightly pointed out, not the HTTP/2 KeepAlive.

  1. Why keepAliveWithoutCalls Doesn't Help Here

The waiting_for_connection state means your client is still trying to perform the TCP Handshake or TLS Handshake with the Spanner backend.

KeepAlive (HTTP/2 PING): This is a frame sent inside an established encrypted tunnel. It asks the server, "Are you still there?"

Connecting State: The tunnel does not exist yet. There is no path on which to send the PING.

Therefore, setting keepAliveWithoutCalls=true would not change anything for this specific error. It only helps if the connection successfully establishes, stays idle, and then the network silently drops it (e.g., a firewall killing a quiet connection).
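
For contrast, a sketch of the configuration that targets that established-but-silently-dropped case. The values are arbitrary examples, not recommendations, and the endpoint is hypothetical.

  import io.grpc.ManagedChannel;
  import io.grpc.ManagedChannelBuilder;
  import java.util.concurrent.TimeUnit;

  public class IdleKeepAliveSketch {
    public static ManagedChannel build() {
      return ManagedChannelBuilder.forTarget("example.googleapis.com:443")
          .keepAliveTime(2, TimeUnit.MINUTES)      // ping an established connection after 2m of inactivity
          .keepAliveTimeout(20, TimeUnit.SECONDS)  // declare it dead if the ping ack does not arrive in time
          .keepAliveWithoutCalls(true)             // keep pinging even when no RPCs are in flight
          .build();
    }
  }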

  2. The Real Culprit: TCP "Black Holes"

The error DEADLINE_EXCEEDED after 30s with waiting_for_connection indicates your RPC deadline fired while the channel was still trying to connect.

The fact that this persisted for 20 minutes suggests something amiss.

AgraVator commented Dec 11 '25 14:12

Thanks for your response.

There is still something that doesn't make sense to me. When keepAliveWithoutCalls=true, the KeepAliveManager is started earlier, during onTransportStart. AFAICT this may also be before the TCP or TLS handshake has completed, since it is started after connect() is called but that call is async. In that case there won't be a way to actually send the keepalive, but the scheduled ping could perhaps detect the stuck connection. This is why I was wondering whether keepAliveWithoutCalls would make a difference here.

I agree that there is likely something wrong with the networking, but I am trying to make the gRPC configuration more resilient to such issues. It is unfortunate if the Netty-specific timeout is needed, as keep-alives seem like the general gRPC concept that should help here.

scwhittle commented Dec 11 '25 14:12

Keepalive shouldn't be used to limit the connection establishment time. Right now, each step in the handshaking should have its own timer. ChannelOption.CONNECT_TIMEOUT_MILLIS only covers the connect() syscall, and the kernel has a default that is generally close to 10-20s, so that option isn't needed. But you still have TLS/ALTS-level handshaking and HTTP/2.

The kernel handles the timeout for connect(), and Netty handles the TLS timer (OkHttp properly uses setSoTimeout for HTTP CONNECT, but doesn't for the TLS handshake). It looks like there's a hole in grpc-netty waiting on the HTTP/2 handshake (ditto in grpc-okhttp). That may not matter much when using TLS if the SETTINGS is properly included in the TLS handshake packet (I suspect that it is, because "why hasn't someone noticed this earlier?").

I also don't see a timeout for ALTS.

ALTS and grpc-okhttp TLS are both pretty important to have timeouts, and we should also fix the lack of an HTTP/2 SETTINGS timeout while we're looking here.

(I wouldn't mind someone verifying my findings. In particular, I didn't see any connection timeout handling in InternalSubchannel, so I don't think it is centralized like on the server side. Handling it centrally in InternalSubchannel wouldn't be a bad approach to the fix; we purposely avoided that previously because we couldn't agree on cross-language behavior, but with happy eyeballs it may be time to revisit. We could finally implement https://github.com/grpc/grpc-java/issues/1943)
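
Until such timeouts exist, one possible application-level stopgap is to watch the channel's connectivity state and force it back to IDLE if CONNECTING lingers. A rough sketch using only public ManagedChannel APIs; the 30-second threshold is arbitrary, and enterIdle() is an experimental API.

  import io.grpc.ConnectivityState;
  import io.grpc.ManagedChannel;
  import java.util.concurrent.Executors;
  import java.util.concurrent.ScheduledExecutorService;
  import java.util.concurrent.TimeUnit;

  public class ConnectingWatchdog {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    /** Watches the channel and abandons a connection attempt that lingers in CONNECTING. */
    public void watch(ManagedChannel channel) {
      ConnectivityState state = channel.getState(false);
      if (state == ConnectivityState.CONNECTING) {
        // If the channel is still CONNECTING 30s from now, drop back to IDLE; the next RPC
        // (or getState(true)) triggers fresh name resolution and a new connection attempt.
        scheduler.schedule(() -> {
          if (channel.getState(false) == ConnectivityState.CONNECTING) {
            channel.enterIdle();
          }
        }, 30, TimeUnit.SECONDS);
      }
      // Re-arm the watchdog on every state change.
      channel.notifyWhenStateChanged(state, () -> watch(channel));
    }
  }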

ejona86 commented Dec 19 '25 02:12

Thanks! So it sounds like there are some places that are perhaps missing timeouts that could lead to this stuck behavior.

If keepAliveWithoutCalls isn't the proper way to enforce these timeouts/handshakes long term, would it still be a reasonable mitigation while the other missing timeouts are addressed?

scwhittle commented Dec 19 '25 08:12

There are things to improve, but I still haven't seen any evidence you are impacted by them. There are much more ordinary ways to see 30-second delays when establishing connections with pick-first. For example, if you believe CPU starvation is an issue, then the timeouts discussed here don't really matter; they are for networking issues.

Cross-referencing b/462499883

ejona86 commented Dec 19 '25 15:12

In this instance the connection was stuck for 20 minutes, not 30 seconds, following a high-CPU period when it was created. This is the originally observed instance, without much logging to debug further, not the attempt to reproduce (where only 30-second delays are observed).

Other VMs in the same instance group and using the same networking were working fine during this period. Of course, without logging/tracing it will be difficult to say exactly what happened; I am just trying to rule this out as a possibility and, if possible, see whether some settings could be changed to fix it.

scwhittle commented Dec 19 '25 15:12

In b/462499883 some logs of a reproduction were shared with me. Everything looked right from the gRPC side, but RPCs were definitely not being sent and the DEADLINE_EXCEEDED kept saying waiting_for_connection. My best theory is that the call credentials got wedged, so I sent out https://github.com/grpc/grpc-java/pull/12576 to improve the error messages to confirm.

It looks like there's a hole in grpc-netty waiting on the HTTP/2 handshake (ditto in grpc-okhttp). That may not matter much when using TLS if the SETTINGS is properly included in the TLS handshake packet (I suspect that it is, because "why hasn't someone noticed this earlier?").

I just checked with Wireshark, and in the server → client exchange with a grpc-netty server, Application Data (which would contain SETTINGS) was in the same packet as Change Cipher Spec. So that could explain why we have been seeing it as reliable.

ejona86 commented Dec 22 '25 19:12

ALTS and grpc-okhttp TLS are both pretty important to have timeouts

I found a 20 second timeout for ALTS: https://github.com/grpc/grpc-java/blob/ead532b3910fbb3b005de59a958c6565b1e859aa/alts/src/main/java/io/grpc/alts/internal/AltsHandshakerStub.java#L85

ejona86 commented Dec 22 '25 21:12