Allow resetting global subchannel pool
We are hitting an edge case using grpc-js in AWS Lambda. There is a very small chance that the first few subchannels created in the Lambda never receive a connect event from the socket and hang forever.
We tried a few workarounds, but none of them fixes the issue other than recreating the subchannel. There is, however, no API that lets us do so.
On the other hand, AWS Lambda allows multiple invocations to reuse the same global subchannel pool. If that one global pool ends up holding dead connections, all subsequent invocations hang forever.
We figured it would be very helpful to allow call sites to recreate the global subchannel pool when necessary.
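A minimal sketch of the behavior being requested, using illustrative names rather than the actual grpc-js pool implementation: a shared pool that call sites can clear, so later lookups build fresh subchannels instead of reusing possibly-dead ones.

```javascript
// Illustrative sketch only; this is not the real grpc-js SubchannelPool class.
class SubchannelPool {
  constructor() {
    this.subchannels = new Map(); // keyed by target/options, as a real pool would be
  }
  // Reuse an existing subchannel for the key, or create one.
  getOrCreate(key, create) {
    if (!this.subchannels.has(key)) {
      this.subchannels.set(key, create());
    }
    return this.subchannels.get(key);
  }
  // The new capability this PR asks for: drop everything so the
  // next invocation reconnects from scratch.
  reset() {
    this.subchannels.clear();
  }
}
```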
The committers are authorized under a signed CLA.
- :white_check_mark: Huayang Guo (35bf7d4a002b29c6e221e045e009b2bbad08f0f2)
I would like to address this problem, but I don't think this is the right solution. I think the proper way to handle this is by modifying the subchannel code to prevent this from getting stuck in that state. So, I think we should do one or both of the following:
- Add a timeout to connection attempts. This is currently not implemented under the assumption that Node will take care of it, but if events can be missed that doesn't always work. The current design for this is in this document, which otherwise mostly matches subchannel behavior. The simple solution is to add a hardcoded 20 second timeout, and the more complex solution is to try to factor in the backoff timer.
- Add a check somewhere in the subchannel code to check that the state of the internal http2 stream is consistent with the reported subchannel state, and change the subchannel state if it is not. For example, if the subchannel reports that it is CONNECTING, but `stream.connecting` is false, switch to the appropriate other state. I think this could fit in `getConnectivityState`, `startCallStream`, or a new periodic interval.
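The two options above could be sketched roughly as follows. All names here (`connectWithTimeout`, `reconcileState`, `startConnecting`, the 20-second constant) are illustrative, not actual grpc-js internals:

```javascript
// Sketch of option 1: guard a connection attempt with a timer so a missed
// socket event cannot leave the subchannel CONNECTING forever.
const DEFAULT_CONNECT_TIMEOUT_MS = 20000; // the hardcoded 20s default discussed above

function connectWithTimeout(startConnecting, timeoutMs, onResult) {
  let settled = false;
  const timer = setTimeout(() => {
    if (!settled) {
      settled = true;
      // e.g. transition the subchannel to TRANSIENT_FAILURE and retry later
      onResult(new Error('connection attempt timed out'));
    }
  }, timeoutMs);
  startConnecting((err) => {
    if (!settled) {
      settled = true;
      clearTimeout(timer);
      onResult(err ?? null);
    }
  });
}

// Sketch of option 2: reconcile the reported subchannel state with the
// actual socket. State strings mirror the gRPC connectivity states.
function reconcileState(reportedState, stream) {
  if (reportedState === 'CONNECTING' && !stream.connecting) {
    // The socket is no longer connecting but the event was missed:
    // move to whichever state matches the socket's real condition.
    return stream.destroyed ? 'TRANSIENT_FAILURE' : 'READY';
  }
  return reportedState;
}
```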
@murgatroid99 Thanks for the advice!
Yes, I am happy to implement the connection timeout. I was hesitating because it's not part of any standard gRPC options that I could find in the core protocol. Making 20 seconds the default could work, but what about also making it configurable with a parameter like grpc.max_subchannel_connect_timeout (btw, max sounds better than min here, right?)? WDYT?
I see the point of making stream.connecting consistent with the subchannel state. I'm not sure I can capture all the scenarios, but I am happy to submit a version and let you take a look.
Anyway, I plan to submit a revision by Monday.
As a side note, I can see in our deployment that resetting the global subchannel pool isn't the best solution, since the connection could still be disrupted in the middle of an invocation. Implementing this at the subchannel level is more flexible.
OK, it looks like there is an existing channel argument "grpc.min_reconnect_backoff_ms" that actually sets the MIN_CONNECT_TIMEOUT parameter. It's called MIN and not MAX because in the algorithm I linked, if the backoff time is greater, it uses that value instead.
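The interaction described above can be sketched as a one-liner: the effective timeout for a connection attempt is MIN_CONNECT_TIMEOUT, unless the current backoff is larger, in which case the backoff value is used instead. The constant values below are illustrative:

```javascript
// Effective timeout for a single connection attempt, per the backoff
// algorithm discussed above: take whichever of the two values is larger.
// "grpc.min_reconnect_backoff_ms" is the channel argument that would feed
// minConnectTimeoutMs here.
function connectAttemptTimeoutMs(minConnectTimeoutMs, currentBackoffMs) {
  return Math.max(minConnectTimeoutMs, currentBackoffMs);
}
```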
I have a good idea of how to implement the synchronization code I mentioned, so I can add that myself.
You are right. It's a bit different from what I assumed, but I understand how the parameters are set now.
If you could help implement the algorithm, that would be fantastic. I am always happy to make a commit (with my understanding) or stress-test the changes in AWS. Just let me know. Thanks!
Just one more scenario that I saw today where the MIN_CONNECT_TIMEOUT case would work better:
- The connection seems healthy at the beginning.
- Right before call_stream sends metadata, all subchannels transition from READY to IDLE, so the call stream is queued.
- The subchannel reconnect hits an issue and, without MIN_CONNECT_TIMEOUT, hangs forever.
- So the call_stream hangs forever.
This could also happen more often with Lambda, since the transition from READY to IDLE is delayed until the Lambda VM resumes, which easily overlaps with the client status check.
I don't understand the relevance of any part of that scenario other than "the subchannel reconnecting hits an issue and hangs forever without MIN_CONNECT_TIMEOUT". That is the primary reason for having this timer.
If you want to guarantee that a call ends in a reasonable amount of time, you should set a deadline on it.
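For reference, a deadline in grpc-js is an absolute point in time (a Date or a millisecond timestamp) passed via the call options. The helper below is hypothetical, and the client/method names in the usage comment are placeholders for a generated client:

```javascript
// Hypothetical helper: turn a relative timeout into an absolute deadline.
function deadlineAfterMs(ms, now = Date.now()) {
  return new Date(now + ms);
}

// Usage with a generated grpc-js client (illustrative):
// client.someUnaryCall(request, { deadline: deadlineAfterMs(5000) }, (err, res) => {
//   if (err && err.code === grpc.status.DEADLINE_EXCEEDED) {
//     // the call was cancelled after 5s instead of hanging forever
//   }
// });
```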
> I don't understand the relevance of any part of that scenario other than "the subchannel reconnecting hits an issue and hangs forever without MIN_CONNECT_TIMEOUT". That is the primary reason for having this timer.
Right, you totally got the relevant piece.
The other steps are just there to describe a scenario where the MIN_CONNECT_TIMEOUT timer fixes the issue but a static check (e.g. waitForReady) doesn't.