
maxRPCTimeout value is too long

Open mastermanu opened this issue 4 years ago • 4 comments

When creating a gRPC context for a call, we take the existing context deadline, calculate the remaining time, divide by two, and then either round up the RPC timeout to a minimum (currently 1 second) or round it down to a maximum (currently 10 seconds).
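Roughly, the logic described above looks like the following sketch (the constant names, exact rounding, and values are illustrative, not the SDK's actual code):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Hypothetical constants mirroring the values described in this issue.
const (
	minRPCTimeout = 1 * time.Second
	maxRPCTimeout = 10 * time.Second
)

// rpcTimeout sketches the calculation: take the remaining time on the
// caller's context, halve it, then clamp the result to
// [minRPCTimeout, maxRPCTimeout].
func rpcTimeout(ctx context.Context) time.Duration {
	timeout := maxRPCTimeout
	if deadline, ok := ctx.Deadline(); ok {
		timeout = time.Until(deadline) / 2
	}
	if timeout < minRPCTimeout {
		timeout = minRPCTimeout
	}
	if timeout > maxRPCTimeout {
		timeout = maxRPCTimeout
	}
	return timeout
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	fmt.Println(rpcTimeout(ctx)) // ~10s: half of 30s, capped at the 10s maximum
}
```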

The problem is that with a 10-second timeout, if the gRPC connection is uncleanly severed while a request is in flight, it will take the full 10 seconds before the call fails and is retried. For long-poll requests this is even worse, with a 70-second timeout. As a result, SDK users can see spikes in overall latency if the Temporal Cluster they are connecting to is undergoing an upgrade or being recycled.

We should give the user a few ways to minimize this exposure depending on their scenario. Options include (but are not limited to):

  1. Reduce the maximum gRPC timeout from 10 seconds to something smaller (though this could hurt scenarios where the customer is sending or receiving large payloads over a low-bandwidth connection).
  2. Let the user specify the maximum gRPC request timeout so they can make the trade-off themselves (see the sketch after this list).
  3. Calculate timeouts more intelligently, or consider decoupling the RPC timeout from the request's context deadline.
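As a rough sketch of option 2, a configurable cap could look something like the following. The `RPCTimeoutOptions` struct, field name, and helper are hypothetical rather than the SDK's actual surface; the point is simply that an unset value keeps today's 10-second default:

```go
package main

import (
	"fmt"
	"time"
)

// RPCTimeoutOptions is a hypothetical options struct. MaxRPCTimeout caps the
// per-call gRPC timeout; zero means "keep the current 10s default".
type RPCTimeoutOptions struct {
	MaxRPCTimeout time.Duration
}

// effectiveMaxTimeout returns the configured cap, falling back to the existing
// default so behavior is unchanged for users who don't set it.
func effectiveMaxTimeout(opts RPCTimeoutOptions) time.Duration {
	if opts.MaxRPCTimeout > 0 {
		return opts.MaxRPCTimeout
	}
	return 10 * time.Second
}

func main() {
	// A user on a fast, reliable network opts into failing faster.
	fmt.Println(effectiveMaxTimeout(RPCTimeoutOptions{MaxRPCTimeout: 3 * time.Second})) // 3s
	fmt.Println(effectiveMaxTimeout(RPCTimeoutOptions{}))                               // 10s default
}
```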

mastermanu · May 14 '21 19:05

Do we know what our p99.9 is for cloud service calls (excluding long polls)? Maybe we should adjust the default and set it to a multiple of that as an upper limit for all non-long-poll calls. But then, as you've mentioned, we may start timing out calls with larger payloads.

I'm not a big fan of being too smart with timeouts and trying to calculate them on the fly based on payload sizes or the like, as it can result in unexpected behavior that is difficult to debug. On one hand, making maxRPCTimeout configurable could solve the large-payload problem and allow us to lower the default, but on the other, if users hit the limit often enough they would start overriding it back to higher values, which would bring us back to square one.

For long polls I don't think we can do much better than a keepalive check, since we want those timeouts to be as long as they are. The better question, I think, is why we would abruptly drop connections during maintenance on the server. Isn't it possible to do a graceful shutdown of a node instead? Then we would avoid this problem in the first place.
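On the keepalive point: client-side gRPC keepalive is what lets a long poll on a dead connection fail after roughly Time + Timeout instead of waiting out the 70-second deadline. A minimal sketch, assuming a plain grpc-go dial (the target, credentials, and values are illustrative, not the SDK's configuration):

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// If the server stops answering pings, in-flight calls on the dead
	// connection fail after Time+Timeout rather than their full RPC deadline.
	kacp := keepalive.ClientParameters{
		Time:                10 * time.Second, // ping when the connection has been idle this long
		Timeout:             5 * time.Second,  // wait this long for a ping ack before closing the connection
		PermitWithoutStream: true,             // send pings even when no RPCs are active
	}

	conn, err := grpc.Dial(
		"localhost:7233", // placeholder target (Temporal's default frontend port)
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(kacp),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
}
```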

vitarb · May 27 '21 05:05

Agree that we should try our best to make sure node shutdowns are graceful, but that can never truly be guaranteed. Also agree that smart logic for timeouts is ugly. Will get back to you on the p99 latency, but at least letting maxRPCTimeout be configurable (without changing the current default) might be an okay option for now?

mastermanu · May 27 '21 23:05