GrpcRetryer fails without retrying request if "getServiceCapabilities" request fails
Expected Behavior
The request to the Temporal service is retried according to RpcRetryOptions.
Actual Behavior
See https://github.com/temporalio/sdk-java/blob/2b05f073e4b45a79cb447fdeae6ef8f704ee1423/temporal-serviceclient/src/main/java/io/temporal/internal/retryer/GrpcRetryer.java#L60. If serviceCapabilities.get() fails (e.g. it is not initialized yet and there is a network issue or whatever), the actual request under retry is not retried according to the retry policy; it fails immediately instead.
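In rough terms, the pattern being described looks like the hedged sketch below (hypothetical names and a simplified retry loop, not the actual GrpcRetryer source): the capabilities lookup sits outside the retry loop, so a failure there escapes immediately and the wrapped request is never attempted.

```java
import java.time.Duration;
import java.util.function.Supplier;

// Illustrative sketch only -- hypothetical names, not the actual GrpcRetryer source.
public class RetryerSketch {
  // Stands in for the lazily resolved get-system-info result.
  private final Supplier<String> serviceCapabilities;

  public RetryerSketch(Supplier<String> serviceCapabilities) {
    this.serviceCapabilities = serviceCapabilities;
  }

  public <R> R retryWithResult(Supplier<R> request, int maxAttempts, Duration backoff)
      throws InterruptedException {
    // Resolved outside the retry loop: if this throws (server temporarily unhealthy),
    // the exception propagates right here and the request below is never attempted.
    // In the real code the result is only used to read the "differentiate internal
    // errors" flag.
    String capabilities = serviceCapabilities.get();

    RuntimeException lastFailure = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return request.get(); // only this call is protected by the retry policy
      } catch (RuntimeException e) {
        lastFailure = e;
        if (attempt < maxAttempts) {
          Thread.sleep(backoff.toMillis());
        }
      }
    }
    throw lastFailure != null ? lastFailure : new IllegalStateException("maxAttempts < 1");
  }
}
```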
Steps to Reproduce the Problem
- Create a brand new WorkflowClient pointed at a Temporal server that returns an error for the system info request (e.g. HTTP 503) - basically any response that is not successful or UNIMPLEMENTED in gRPC terms
- Send any command
- The request fails immediately without retrying (see the sketch below)
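A minimal reproduction sketch, assuming a lazily connecting stub; the target, task queue, and the MyWorkflow interface are placeholders introduced only for illustration:

```java
import io.temporal.client.WorkflowClient;
import io.temporal.client.WorkflowOptions;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;

public class Repro {
  // Placeholder workflow interface, only needed to build a typed stub.
  @WorkflowInterface
  public interface MyWorkflow {
    @WorkflowMethod
    void run();
  }

  public static void main(String[] args) {
    // Lazily connecting stubs: nothing talks to the server yet.
    WorkflowServiceStubs service =
        WorkflowServiceStubs.newServiceStubs(
            WorkflowServiceStubsOptions.newBuilder()
                // Point at a server that answers the system info request with an error
                // (e.g. HTTP 503) rather than success or UNIMPLEMENTED.
                .setTarget("temporal.example.com:7233")
                .build());
    WorkflowClient client = WorkflowClient.newInstance(service);

    MyWorkflow workflow =
        client.newWorkflowStub(
            MyWorkflow.class,
            WorkflowOptions.newBuilder().setTaskQueue("my-task-queue").build());

    // The first command triggers the lazy connection; when the system info call
    // fails, this fails immediately instead of being retried per RpcRetryOptions.
    WorkflowClient.start(workflow::run);
  }
}
```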
Specifications
temporal-sdk
- Version: 1.22.3
I believe this is intentional, because this is basically a connect-time call that should fail the connect attempt, not hang it. I think we want connection failures (where we define connection as TCP connect + this call) to fail immediately.
It is weird, though. If you track down the usage, it only affects the error case, and the result of this call is only used to read the "differentiate internal errors" flag. However, a failure in this request prevents the real request from being retried properly, when it would otherwise have been retried and recovered.
It is really less about whether "systemCapabilities.get()" itself is retried, and more about the "real" request being retried according to RpcRetryOptions.
the result of this call is only used to read the "differentiate internal errors" flag
Not exactly. This call replaced our traditional gRPC health check call, which means it also eagerly checks connectivity and auth. It just so happens that the result is a set of capabilities (which is why we moved away from the gRPC health check: we wanted to provide more information on the eager RPC-to-check-connectivity call).
We use this call as a health/connectivity check.
It makes sense that it is a health/connectivity check, but my point is that even if the service is temporarily unhealthy (perfectly normal in the real world of networking, pod restarts, and so on), it is expected that the workflow client call is retried according to the RPC retry options. I.e., I know the service could be temporarily down, and I want the workflow start call to be retried for a minute before giving up. A connectivity check failure prevents exactly that, making the retry options useless.
it is expected that the workflow client call is retried according to the RPC retry options
Correct, the actual client call should be, but the connection doesn't retry based on those options, and the health call is part of connecting. I think this may be due to lazy connectivity. If you want to separate the connection from the first call made with a client, you can use WorkflowServiceStubs.newConnectedServiceStubs. This guarantees that your workflow client calls don't lazily attempt connectivity (which is not a retryable thing).
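Something like this hedged sketch (the target is a placeholder):

```java
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;
import java.time.Duration;

// Sketch: connect eagerly (TCP connect + get-system-info) up front, bounded by a
// timeout, so later WorkflowClient calls map 1:1 onto RPCs covered by RpcRetryOptions.
public class EagerConnect {
  public static void main(String[] args) {
    WorkflowServiceStubs service =
        WorkflowServiceStubs.newConnectedServiceStubs(
            WorkflowServiceStubsOptions.newBuilder()
                .setTarget("temporal.example.com:7233") // placeholder target
                .build(),
            Duration.ofSeconds(10)); // fails fast here if the server is unreachable
    // ... build a WorkflowClient on top of `service` as usual ...
    service.shutdown();
  }
}
```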
Sure, this would have solved my narrow issue. But it creates inconsistent behavior, where eagerly created WorkflowServiceStubs.newConnectedServiceStubs are getting retried on instantiation in a good way, while lazy ones just fire the system info request once at a random time and give up.
WorkflowServiceStubs.newConnectedServiceStubs are getting retried on instantiation
From what I understand, they aren't getting retried on instantiation. I don't believe the connection (which is connect + get-system-info) is ever retried, only the post-connection RPC calls. Eager connection just shifts where the connection happens, but the behavior is the same. We want the connection to succeed or fail immediately without retry, regardless of whether it's eager or lazy. Sorry if I am misunderstanding.
First of all, thank you for engaging with me - I appreciate your time.
Let's try to look at this issue from the end user's perspective - maybe I have the wrong expectations.
WorkflowServiceStubsOptions.setRpcRetryOptions
Allows customization of retry options for the outgoing RPC calls to temporal service.
Note that default values should be reasonable for most users, be cautious when changing these values as it may result in increased load to the temporal backend or bad network instability tolerance.
Defaults are:
- Retries are limited by the maximum period of 1 minute
- Initial period between retries: 50ms
- Exponential Backoff Coefficient (exponential rate) for the retry period is 2
Reading this, the purpose of configuring retries is to make WorkflowClient service calls reliable. Presumably, it is to mitigate intermittent issues that might occur during an RPC request to Temporal. Is that a correct understanding?
If so, failures might occur when the target Temporal instance is temporarily unreachable/unhealthy, and thus they would be automatically retried according to RpcRetryOptions to overcome the issue, right?
However, what really happens when the service is temporarily unreachable/unhealthy is that the request fails because the pre-connection health request fails. That makes no sense, since the service being temporarily unreachable/unhealthy is exactly the problem that RPC retry is supposed to solve.
The following is IMO: it should not matter whether the pre-connection or post-connection part of the request is failing - from the usage perspective those are inseparable and should be treated as such when the retry is applied.
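For concreteness, a rough sketch of the setup I have in mind (the target and the exact values are illustrative only):

```java
import io.temporal.serviceclient.RpcRetryOptions;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;
import java.time.Duration;

// Illustrative only: stubs options with explicit RPC retry settings matching the
// documented defaults. The expectation is that a workflow start issued through a
// client built on these stubs keeps retrying for up to a minute when the server is
// temporarily unhealthy, whether the failure happens before or after the lazy
// connection is established.
public class RetryOptionsSketch {
  public static WorkflowServiceStubsOptions stubsOptions() {
    return WorkflowServiceStubsOptions.newBuilder()
        .setTarget("temporal.example.com:7233") // placeholder target
        .setRpcRetryOptions(
            RpcRetryOptions.newBuilder()
                .setInitialInterval(Duration.ofMillis(50))
                .setBackoffCoefficient(2.0)
                .setExpiration(Duration.ofMinutes(1))
                .build())
        .build();
  }
}
```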
First of all, thank you for engaging with me - I appreciate your time.
No prob! Also feel free to join us on #java-sdk on https://t.mp/slack.
the purpose of configuring retries is to make WorkflowClient service calls reliable. Presumably, it is to mitigate intermittent issues that might occur during an RPC request to Temporal. Is that a correct understanding?
Yes
If so, failures might occur when the target Temporal instance is temporarily unreachable/unhealthy, and thus they would be automatically retried according to RpcRetryOptions to overcome the issue, right?
For the actual RPC call, yes, but not for the first connection to the service (which includes the get-system-info call). For a connection failure of an already established client, definitely.
However, what really happens when the service is temporarily unreachable/unhealthy is that the request fails because the pre-connection health request fails. That makes no sense, since the service being temporarily unreachable/unhealthy is exactly the problem that RPC retry is supposed to solve.
Only if you choose to combine RPC calls with the initial connection. If you have already made the initial connection, then RPC calls are just RPC calls. RPC retry is only for RPC calls, not for the initial connection. Initial connection in this case is defined as "open TCP + get-system-info call".
The following is IMO: it should not matter whether the pre-connection or post-connection part of the request is failing - from the usage perspective those are inseparable and should be treated as such when the retry is applied.
For many users it would be unreasonable not to fail eagerly when initially connecting (host unreachable, auth failure, etc.). Imagine writing a Java CLI that takes several seconds to fail for an unreachable host because the initial connection is retried. But what you say makes perfect sense for failures after the initial connection.
The retry applies to the single gRPC request the high-level client call is wrapping, not to the entire high-level client call, which may lazily connect and make the get-system-info call. If you want the gRPC request to be 1:1 with the high-level client call (and therefore the retry to apply to the high-level client call), connect eagerly, thereby ensuring there is no conn + get-system-info call done lazily.
Imagine writing a Java CLI
Funny you say that, as it is exactly the thing I am solving for. My CLI app runs as part of a longer pipeline and is meant to connect to Temporal, send a command (start a workflow), and exit. If the Temporal server is unavailable, it is going to fail the "lazy" request. So it looks like the only real option for CLI apps is to always use an eager connection with a decent timeout, as the alternative is just not reliable.
So it looks like the only real option for CLI apps is to always use an eager connection with a decent timeout, as the alternative is just not reliable.
Definitely. In fact, I would always suggest using an eager connection (and in every one of our SDKs besides Java, that is strongly preferred). This will help you differentiate between connection failure and RPC failure.
Also note that for clients whose connection is already established and then hit some form of connection failure, the connection is often reestablished internally by gRPC as part of the RPC call, so from the RPC caller's POV it does perform standard retries. It's just the very first connection that does not retry.
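For reference, a hedged sketch of the eager-connection CLI pattern discussed above (the target, task queue, and workflow interface are placeholders; treat it as an illustration, not a definitive implementation):

```java
import io.temporal.client.WorkflowClient;
import io.temporal.client.WorkflowOptions;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import java.time.Duration;

public final class StartWorkflowCli {
  // Placeholder workflow interface, only needed to build a typed stub.
  @WorkflowInterface
  public interface MyWorkflow {
    @WorkflowMethod
    void run();
  }

  public static void main(String[] args) {
    WorkflowServiceStubs service;
    try {
      // Eager connect: TCP connect + get-system-info, bounded by the timeout below.
      service =
          WorkflowServiceStubs.newConnectedServiceStubs(
              WorkflowServiceStubsOptions.newBuilder()
                  .setTarget("temporal.example.com:7233") // placeholder target
                  .build(),
              Duration.ofSeconds(30));
    } catch (RuntimeException e) {
      System.err.println("Could not connect to Temporal: " + e.getMessage());
      System.exit(2);
      return;
    }

    try {
      WorkflowClient client = WorkflowClient.newInstance(service);
      MyWorkflow workflow =
          client.newWorkflowStub(
              MyWorkflow.class,
              WorkflowOptions.newBuilder().setTaskQueue("my-task-queue").build());
      // This start is a plain RPC on an already-connected client, so it is covered
      // by RpcRetryOptions (roughly one minute of retries by default).
      WorkflowClient.start(workflow::run);
    } catch (RuntimeException e) {
      System.err.println("Workflow start failed after retries: " + e.getMessage());
      System.exit(1);
    } finally {
      service.shutdown();
    }
  }
}
```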