grpc-java icon indicating copy to clipboard operation
grpc-java copied to clipboard

Transparent retry should be per-hedge/retry

Open ejona86 opened this issue 6 years ago • 9 comments

Transparent retry is limited to 1 attempt. It seems like that one attempt should be per-hedge and per-(higher-level) retry. In the current implementation transparent retry is limited to 1 attempt per RPC.

We should discuss whether the "scope" of the "1 attempt" and see if we could maybe even change the retry spec.

ejona86 avatar Nov 20 '18 21:11 ejona86

It seems like that one attempt should be per-hedge and per-(higher-level) retry.

I disagree with this because we can not tell if two hedging attempts are connecting to the same backend, in this case if the first hedging attempt receives a non-fatal status from the server's application logic, and after that the second hedging attempt fails with a "transparent-retryable" failure, then it does not make sense to do transparent retry for the second attempt because the backend's application logic has already run.

dapengzhang0 avatar Nov 20 '18 21:11 dapengzhang0

in this case if the first hedging attempt receives a non-fatal status from the server's application logic, and after that the second hedging attempt fails with a "transparent-retryable" failure, then it does not make sense to do transparent retry for the second attempt because the backend's application logic has already run.

That seems to be a normal case of "once we commit, kill all other streams."

ejona86 avatar Nov 20 '18 21:11 ejona86

in this case if the first hedging attempt receives a non-fatal status from the server's application logic, and after that the second hedging attempt fails with a "transparent-retryable" failure, then it does not make sense to do transparent retry for the second attempt because the backend's application logic has already run.

Also, that argument would apply today. There one only one transparent retry mentioned in your example.

ejona86 avatar Nov 20 '18 21:11 ejona86

There are two levels of commitment:

  1. reached the application logic of the backend - no more transparent retry, but normal retry or hedging is allowed.
  2. committed - no more retry or hedging allowed. Once we commit, kill all other streams.

I was talking about level 1.

For level 1. commitment, we need a global flag noMoreTransparentRetry to monitor all hedges based on my argument, regardless of if transparent retry is limited to 1 attempt per hedge. Given that we still need this global flag, what's the benefit of making the change?

dapengzhang0 avatar Nov 20 '18 22:11 dapengzhang0

I don't think (1) exists quite like that. I don't think we need the global flag noMoreTransparentRetry. What is it protecting against?

Transparent retry is transparent because the application isn't really aware it happens. If it is wrong to do a transparent retry, then the initial RPC was wrong.

ejona86 avatar Nov 20 '18 22:11 ejona86

Another way of viewing transparent retry: transparent retry appears to the application as if there was a sleep() before doing the RPC. That is, it will only be visible as latency. That is still true, even if the RPC we issued was a retry/hedge.

ejona86 avatar Nov 20 '18 22:11 ejona86

What is it protecting against?

I think it's as the spec says "This extra caution is needed because this case involves extra load on the wire". So we only allow at most (maxAttempts + 1) times of load on the wire. Otherwise, it may be doubled (maxAttempts * 2)

dapengzhang0 avatar Nov 20 '18 22:11 dapengzhang0

I think there was a debate on limit of transparent retries and then the spec got updated to the current version. + @markdroth @ericgribkoff

dapengzhang0 avatar Nov 20 '18 22:11 dapengzhang0

Now we treat local-only transparent retry unlimited; and otherwise transparent-retry once, but the question is per-hedge or per overall RPC.

dapengzhang0 avatar Feb 18 '22 21:02 dapengzhang0