grpc-java
grpc-java copied to clipboard
Transparent retry should be per-hedge/retry
Transparent retry is limited to 1 attempt. It seems like that one attempt should be per-hedge and per-(higher-level) retry. In the current implementation transparent retry is limited to 1 attempt per RPC.
We should discuss whether the "scope" of the "1 attempt" and see if we could maybe even change the retry spec.
It seems like that one attempt should be per-hedge and per-(higher-level) retry.
I disagree with this because we can not tell if two hedging attempts are connecting to the same backend, in this case if the first hedging attempt receives a non-fatal status from the server's application logic, and after that the second hedging attempt fails with a "transparent-retryable" failure, then it does not make sense to do transparent retry for the second attempt because the backend's application logic has already run.
in this case if the first hedging attempt receives a non-fatal status from the server's application logic, and after that the second hedging attempt fails with a "transparent-retryable" failure, then it does not make sense to do transparent retry for the second attempt because the backend's application logic has already run.
That seems to be a normal case of "once we commit, kill all other streams."
in this case if the first hedging attempt receives a non-fatal status from the server's application logic, and after that the second hedging attempt fails with a "transparent-retryable" failure, then it does not make sense to do transparent retry for the second attempt because the backend's application logic has already run.
Also, that argument would apply today. There one only one transparent retry mentioned in your example.
There are two levels of commitment:
- reached the application logic of the backend - no more transparent retry, but normal retry or hedging is allowed.
committed- no more retry or hedging allowed. Once we commit, kill all other streams.
I was talking about level 1.
For level 1. commitment, we need a global flag noMoreTransparentRetry to monitor all hedges based on my argument, regardless of if transparent retry is limited to 1 attempt per hedge. Given that we still need this global flag, what's the benefit of making the change?
I don't think (1) exists quite like that. I don't think we need the global flag noMoreTransparentRetry. What is it protecting against?
Transparent retry is transparent because the application isn't really aware it happens. If it is wrong to do a transparent retry, then the initial RPC was wrong.
Another way of viewing transparent retry: transparent retry appears to the application as if there was a sleep() before doing the RPC. That is, it will only be visible as latency. That is still true, even if the RPC we issued was a retry/hedge.
What is it protecting against?
I think it's as the spec says "This extra caution is needed because this case involves extra load on the wire". So we only allow at most (maxAttempts + 1) times of load on the wire. Otherwise, it may be doubled (maxAttempts * 2)
I think there was a debate on limit of transparent retries and then the spec got updated to the current version. + @markdroth @ericgribkoff
Now we treat local-only transparent retry unlimited; and otherwise transparent-retry once, but the question is per-hedge or per overall RPC.