gax-java
gax-java copied to clipboard
Poor deadline semantics when retries are enabled
If a client is configured with DEADLINE_EXCEEDED as a retryable error and the user sets a deadline using ApiCallContext#withTimeout. Then when user's deadline is met, all retry attempts will failed locally with a DEADLINE_EXCEEDED error, but the retry mechanism will continue trying with exponential backoff.
It would be better if the timeout in the ApiCallContext was re-purposed as the totalTimeout instead of a per rpc deadline
@garrettjonesgoogle I have flagged this as a feature request, but it could also be considered a bug - can you take a look and assign someone?
I believe that the timeout in ApiCallContext is really just meant for the timeout of a single attempt - @vam-google can clarify.
There are a few aspects to this
-
Bug 1: currently AttemptCallable sets the rpc timeout from retry settings onto the call context as a timeout (which will be set as a deadline on the underlying grpc context). The problem is that deadlines are absolute and GrpcCallContext will only decrease timeouts.For example if the current time is 0 and retry settings are configured to have a constant 10ms rpc deadline. When the first attempt is executed, it will set the request to timeout at 10. That attempt fails after 5 ms and the second attempt is delayed by 5ms. The second attempt runs at 10. AttempCallable will try to set a timeout of 10ms, but that will be rejected because that deadline (20) comes after the current deadline (10).
-
Bug 2: when the current deadline is reached, the retry logic completely ignores it and will continue to schedule retries that will be locally killed with a DEADLINE_EXCEEDED error. This is an extension of bug 1. But it can be triggered both by Bug 1 or a user setting a timeout on the api context
-
Feature request: @garrettjonesgoogle is right that the deadline in ApiCallContext is treated like an rpc deadline. But I'm proposing changing that behavior. The retry logic is generally transparent to the users, caller doesn't know if the callable they are dealing with is retryable or not. When a user calls a method and sets a deadline, they are trying to satisfy some requirement of their application at that callsite. They don't actually care about how long each attempt took. Also, changing this to be an overall deadline would make it more consistent with grpc in general. I think sticking to grpc semantics of deadlines will make things a lot easier to reason about
Minor updates: Bug 1+2 were fixed in #592.
When implementing the feature request, care should be taken to limit the rpc timeout to the remaining overall timeout. This will allow users to simply set the overall timeout and not worry about the rpc timeouts or hung rpcs overrunning the overall timeout
@igorbernstein2 As a followup from #1133, I am wondering if it makes sense to allow specifying RetrySettings in ApiCallContext/GrpcCallContext?
I think it would be nice to control retry behavior per call. However I'm not sure if the entirety of RetrySettings should be exposed. I think there are currently some efforts to push retries down into the grpc layer. @vam-google can you comment on this?