feat: support request hedging
[!NOTE]
This PR is still a draft. This includes this description.
Motivation:
This PR enables RetryingClient and RetryingRpcClient to perform request hedging.
What is request hedging?
Detailed explanations can be found in the paper The Tail at Scale 1 and the section for request hedging in the gRPC documentation 2.
How does this implementation compare to the gRPC spec?
This implementation of request hedging tries to follow the semantics defined by gRPC as close as possible. As such, let us take the gRPC request hedging spec 4 as a baseline and discuss how it does (or does not) implement it. For that it makes sense to study the spec first before continuing reading.
Configuration
gRPC drives request hedging via its service config in which one can specify a so-called hedgingPolicy. It consists of three fields:
maxAttempts
maximum number of in-flight requests while waiting for a successful response. This is a mandatory field, and must be specified. If the specified value is greater than 5, gRPC uses a value of 5.
Limiting the number of attempts is still supported by RetryConfig.maxTotalAttempts(). This hedging mechanism respects that limit. However, we are not imposing an absolute limit on the number of possible attempts like gRPC.
hedgingDelay
amount of time that needs to elapse before the client sends out the next request while waiting for a successful response. This field is optional, and if left unspecified, results in maxAttempts number of requests all sent out at the same time.
We allow specifying this via hedgingDelay and hedgingDelayMillis in RetryConfigBuilder/ RetryConfig. The difference here is that we are not going to send all requests at once in case this option is unspecified but instead assume that the user wants to use the standard retry policy instead of hedging.
nonFatalStatusCodes
an optional list of grpc status codes. If any of hedged requests fails with a status code that is not present in this list, all outstanding requests are canceled and the response is returned to the application.
The pre-existing RetryRules are able to model this behaviour (and more). More specifically, a user can use RetryRuleBuilder.onStatus to decide on whether to continue retrying. TODO: implement differentiation between NO_RETRY and NEXT.
Server Pushback
gRPC defines a metadata key, grpc-retry-pushback-ms 3, for the server to communicate to the client a minimum retry delay. In this implementation we continue to respect the standard Retry-After header to enforce minimal retry delays. In case of an unparsable or negative headers, the client is going to abort the retrying process completly, as specified by gRPC:
If the value for pushback is negative or unparseble, then it will be seen as the server asking the client not to retry at all.
Adding support for the grpc-retry-pushback-ms metadata key could be a follow-up.
Throttling
Modifications
RetryConfigBuilder, RetryConfig
To enable request hedging RetryConfigBuilder now offers two methods to specify the hedging delay:
hedgingDelay(Duration hedgingDelay)hedgingDelayMillis(long hedgingDelayMillis)
As mentioned above, setting this to zero is valid and makes the Retrying(Rpc)Client immediately push out all maxRequests and make it wait for one to finish. If not set, hedging is disabled and Retrying(Rpc)Client behaves like the RetryingClient before this change. Negative values are illegal.
RetryScheduler
With request hedging, Retrying(Rpc)Client needs to handle concurrent retry attempts
trying to schedule retry tasks with different backoff delays and (optional) minimal delays from servers (see RETRY_AFTER HTTP header).
To handle this complexity and improve testability, I introduced RetryScheduler. It is used behind the scenes by AbstractRetryingClient to keep track of the next retry task to be executed while enforcing scheduling constraints given by the minimal delays given from the servers (earliestRetryTimeNanos) and the attempt/response timeout (latestNextRetryTimeNanos).
With that it takes care of retry tasks "overtaking" each other when a retry rule specifies a shorter delay than the currently scheduled task, and rescheduling when a server reports a longer minimum delay than the one already scheduled.
AbstractRetryingClient
TODO
RetryingClient
TODO
RetryingRpcClient
TODO
DefaultRequestLog / RequestLogBuilder
Added endResponseWithChild(RequestLogAccess child) to be able to end the request log with any child log and not only with the one from the last retry attempt. This is necessary a hedged request can still complete after the (hedged) requests started before it.
TODOs/ Open Questions
- [ ] Decide with the maintainers on the (breaking) API changes in
RetryRule,RetryConfig, andAbstractRetryingClient. - [ ] Separate ticket for introducing a
RetryingContexttoRetryingClient. It is supposed to bundle the huge amount of arguments that are passed inside the methods ofRetryingClient. - [ ] Decide with maintainers on follow-ups:
- Should supporting
grpc-retry-pushback-msbe a follow-up? - [ ] Add missing Javadoc
- [ ] Add missing docs
Result:
- Closes #5200
- Describe the consequences that a user will face after this PR is merged.
🔍 Build Scan® (commit: 9c1eccad4f3e8193242af9b147d5ef065b17bf4e)
| Job name | Status | Build Scan® |
|---|---|---|
| build-ubicloud-standard-16-jdk-8 | ❌ (failure) | https://ge.armeria.dev/s/52o5im5qioljk |
| build-ubicloud-standard-16-jdk-21-snapshot-blockhound | ❌ (failure) | https://ge.armeria.dev/s/h2jbo6jcoba2i |
| build-ubicloud-standard-16-jdk-17-min-java-17-coverage | ❌ (failure) | https://ge.armeria.dev/s/46u57fq5vvvt6 |
| build-ubicloud-standard-16-jdk-17-min-java-11 | ❌ (failure) | https://ge.armeria.dev/s/i2koucx7cc6wu |
| build-ubicloud-standard-16-jdk-17-leak | ❌ (failure) | https://ge.armeria.dev/s/nhpe3ffpx64s6 |
| build-ubicloud-standard-16-jdk-11 | ❌ (failure) | https://ge.armeria.dev/s/gxwpfgssg3np6 |
| build-macos-latest-jdk-21 | ❌ (failure) | https://ge.armeria.dev/s/wkyyphusnkzci |
Codecov Report
Attention: Patch coverage is 82.13802% with 132 lines in your changes missing coverage. Please review.
Project coverage is 74.65%. Comparing base (
8150425) to head (7927a92). Report is 104 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #6252 +/- ##
============================================
+ Coverage 74.46% 74.65% +0.18%
- Complexity 22234 22538 +304
============================================
Files 1963 1987 +24
Lines 82437 83520 +1083
Branches 10764 10859 +95
============================================
+ Hits 61385 62350 +965
- Misses 15918 15961 +43
- Partials 5134 5209 +75
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
- :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.
Hi team, while I’m squashing the last bugs, I’d love to hear what you think about this PR. First, it would be interesting to know if:
- you're fine with enabling hedging via
hedgingDelay(Millis)on aRetryConfig - you have suggestions for breaking this PR into smaller pieces (if you prefer that)
- you're fine with the changes to
AbstractRetryingClient, and what you think about making it private