armeria icon indicating copy to clipboard operation
armeria copied to clipboard

feat: support request hedging

Open schiemon opened this issue 7 months ago • 1 comments

[!NOTE]
This PR is still a draft. This includes this description.

Motivation:

This PR enables RetryingClient and RetryingRpcClient to perform request hedging.

What is request hedging?

Detailed explanations can be found in the paper The Tail at Scale 1 and the section for request hedging in the gRPC documentation 2.

How does this implementation compare to the gRPC spec?

This implementation of request hedging tries to follow the semantics defined by gRPC as close as possible. As such, let us take the gRPC request hedging spec 4 as a baseline and discuss how it does (or does not) implement it. For that it makes sense to study the spec first before continuing reading.

Configuration

gRPC drives request hedging via its service config in which one can specify a so-called hedgingPolicy. It consists of three fields:

  • maxAttempts

maximum number of in-flight requests while waiting for a successful response. This is a mandatory field, and must be specified. If the specified value is greater than 5, gRPC uses a value of 5.

Limiting the number of attempts is still supported by RetryConfig.maxTotalAttempts(). This hedging mechanism respects that limit. However, we are not imposing an absolute limit on the number of possible attempts like gRPC.

  • hedgingDelay

amount of time that needs to elapse before the client sends out the next request while waiting for a successful response. This field is optional, and if left unspecified, results in maxAttempts number of requests all sent out at the same time.

We allow specifying this via hedgingDelay and hedgingDelayMillis in RetryConfigBuilder/ RetryConfig. The difference here is that we are not going to send all requests at once in case this option is unspecified but instead assume that the user wants to use the standard retry policy instead of hedging.

  • nonFatalStatusCodes

an optional list of grpc status codes. If any of hedged requests fails with a status code that is not present in this list, all outstanding requests are canceled and the response is returned to the application.

The pre-existing RetryRules are able to model this behaviour (and more). More specifically, a user can use RetryRuleBuilder.onStatus to decide on whether to continue retrying. TODO: implement differentiation between NO_RETRY and NEXT.

Server Pushback

gRPC defines a metadata key, grpc-retry-pushback-ms 3, for the server to communicate to the client a minimum retry delay. In this implementation we continue to respect the standard Retry-After header to enforce minimal retry delays. In case of an unparsable or negative headers, the client is going to abort the retrying process completly, as specified by gRPC:

If the value for pushback is negative or unparseble, then it will be seen as the server asking the client not to retry at all.

Adding support for the grpc-retry-pushback-ms metadata key could be a follow-up.

Throttling

Modifications

RetryConfigBuilder, RetryConfig

To enable request hedging RetryConfigBuilder now offers two methods to specify the hedging delay:

  • hedgingDelay(Duration hedgingDelay)
  • hedgingDelayMillis(long hedgingDelayMillis)

As mentioned above, setting this to zero is valid and makes the Retrying(Rpc)Client immediately push out all maxRequests and make it wait for one to finish. If not set, hedging is disabled and Retrying(Rpc)Client behaves like the RetryingClient before this change. Negative values are illegal.

RetryScheduler

With request hedging, Retrying(Rpc)Client needs to handle concurrent retry attempts trying to schedule retry tasks with different backoff delays and (optional) minimal delays from servers (see RETRY_AFTER HTTP header). To handle this complexity and improve testability, I introduced RetryScheduler. It is used behind the scenes by AbstractRetryingClient to keep track of the next retry task to be executed while enforcing scheduling constraints given by the minimal delays given from the servers (earliestRetryTimeNanos) and the attempt/response timeout (latestNextRetryTimeNanos). With that it takes care of retry tasks "overtaking" each other when a retry rule specifies a shorter delay than the currently scheduled task, and rescheduling when a server reports a longer minimum delay than the one already scheduled.

AbstractRetryingClient

TODO

RetryingClient

TODO

RetryingRpcClient

TODO

DefaultRequestLog / RequestLogBuilder

Added endResponseWithChild(RequestLogAccess child) to be able to end the request log with any child log and not only with the one from the last retry attempt. This is necessary a hedged request can still complete after the (hedged) requests started before it.

TODOs/ Open Questions

  • [ ] Decide with the maintainers on the (breaking) API changes in RetryRule, RetryConfig, and AbstractRetryingClient.
  • [ ] Separate ticket for introducing a RetryingContext to RetryingClient. It is supposed to bundle the huge amount of arguments that are passed inside the methods of RetryingClient.
  • [ ] Decide with maintainers on follow-ups: - Should supporting grpc-retry-pushback-ms be a follow-up?
  • [ ] Add missing Javadoc
  • [ ] Add missing docs

Result:

  • Closes #5200
  • Describe the consequences that a user will face after this PR is merged.

schiemon avatar May 23 '25 13:05 schiemon

🔍 Build Scan® (commit: 9c1eccad4f3e8193242af9b147d5ef065b17bf4e)

Job name Status Build Scan®
build-ubicloud-standard-16-jdk-8 ❌ (failure) https://ge.armeria.dev/s/52o5im5qioljk
build-ubicloud-standard-16-jdk-21-snapshot-blockhound ❌ (failure) https://ge.armeria.dev/s/h2jbo6jcoba2i
build-ubicloud-standard-16-jdk-17-min-java-17-coverage ❌ (failure) https://ge.armeria.dev/s/46u57fq5vvvt6
build-ubicloud-standard-16-jdk-17-min-java-11 ❌ (failure) https://ge.armeria.dev/s/i2koucx7cc6wu
build-ubicloud-standard-16-jdk-17-leak ❌ (failure) https://ge.armeria.dev/s/nhpe3ffpx64s6
build-ubicloud-standard-16-jdk-11 ❌ (failure) https://ge.armeria.dev/s/gxwpfgssg3np6
build-macos-latest-jdk-21 ❌ (failure) https://ge.armeria.dev/s/wkyyphusnkzci

github-actions[bot] avatar May 23 '25 14:05 github-actions[bot]

Codecov Report

Attention: Patch coverage is 82.13802% with 132 lines in your changes missing coverage. Please review.

Project coverage is 74.65%. Comparing base (8150425) to head (7927a92). Report is 104 commits behind head on main.

Files with missing lines Patch % Lines
...p/armeria/client/retry/AbstractRetryingClient.java 78.64% 25 Missing and 35 partials :warning:
.../linecorp/armeria/client/retry/RetryScheduler.java 78.20% 18 Missing and 16 partials :warning:
.../linecorp/armeria/client/retry/RetryingClient.java 89.17% 7 Missing and 10 partials :warning:
...ecorp/armeria/client/retry/RetryConfigBuilder.java 52.94% 7 Missing and 1 partial :warning:
...corp/armeria/common/logging/DefaultRequestLog.java 80.55% 3 Missing and 4 partials :warning:
...necorp/armeria/client/retry/RetryingRpcClient.java 90.47% 3 Missing and 3 partials :warning:
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #6252      +/-   ##
============================================
+ Coverage     74.46%   74.65%   +0.18%     
- Complexity    22234    22538     +304     
============================================
  Files          1963     1987      +24     
  Lines         82437    83520    +1083     
  Branches      10764    10859      +95     
============================================
+ Hits          61385    62350     +965     
- Misses        15918    15961      +43     
- Partials       5134     5209      +75     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

codecov[bot] avatar Jun 19 '25 18:06 codecov[bot]

Hi team, while I’m squashing the last bugs, I’d love to hear what you think about this PR. First, it would be interesting to know if:

  • you're fine with enabling hedging via hedgingDelay(Millis) on a RetryConfig
  • you have suggestions for breaking this PR into smaller pieces (if you prefer that)
  • you're fine with the changes to AbstractRetryingClient, and what you think about making it private

schiemon avatar Jun 20 '25 16:06 schiemon