
Add retry limiter support

Open friscoMad opened this issue 5 months ago • 4 comments

Motivation:

As we discussed on Discord, there is currently no proper support for retry limiting. This PR adds it along with two working retry limiter implementations: one based on the gRPC retry-throttling logic and another that uses a RateLimiter.

Modifications:

  • Added methods to set a RetryLimiter while building a RetryConfig (see the sketch below)
  • Hooked the retry limiter into RetryingClient
  • Created a RetryLimiter interface and a couple of implementations.
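
For illustration, a rough sketch of how the new hook might be wired up. The `retryLimiter(...)` builder method, the `RetryRateLimiter` constructor arguments, and its package are assumptions about this PR's API, not confirmed signatures; the rest is standard Armeria client setup.

```java
import com.linecorp.armeria.client.WebClient;
import com.linecorp.armeria.client.retry.RetryConfig;
import com.linecorp.armeria.client.retry.RetryRule;
import com.linecorp.armeria.client.retry.RetryingClient;
// Assumed location based on the other limiter classes in this PR.
import com.linecorp.armeria.client.retry.limiter.RetryRateLimiter;
import com.linecorp.armeria.common.HttpResponse;

// Hypothetical usage of the RetryLimiter hook added in this PR.
RetryConfig<HttpResponse> config =
        RetryConfig.builder(RetryRule.onServerErrorStatus())
                   .maxTotalAttempts(3)
                   // Assumed builder method from this PR; the exact name/signature may differ.
                   .retryLimiter(new RetryRateLimiter(/* retries per second */ 10.0))
                   .build();

WebClient client = WebClient.builder("https://example.com")
                            .decorator(RetryingClient.newDecorator(config))
                            .build();
```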

Result:

  • Closes #6282
  • There are changes to public APIs: the RetryConfig constructor and the protected getNextDelay methods in AbstractRetryingClient.
  • It allows clients to add retry throttling logic if needed.

friscoMad avatar Jul 21 '25 11:07 friscoMad

🔍 Build Scan® (commit: ed307b2003824fc049d89b6a3eba3ec3464755b9)

Job name Status Build Scan®

github-actions[bot] avatar Jul 21 '25 12:07 github-actions[bot]

Codecov Report

:x: Patch coverage is 96.49123% with 4 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 74.13%. Comparing base (8150425) to head (ed307b2).
:warning: Report is 201 commits behind head on main.

Files with missing lines Patch % Lines
...ecorp/armeria/client/retry/RetryConfigBuilder.java 75.00% 2 Missing :warning:
...armeria/client/retry/limiter/GrpcRetryLimiter.java 98.21% 0 Missing and 1 partial :warning:
...orp/armeria/client/retry/limiter/RetryLimiter.java 75.00% 1 Missing :warning:
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #6318      +/-   ##
============================================
- Coverage     74.46%   74.13%   -0.33%     
- Complexity    22234    23030     +796     
============================================
  Files          1963     2065     +102     
  Lines         82437    86225    +3788     
  Branches      10764    11326     +562     
============================================
+ Hits          61385    63924    +2539     
- Misses        15918    16882     +964     
- Partials       5134     5419     +285     

:umbrella: View full report in Codecov by Sentry.

codecov[bot] avatar Jul 21 '25 17:07 codecov[bot]

If we are accepting the breaking changes, then we should release this PR together with #6292 to avoid two successive breaking changes to the same API.

schiemon avatar Jul 24 '25 10:07 schiemon

Sorry about the late review; here are some rough thoughts I had while taking an initial look:

* `RetryConfig#requiresResponseTrailers` dictates whether to wait for trailers before deciding whether to retry. Given the current implementation, I'm unsure whether `GrpcRetryLimiter` will be able to consistently retrieve gRPC trailers for streaming responses (e.g. if `RetryRule.onUnprocessed()` is used, `RetryLimiter#onCompletedAttempt` may be invoked before the trailers arrive for a streaming response). In general (and not just for gRPC), I guess users will have to keep track of which properties will be available in `RetryLimiter` based on how `RetryConfig#requiresResponseTrailers` is set.

Yes, and this is difficult even for me to follow in the code. I am not sure there is a reason for having a gRPC retry limiter without a gRPC retry rule, but maybe we can make that decision explicit in RetryLimiter and enable requiresResponseTrailers if the limiter wants it.
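
Purely as a sketch of that suggestion, one possible shape; the `RetryLimiter` interface is introduced by this PR, but the `requiresResponseTrailers()` default method is only an idea from this comment, not existing API, and the hook signatures are simplified:

```java
// Hypothetical shape only; hook parameters are elided for brevity.
public interface RetryLimiter {

    // Decides whether an already-approved retry may actually proceed.
    boolean shouldRetry(/* attempt context */);

    // Records the outcome of a finished attempt.
    void onCompletedAttempt(/* attempt result */);

    // Sketch: lets a limiter such as GrpcRetryLimiter opt in to waiting for trailers,
    // so RetryConfig#requiresResponseTrailers could be enabled automatically.
    default boolean requiresResponseTrailers() {
        return false;
    }
}
```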

* When determining whether a retry should be done, if users would like to retry on a certain `grpc-status`, it seems like they should set both the `GrpcRetryLimiter` and the `RetryRule`, since the two seem to have an AND relationship (if they would like to use `GrpcRetryLimiter` at all). In general, I'm still trying to work out the exact distinction between `RetryRule` and `RetryLimiter`.

RetryRule decides whether a retry should be done; the retry limiter can block a retry based on whatever rules it implements (like max retries in flight, max retries per minute, a token bucket, ...), and it is only triggered when a retry is about to happen. GrpcRetryLimiter tries to mimic the gRPC retry-throttling logic defined here: https://github.com/grpc/proposal/blob/master/A6-client-retries.md#throttling-retry-attempts-and-hedged-rpcs. It is expected to be used with a gRPC retry rule, but the rule only covers the retryable statuses of the gRPC retry policy; everything else is covered by other parts of the Armeria retry config.

e.g.

RetryRule.builder()
         .onGrpcTrailers((ctx, trailers) -> trailers.containsInt("grpc-status", 15))
...
new GrpcRetryLimiter(0, 0, 0, List.of(15))
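
For context, here is a minimal sketch of the throttling algorithm from the gRPC A6 proposal linked above (a token bucket driven by maxTokens and tokenRatio). This illustrates the algorithm only; it is not the actual GrpcRetryLimiter code in this PR, and all names are made up:

```java
// Sketch of gRPC A6 retry throttling: tokens start at maxTokens, drop by 1 on a
// retryable failure, rise by tokenRatio on success, and retries are blocked while
// tokens <= maxTokens / 2.
final class GrpcThrottlingSketch {
    private final double maxTokens;
    private final double tokenRatio;
    private double tokens;

    GrpcThrottlingSketch(double maxTokens, double tokenRatio) {
        this.maxTokens = maxTokens;
        this.tokenRatio = tokenRatio;
        this.tokens = maxTokens;
    }

    synchronized void onRetryableFailure() {
        tokens = Math.max(0, tokens - 1);
    }

    synchronized void onSuccess() {
        tokens = Math.min(maxTokens, tokens + tokenRatio);
    }

    synchronized boolean retryAllowed() {
        return tokens > maxTokens / 2;
    }
}
```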
* Maybe a nit, but it seems like `RetryLimiter#shouldRetry` may be called even if a retry won't actually execute (e.g. due to `setResponseTimeout` checks, request abortion, etc.). While it may not be a big deal, this may be a little awkward for the `RetryRateLimiter` implementation, which acquires a permit.

This is a valid concern and I am happy to move the check to another point in the timeline. I had a lot of difficulty fully understanding the flow and finding where to add the two hooks (shouldRetry and onCompletedAttempt). It is true that we only want to check when we are actually going to do the retry, but waiting out the backoff just to cancel the retry is also a bit weird. As long as abortions and errors are exceptional and infrequent flows, I think it is fine to miscount by 1 from time to time. But I don't have a strong opinion; moving the check so it happens as late as possible is fine with me too.
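
To make the permit concern concrete, here is a rough sketch of a rate-based limiter that spends a permit inside shouldRetry, assuming Guava's RateLimiter and the hook names discussed above; this is not the PR's actual RetryRateLimiter implementation:

```java
import com.google.common.util.concurrent.RateLimiter;

// Illustrative only: a rate-based limiter that consumes a permit inside shouldRetry.
// If shouldRetry is invoked for a retry that never executes (timeout, abort, ...),
// the permit is still spent, which is the miscount discussed above.
final class RateLimitedRetrySketch {
    private final RateLimiter rateLimiter;

    RateLimitedRetrySketch(double retriesPerSecond) {
        this.rateLimiter = RateLimiter.create(retriesPerSecond);
    }

    /** Returns true if a retry permit is available right now, consuming it. */
    boolean shouldRetry() {
        return rateLimiter.tryAcquire();
    }

    /** Nothing to record for a pure rate limiter; shown for symmetry with the hooks. */
    void onCompletedAttempt(boolean success) {
        // no-op in this sketch
    }
}
```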

I'm nitpicking because I'm still trying to grasp the full picture at the moment. I would also like to do some research on gRPC's implementation (as well as Envoy's) to get a better idea of retry budgets in general before leaving more comments.

No worries, it is a very niche topic, and my feeling is that, with all the edge cases that can happen with IO, delays, etc., we can only expect to do it "good enough".

friscoMad avatar Aug 13 '25 16:08 friscoMad