Extract `AbstractRetryingClient.State` into `RetryContext` and `RetryCounter`
Motivation:
Currently, `AbstractRetryingClient` manages `ctx.attr(STATE)`, while `Retrying(Rpc)Client` passes a "backpack state" by forwarding multiple parameters through internal methods. This leads to several issues:
- Backpacking that many parameters makes `Retrying(Rpc)Client` harder to read and changes to it error-prone, as some parameters share the same types.
- Understanding the retry control flow requires jumping between `AbstractRetryingClient` and `Retrying(Rpc)Client`, reducing readability.
- The distinction between the two states is unclear, making it harder to quickly grasp the high-level retry logic in Armeria.
This PR removes `AbstractRetryingClient.State`, splitting it into `RetryContext` and `RetryCounter`, which are now owned by `Retrying(Rpc)Client`. All methods referencing `AbstractRetryingClient.State` are removed. With that, `AbstractRetryingClient` is left with only two non-trivial methods: `getDelay` and `scheduleNextRetry`. Since we plan to extract scheduling from `AbstractRetryingClient` and then delete it in a later PR, `AbstractRetryingClient` is made private already to keep the overall change to the public API simple [1].
Finally, the PR updates `Retrying(Rpc)Client` to use the new `RetryContext` and refactors its code for improved clarity.
Modifications:
- Make `AbstractRetryingClient` private
- Introduce `RetryCounter` and move the counting state and logic from `AbstractRetryingClient.State` into it
- Introduce `RetryContext`, an immutable value class holding the non-attempt-specific retry state from `AbstractRetryingClient.State`, such as the original request or the `HttpRequestDuplicator`
- Remove `AbstractRetryingClient.State`
- Introduce `RetryAttempt`, an immutable class holding the attempt context and response (rough shapes of these new classes are sketched below)
- Rename variables inside `Retrying(Rpc)Client` for better readability
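For readers who have not opened the diff, the rough shapes of the three new classes might look something like the minimal sketch below. All field and method names (e.g. `maxTotalAttempts`, `tryAcquireAttempt()`, `reqDuplicator()`) are illustrative guesses, not necessarily the actual API introduced by this PR.

```java
import java.util.concurrent.atomic.AtomicInteger;

import com.linecorp.armeria.client.ClientRequestContext;
import com.linecorp.armeria.common.HttpRequest;
import com.linecorp.armeria.common.HttpRequestDuplicator;
import com.linecorp.armeria.common.HttpResponse;

/** Hypothetical: the counting state and logic extracted from AbstractRetryingClient.State. */
final class RetryCounter {
    private final int maxTotalAttempts;
    private final AtomicInteger attemptsSoFar = new AtomicInteger();

    RetryCounter(int maxTotalAttempts) {
        this.maxTotalAttempts = maxTotalAttempts;
    }

    /** Reserves the next attempt number, or returns -1 when the attempt budget is exhausted. */
    int tryAcquireAttempt() {
        final int attemptNo = attemptsSoFar.incrementAndGet();
        return attemptNo <= maxTotalAttempts ? attemptNo : -1;
    }
}

/** Hypothetical: the immutable, non-attempt-specific state shared by all attempts of one request. */
final class RetryContext {
    private final ClientRequestContext ctx;            // context of the original request
    private final HttpRequest req;                     // the original request
    private final HttpRequestDuplicator reqDuplicator; // replays the request body for each attempt
    private final HttpResponse res;                    // the response exposed to the caller

    RetryContext(ClientRequestContext ctx, HttpRequest req,
                 HttpRequestDuplicator reqDuplicator, HttpResponse res) {
        this.ctx = ctx;
        this.req = req;
        this.reqDuplicator = reqDuplicator;
        this.res = res;
    }

    ClientRequestContext ctx() { return ctx; }
    HttpRequest req() { return req; }
    HttpRequestDuplicator reqDuplicator() { return reqDuplicator; }
    HttpResponse res() { return res; }
}

/** Hypothetical: the per-attempt pair of derived context and in-flight response. */
final class RetryAttempt {
    private final ClientRequestContext attemptCtx;
    private final HttpResponse attemptRes;

    RetryAttempt(ClientRequestContext attemptCtx, HttpResponse attemptRes) {
        this.attemptCtx = attemptCtx;
        this.attemptRes = attemptRes;
    }

    ClientRequestContext ctx() { return attemptCtx; }
    HttpResponse res() { return attemptRes; }
}
```

The intent is only to show that counting, shared per-request state, and per-attempt state end up in three small single-purpose classes instead of one mutable `State`.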
Result:
- [Breaking] `AbstractRetryingClient` is now private.
- Improves readability of `RetryingClient` and `RetryingRpcClient`
[1]: Making `AbstractRetryingClient` private should be less of a problem anyway, as `AbstractRetryingClient` was likely never overridden and customized, since the current design makes it hard to do so correctly. This is of course not guaranteed, so we need to consider whether to provide a migration path. At the moment I did not implement one, but please let me know if you think this is necessary.
🔍 Build Scan® (commit: 5cb758abefdc81058ed21781bd9642ab74614a88)
| Job name | Status | Build Scan® |
|---|---|---|
| build-ubicloud-standard-16-jdk-8 | ✅ | https://ge.armeria.dev/s/cotvx2ylbiowa |
| build-ubicloud-standard-16-jdk-21-snapshot-blockhound | ✅ | https://ge.armeria.dev/s/tifmja5bayu62 |
| build-ubicloud-standard-16-jdk-17-min-java-17-coverage | ❌ (failure) | https://ge.armeria.dev/s/mjqw4pqqqyjcu |
| build-ubicloud-standard-16-jdk-17-min-java-11 | ✅ | https://ge.armeria.dev/s/5o62d75zzdnqq |
| build-ubicloud-standard-16-jdk-17-leak | ✅ | https://ge.armeria.dev/s/eqjl6bryeuxog |
| build-ubicloud-standard-16-jdk-11 | ✅ | https://ge.armeria.dev/s/qvzgyhoimcugo |
| build-macos-latest-jdk-21 | ❌ (failure) | https://ge.armeria.dev/s/smu2v3oafie2q |
Codecov Report
❌ Patch coverage is 86.05442% with 41 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.09%. Comparing base (8150425) to head (d269081).
⚠️ Report is 192 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #6411 +/- ##
============================================
- Coverage 74.46% 74.09% -0.37%
- Complexity 22234 23027 +793
============================================
Files 1963 2064 +101
Lines 82437 86166 +3729
Branches 10764 11310 +546
============================================
+ Hits 61385 63847 +2462
- Misses 15918 16906 +988
- Partials 5134 5413 +279
FYI: This PR is not exclusively for hedging. It is more of a cleanup so we can all reason more clearly about `RetryingClient`, both when discussing changes for hedging and in general.
Let me address your question anyway.
> I was imagining there would be a single method which invokes retry attempts and returns a representation of the asynchronous result so that it could later be used to complete the original response or be cancelled (e.g. `RetryAttempt retry(RetryContext)`).
Abstracting away a single retry and returning some kind of handle is certainly something we need to do for hedging, yes. This is because we want to easily start the original attempt as well as schedule the hedged one, while keeping references to the attempts in order to abort the one that loses and is not committed. Compared to the other changes, this is a fairly simple and small change that we can make when implementing hedging.
Maybe to give you a general overview, here are the things we need to have for hedging later on:
1. A way to check whether attempts are in flight or retry tasks are scheduled. This determines whether a completing attempt is the last one (i.e. `maxAttempts` reached and no others active) and thus whether to abort or commit it.
2. A mechanism to abort in-flight attempts in case a concurrent attempt commits.
3. A mechanism to cancel a scheduled retry task in case another retry task was scheduled with a shorter delay.
4. A mechanism to cancel a scheduled retry task when the retry is completed.
5. A mechanism to reschedule a retry task in case another attempt returns a server pushback that is greater than the retry task's scheduling delay.
For 3., 4., and 5. I would use the `RetryScheduler` from the previous PR.
Originally, 1. and 2. were handled by the `RetriedRequest`, helped by the use of a single event loop. Aiming for the least invasive change set, for 1. we just need some counter in `RetryContext` and synchronize it with the `RetryScheduler`.
For 2., I think we can solve it via a `CompletableFuture<RetryAttempt> committedAttempt`, which contains the attempt that is going to be committed. Every new attempt would attach a listener to it and call `abortAttempt` if it notices it lost the race. This future would also be used to call `commitAttempt`. The handlers in `doExecute` for `rctx.req()` and `rctx.res()` would complete this future first (exceptionally), and listeners on `committedAttempt` would then call `handleException`. Furthermore, once an attempt is prepared and ready for a decision, it would first check `committedAttempt` to see whether the race is already over.
For 1. and 2. we would have to think carefully about concurrency though.
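To make the `committedAttempt` idea above a bit more concrete, here is a minimal, hypothetical sketch. It reuses the `RetryAttempt` shape sketched in the description; `HedgedRetryState`, `register`, `tryCommit`, `fail`, and `raceIsOver` are placeholder names, not existing Armeria API.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch of the committedAttempt idea for hedging (not part of this PR). */
final class HedgedRetryState {
    // Completed exactly once with the attempt that wins the race, or exceptionally
    // when the caller's request/response fails or is aborted.
    private final CompletableFuture<RetryAttempt> committedAttempt = new CompletableFuture<>();

    /** Registers a started attempt; it aborts itself as soon as it notices another attempt won. */
    void register(RetryAttempt attempt) {
        committedAttempt.whenComplete((winner, cause) -> {
            if (winner != attempt) {
                // This attempt lost the race, or the whole retried request failed: abort it.
                attempt.res().abort();
            }
        });
    }

    /** Tries to commit the given attempt; returns false if another attempt already won. */
    boolean tryCommit(RetryAttempt attempt) {
        return committedAttempt.complete(attempt);
    }

    /** Fails the whole retried request, e.g. from the handlers on rctx.req()/rctx.res(). */
    void fail(Throwable cause) {
        committedAttempt.completeExceptionally(cause);
    }

    /** Attempts that are prepared and ready for a decision can check this first. */
    boolean raceIsOver() {
        return committedAttempt.isDone();
    }
}
```

The single future gives every attempt one place to race on: whoever completes it first is committed, and every other attempt, including ones registered after the fact, sees the outcome through its listener and aborts itself.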
Hi @jrhee17, do you have any updates on this?
Sorry about the delay, just got back from a long vacation
> I think we can solve it via a `CompletableFuture<RetryAttempt> committedAttempt` which contains the attempt that is going to be committed.
Sounds good to me 👍 Just wanted to make sure of the plan after this PR is merged.
> For 3., 4., and 5. I would use the `RetryScheduler` from the previous PR.
I prefer that hedging is implemented first so that we can be sure that refactoring isn't done just for the sake of refactoring (unless there is a good reason to do so like done in this PR). While cleaner code is always appreciated, I think it's difficult to judge whether a refactor really helps since we never know how the code/requirements will evolve unless the benefit is very obvious. It would also help convince other maintainers as well.
> I prefer that hedging is implemented first so that we can be sure that refactoring isn't done just for the sake of refactoring (unless there is a good reason to do so like done in this PR). While cleaner code is always appreciated, I think it's difficult to judge whether a refactor really helps since we never know how the code/requirements will evolve unless the benefit is very obvious. It would also help convince other maintainers as well.
Sure. I will have some time in the upcoming week, but after that I will be less available. So if I can also get a green light on this refactoring from the other maintainers, I can build the e2e hedging prototype on top of it.