Extract `AbstractRetryingClient.State` into `RetryContext` and `RetryCounter`
Motivation:
Currently, `AbstractRetryingClient` manages `ctx.attr(STATE)`, while `Retrying(Rpc)Client` passes a "backpack state" by forwarding multiple parameters through internal methods. This leads to several issues:
- Backpacking that many parameters makes `Retrying(Rpc)Client` harder to read and changes to it error-prone, as some parameters share the same types.
- Understanding the retry control flow requires jumping between `AbstractRetryingClient` and `Retrying(Rpc)Client`, reducing readability.
- The distinction between the two states is unclear, making it harder to quickly grasp the high-level retry logic in Armeria.
This PR removes `AbstractRetryingClient.State`, splitting it into `RetryContext` and `RetryCounter`, which are now owned by `Retrying(Rpc)Client`. All methods referencing `AbstractRetryingClient.State` are removed. With that, `AbstractRetryingClient` is left with only two non-trivial methods: `getDelay` and `scheduleNextRetry`. Since we plan to extract scheduling from `AbstractRetryingClient` and then delete it in a later PR, `AbstractRetryingClient` is made private already to keep the overall change to the public API simple [1].
Finally, the PR updates `Retrying(Rpc)Client` to use the new `RetryContext` and refactors its code for improved clarity.
Modifications:
- Make `AbstractRetryingClient` private
- Introduce `RetryCounter` and move the counting state and logic from `AbstractRetryingClient.State` into it
- Introduce `RetryContext`, an immutable value class holding the non-attempt-specific retry state from `AbstractRetryingClient.State`, such as the original request or the `HttpRequestDuplicator`
- Remove `AbstractRetryingClient.State`
- Introduce `RetryAttempt`, an immutable class holding the attempt context and response (rough shapes of these new classes are sketched below)
- Rename variables inside `Retrying(Rpc)Client` for better readability
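For readers who have not opened the diff, the rough shapes of the three new classes might look something like the minimal sketch below. All field and method names (e.g. `maxTotalAttempts`, `tryAcquireAttempt()`, `reqDuplicator()`) are illustrative guesses, not necessarily the actual API introduced by this PR.

```java
import java.util.concurrent.atomic.AtomicInteger;

import com.linecorp.armeria.client.ClientRequestContext;
import com.linecorp.armeria.common.HttpRequest;
import com.linecorp.armeria.common.HttpRequestDuplicator;
import com.linecorp.armeria.common.HttpResponse;

/** Hypothetical: the counting state and logic extracted from AbstractRetryingClient.State. */
final class RetryCounter {
    private final int maxTotalAttempts;
    private final AtomicInteger attemptsSoFar = new AtomicInteger();

    RetryCounter(int maxTotalAttempts) {
        this.maxTotalAttempts = maxTotalAttempts;
    }

    /** Reserves the next attempt number, or returns -1 when the attempt budget is exhausted. */
    int tryAcquireAttempt() {
        final int attemptNo = attemptsSoFar.incrementAndGet();
        return attemptNo <= maxTotalAttempts ? attemptNo : -1;
    }
}

/** Hypothetical: the immutable, non-attempt-specific state shared by all attempts of one request. */
final class RetryContext {
    private final ClientRequestContext ctx;            // context of the original request
    private final HttpRequest req;                     // the original request
    private final HttpRequestDuplicator reqDuplicator; // replays the request body for each attempt
    private final HttpResponse res;                    // the response exposed to the caller

    RetryContext(ClientRequestContext ctx, HttpRequest req,
                 HttpRequestDuplicator reqDuplicator, HttpResponse res) {
        this.ctx = ctx;
        this.req = req;
        this.reqDuplicator = reqDuplicator;
        this.res = res;
    }

    ClientRequestContext ctx() { return ctx; }
    HttpRequest req() { return req; }
    HttpRequestDuplicator reqDuplicator() { return reqDuplicator; }
    HttpResponse res() { return res; }
}

/** Hypothetical: the per-attempt pair of derived context and in-flight response. */
final class RetryAttempt {
    private final ClientRequestContext attemptCtx;
    private final HttpResponse attemptRes;

    RetryAttempt(ClientRequestContext attemptCtx, HttpResponse attemptRes) {
        this.attemptCtx = attemptCtx;
        this.attemptRes = attemptRes;
    }

    ClientRequestContext ctx() { return attemptCtx; }
    HttpResponse res() { return attemptRes; }
}
```

The intent is only to show that counting, shared per-request state, and per-attempt state end up in three small single-purpose classes instead of one mutable `State`.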
Result:
- [Breaking] `AbstractRetryingClient` is now private.
- Improves readability of `RetryingClient` and `RetryingRpcClient`
[1]: Making `AbstractRetryingClient` private should be less of a problem anyway, as `AbstractRetryingClient` was likely never overridden and customized, since the current design makes it hard to do so correctly. This is of course not guaranteed, so we need to consider whether to provide a migration path. At the moment I did not implement one, but please let me know if you think this is necessary.
🔍 Build Scan® (commit: 5cb758abefdc81058ed21781bd9642ab74614a88)
| Job name | Status | Build Scan® |
|---|---|---|
| build-ubicloud-standard-16-jdk-8 | ✅ | https://ge.armeria.dev/s/cotvx2ylbiowa |
| build-ubicloud-standard-16-jdk-21-snapshot-blockhound | ✅ | https://ge.armeria.dev/s/tifmja5bayu62 |
| build-ubicloud-standard-16-jdk-17-min-java-17-coverage | ❌ (failure) | https://ge.armeria.dev/s/mjqw4pqqqyjcu |
| build-ubicloud-standard-16-jdk-17-min-java-11 | ✅ | https://ge.armeria.dev/s/5o62d75zzdnqq |
| build-ubicloud-standard-16-jdk-17-leak | ✅ | https://ge.armeria.dev/s/eqjl6bryeuxog |
| build-ubicloud-standard-16-jdk-11 | ✅ | https://ge.armeria.dev/s/qvzgyhoimcugo |
| build-macos-latest-jdk-21 | ❌ (failure) | https://ge.armeria.dev/s/smu2v3oafie2q |
Codecov Report
❌ Patch coverage is 86.05442% with 41 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.09%. Comparing base (8150425) to head (d269081).
⚠️ Report is 192 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #6411 +/- ##
============================================
- Coverage 74.46% 74.09% -0.37%
- Complexity 22234 23027 +793
============================================
Files 1963 2064 +101
Lines 82437 86166 +3729
Branches 10764 11310 +546
============================================
+ Hits 61385 63847 +2462
- Misses 15918 16906 +988
- Partials 5134 5413 +279
FYI: This PR is not exclusively for hedging. It is more of a cleanup so we can all reason more clearly about `RetryingClient`, both when discussing changes for hedging and in general.
Let me address your question anyway.
> I was imagining there would be a single method which invokes retry attempts and returns a representation of the asynchronous result so that it could later be used to complete the original response or be cancelled (e.g. `RetryAttempt retry(RetryContext)`).
Abstracting away a single retry and returning some kind of handle is certainly something we need to do for hedging, yes. This is because we want to easily start the original attempt as well as schedule the hedged one, while keeping references to the attempts in order to abort the one that loses and is not committed. Compared to the other changes, this is a fairly simple and small change that we can make when implementing hedging.
Maybe to give you a general overview, here are the things we need to have for hedging later on:
1. A way to check whether attempts are in flight or retry tasks are scheduled. This determines whether a completing attempt is the last one (i.e. `maxAttempts` reached and no others active) and thus whether to abort or commit it.
2. A mechanism to abort in-flight attempts in case a concurrent attempt commits.
3. A mechanism to cancel a scheduled retry task in case another retry task was scheduled with a shorter delay.
4. A mechanism to cancel a scheduled retry task when the retry is completed.
5. A mechanism to reschedule a retry task in case another attempt returns a server pushback that is greater than the retry task's scheduling delay.
For 3., 4., and 5. I would use the `RetryScheduler` from the previous PR.
Originally, 1. and 2. were handled by the `RetriedRequest`, helped by the use of a single event loop. Aiming for the least invasive change set, for 1. we just need some counter in `RetryContext` and synchronize it with the `RetryScheduler`.
For 2., I think we can solve it via a `CompletableFuture<RetryAttempt> committedAttempt`, which contains the attempt that is going to be committed. Every new attempt would attach a listener to it and call `abortAttempt` if it notices it lost the race. This future would also be used to call `commitAttempt`. The handlers in `doExecute` for `rctx.req()` and `rctx.res()` would complete this future first (exceptionally), and listeners on `committedAttempt` would then call `handleException`. Furthermore, once an attempt is prepared and ready for a decision, it would first check `committedAttempt` to see whether the race is already over.
For 1. and 2. we would have to think carefully about concurrency though.
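To make the `committedAttempt` idea above a bit more concrete, here is a minimal, hypothetical sketch. It reuses the `RetryAttempt` shape sketched in the description; `HedgedRetryState`, `register`, `tryCommit`, `fail`, and `raceIsOver` are placeholder names, not existing Armeria API.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch of the committedAttempt idea for hedging (not part of this PR). */
final class HedgedRetryState {
    // Completed exactly once with the attempt that wins the race, or exceptionally
    // when the caller's request/response fails or is aborted.
    private final CompletableFuture<RetryAttempt> committedAttempt = new CompletableFuture<>();

    /** Registers a started attempt; it aborts itself as soon as it notices another attempt won. */
    void register(RetryAttempt attempt) {
        committedAttempt.whenComplete((winner, cause) -> {
            if (winner != attempt) {
                // This attempt lost the race, or the whole retried request failed: abort it.
                attempt.res().abort();
            }
        });
    }

    /** Tries to commit the given attempt; returns false if another attempt already won. */
    boolean tryCommit(RetryAttempt attempt) {
        return committedAttempt.complete(attempt);
    }

    /** Fails the whole retried request, e.g. from the handlers on rctx.req()/rctx.res(). */
    void fail(Throwable cause) {
        committedAttempt.completeExceptionally(cause);
    }

    /** Attempts that are prepared and ready for a decision can check this first. */
    boolean raceIsOver() {
        return committedAttempt.isDone();
    }
}
```

The single future gives every attempt one place to race on: whoever completes it first is committed, and every other attempt, including ones registered after the fact, sees the outcome through its listener and aborts itself.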
Hi @jrhee17, do you have any updates on this?
Sorry about the delay, just got back from a long vacation
> I think we can solve it via a `CompletableFuture<RetryAttempt> committedAttempt` which contains the attempt that is going to be committed.
Sounds good to me 👍 Just wanted to make sure of the plan after this PR is merged.
> For 3., 4., and 5. I would use the `RetryScheduler` from the previous PR.
I prefer that hedging is implemented first so that we can be sure that refactoring isn't done just for the sake of refactoring (unless there is a good reason to do so like done in this PR). While cleaner code is always appreciated, I think it's difficult to judge whether a refactor really helps since we never know how the code/requirements will evolve unless the benefit is very obvious. It would also help convince other maintainers as well.
> I prefer that hedging is implemented first so that we can be sure that refactoring isn't done just for the sake of refactoring (unless there is a good reason to do so like done in this PR). While cleaner code is always appreciated, I think it's difficult to judge whether a refactor really helps since we never know how the code/requirements will evolve unless the benefit is very obvious. It would also help convince other maintainers as well.
Sure. I will have some time in the upcoming week, but after that I will be less available. So if I can also get a green light on this refactoring from the other maintainers, I can build the e2e hedging prototype on top of it.