redpanda
raft: coordinated recovery
Cover letter
This PR schedules the recovery of many partitions in an orderly sequence and limits how many recoveries may run against a particular target node at once. Previously, all partitions attempted to recover concurrently, bounded only by a concurrency limit on the sending side.
The result should be improved system stability at high partition counts when nodes are restarted under load.
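To illustrate the scheduling idea described above, here is a minimal Python sketch. It is not the PR's implementation (which lives in the C++ raft code); the class `RecoveryCoordinator`, its methods, and the `max_per_target` parameter are hypothetical names chosen for illustration. It assumes recoveries are queued per target node and started on a periodic tick, up to a per-target limit.

```python
from collections import defaultdict, deque


class RecoveryCoordinator:
    """Hypothetical sketch: caps concurrent recoveries per target node."""

    def __init__(self, max_per_target):
        self.max_per_target = max_per_target
        # target node -> FIFO queue of partitions waiting to recover
        self.pending = defaultdict(deque)
        # target node -> set of partitions currently recovering
        self.active = defaultdict(set)

    def request(self, partition, target):
        """Queue a partition for recovery toward the given target node."""
        self.pending[target].append(partition)

    def tick(self):
        """Start queued recoveries, in order, up to each target's limit."""
        started = []
        for target, queue in self.pending.items():
            while queue and len(self.active[target]) < self.max_per_target:
                partition = queue.popleft()
                self.active[target].add(partition)
                started.append((partition, target))
        return started

    def finish(self, partition, target):
        """Mark a recovery as complete, freeing a slot on the target."""
        self.active[target].discard(partition)


# Usage: five partitions need to recover to node-1, one to node-2.
coord = RecoveryCoordinator(max_per_target=2)
for p in range(5):
    coord.request(p, target="node-1")
coord.request(100, target="node-2")

first = coord.tick()           # only two recoveries start on node-1
coord.finish(0, "node-1")      # one slot frees up
second = coord.tick()          # the next queued partition starts
```

In this sketch the limit is enforced on the receiving side, so a restarted node is never asked to absorb more than `max_per_target` recoveries at once, regardless of how many senders have partitions queued for it.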
Fixes: https://github.com/redpanda-data/redpanda/issues/5958
Backport Required
- [ ] not a bug fix
- [ ] issue does not exist in previous branches
- [ ] papercut/not impactful enough to backport
- [ ] v22.2.x
- [ ] v22.1.x
- [ ] v21.11.x
UX changes
Describe in plain language how this PR affects an end-user. What topic flags, configuration flags, command line flags, deprecation policies etc are added/changed.
Release notes
Is this the issue - https://github.com/redpanda-data/redpanda/issues/5958 ?
Note: RaftAvailabilityTest is a little unstable because, when recovering, the coordinator may face a long (5s) wait for a tick. The test would benefit from a mode in which the recovery coordinator does not wait long when it can see that the total partition count is well below the limit on concurrent recoveries.
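One way the suggested mode could look, sketched in Python under stated assumptions: the function name `recovery_wait_seconds`, the `fast_fraction` threshold used to define "well below", and the zero-wait fast path are all hypothetical and not part of this PR. Only the 5s tick comes from the note above.

```python
def recovery_wait_seconds(total_partitions, concurrent_limit,
                          tick_interval=5.0, fast_fraction=0.5):
    """Return how long the coordinator should wait before starting recovery.

    Assumption: "well below the limit" means at most `fast_fraction` of
    `concurrent_limit`. In that case every partition can recover at once,
    so there is nothing to coordinate and no reason to wait for a tick.
    """
    if total_partitions <= concurrent_limit * fast_fraction:
        return 0.0
    return tick_interval


# A small test cluster skips the wait; a heavily loaded node keeps the tick.
small = recovery_wait_seconds(total_partitions=10, concurrent_limit=100)
large = recovery_wait_seconds(total_partitions=90, concurrent_limit=100)
```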