redpanda raft: coordinated recovery

raft: coordinated recovery

Open jcsp opened this issue 3 years ago • 1 comments

Cover letter

This PR changes how recovery is scheduled when many partitions need it. Recoveries now run in an orderly sequence, with a limit on how many may target a particular node at once, rather than all partitions attempting to recover concurrently up to a concurrency limit on the sending side.

The result should be improved system stability at high partition counts when nodes are restarted under load.
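
To illustrate the idea, here is a minimal sketch of per-target admission control. It is not the code in this PR: the class `RecoveryCoordinator`, its methods, and the `max_per_target` parameter are hypothetical names chosen for the example, and the FIFO ordering is an assumption about what "orderly sequence" could mean.

```python
from collections import defaultdict, deque


class RecoveryCoordinator:
    """Hypothetical sketch: admit recoveries in arrival order, capped per target node."""

    def __init__(self, max_per_target: int):
        self.max_per_target = max_per_target  # assumed limit, not a real config name
        self.pending = deque()                # (partition, target_node) in arrival order
        self.active = defaultdict(set)        # target_node -> partitions recovering now

    def request(self, partition: str, target_node: int) -> None:
        """Queue a partition that needs to recover a follower on target_node."""
        self.pending.append((partition, target_node))

    def tick(self) -> list:
        """Start each pending recovery whose target node still has spare capacity."""
        started = []
        still_pending = deque()
        for partition, target in self.pending:
            if len(self.active[target]) < self.max_per_target:
                self.active[target].add(partition)
                started.append((partition, target))
            else:
                # Target is saturated: keep the request queued for a later tick.
                still_pending.append((partition, target))
        self.pending = still_pending
        return started

    def finish(self, partition: str, target_node: int) -> None:
        """Release the slot held by a completed recovery."""
        self.active[target_node].discard(partition)
```

With `max_per_target=2`, three requests against node 1 and one against node 2 would start two recoveries to node 1 and one to node 2 on the first tick; the third node-1 request waits until a slot is released with `finish`.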

Fixes: https://github.com/redpanda-data/redpanda/issues/5958

Backport Required

[ ] not a bug fix
[ ] issue does not exist in previous branches
[ ] papercut/not impactful enough to backport
[ ] v22.2.x
[ ] v22.1.x
[ ] v21.11.x

UX changes

Describe in plain language how this PR affects an end-user. What topic flags, configuration flags, command line flags, deprecation policies etc are added/changed.

Release notes

Sep 12 '22 18:09 jcsp

Is this the issue - https://github.com/redpanda-data/redpanda/issues/5958 ?

Oct 04 '22 19:10 mmedenjak

Note: RaftAvailabilityTest is a little unstable because the recovery coordinator can wait a long time (5s) for a tick when recovering. The test would benefit from a mode where the recovery coordinator skips the long wait if it can see that the total partition count is well below the limit on concurrent recoveries.
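
A rough sketch of what such a mode could look like follows. Everything here is an assumption made for illustration: the function `next_tick_delay`, the `fast_fraction` threshold used to interpret "well below the limit", and the choice of an immediate tick on the fast path are not taken from the PR.

```python
def next_tick_delay(total_partitions: int,
                    max_concurrent: int,
                    tick_interval_s: float = 5.0,
                    fast_fraction: float = 0.5) -> float:
    """Return how long a (hypothetical) coordinator waits before its next tick.

    If the total partition count is at or below fast_fraction of the
    concurrency limit, recovery cannot exceed the limit by a wide margin,
    so the coordinator ticks immediately instead of waiting a full interval.
    """
    if total_partitions <= max_concurrent * fast_fraction:
        return 0.0
    return tick_interval_s
```

Under these assumptions, a small test cluster with 10 partitions and a limit of 100 would never wait for the 5s tick, while a cluster with 80 partitions would keep the normal interval.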

Nov 28 '22 22:11 jcsp
