raft: coordinated recovery

Open jcsp opened this issue 3 years ago • 1 comments

Cover letter

This PR changes how the recovery of many partitions is scheduled. Recoveries now run in an orderly sequence, with a limit on how many may target a particular node at once. Previously, all partitions attempted to recover concurrently, bounded only by a concurrency limit on the sending side.

The result should be improved system stability at high partition counts when nodes are restarted under load.
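
To illustrate the scheduling idea described above, here is a minimal sketch in Python. It is not the code in this PR: the class name, method names, and the tick-driven structure are assumptions made for illustration only. It shows queued recoveries being started in arrival order while the number of in-flight recoveries per target node is capped.

```python
from collections import defaultdict, deque


class RecoveryCoordinator:
    """Hypothetical scheduler that caps concurrent recoveries per target node."""

    def __init__(self, max_per_target):
        self.max_per_target = max_per_target  # assumed per-target limit
        self.pending = deque()                # (partition, target) in arrival order
        self.active = defaultdict(set)        # target -> partitions being recovered

    def request(self, partition, target):
        """Queue a recovery; nothing starts until the next tick."""
        self.pending.append((partition, target))

    def tick(self):
        """Start queued recoveries in order, skipping targets that are full."""
        started, still_waiting = [], deque()
        for partition, target in self.pending:
            if len(self.active[target]) < self.max_per_target:
                self.active[target].add(partition)
                started.append((partition, target))
            else:
                still_waiting.append((partition, target))
        self.pending = still_waiting
        return started

    def finish(self, partition, target):
        """Release the slot held by a completed recovery."""
        self.active[target].discard(partition)
```

In this sketch a busy target only delays its own queue: recoveries aimed at other nodes still start on the same tick, and a waiting partition is picked up on the first tick after `finish` frees a slot.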

Fixes: https://github.com/redpanda-data/redpanda/issues/5958

Backport Required

  • [ ] not a bug fix
  • [ ] issue does not exist in previous branches
  • [ ] papercut/not impactful enough to backport
  • [ ] v22.2.x
  • [ ] v22.1.x
  • [ ] v21.11.x

UX changes

Describe in plain language how this PR affects an end-user. What topic flags, configuration flags, command line flags, deprecation policies etc are added/changed.

Release notes

jcsp avatar Sep 12 '22 18:09 jcsp

Is this the issue - https://github.com/redpanda-data/redpanda/issues/5958 ?

mmedenjak avatar Oct 04 '22 19:10 mmedenjak

Note: RaftAvailabilityTest is a little unstable because recovery may wait a long time (up to 5s) for a coordinator tick. It would benefit from a mode where the recovery coordinator skips the long wait when it can see that the total partition count is well below the limit on concurrent recoveries.
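
A rough sketch of what such a mode could look like, in Python for brevity. Everything here is hypothetical: the function name, the short wait value, and the "well below" threshold (half the limit) are assumptions, and only the 5s figure comes from the note above.

```python
def coordinator_wait_seconds(total_partitions, concurrency_limit,
                             normal_wait=5.0, short_wait=0.1, low_fraction=0.5):
    """Pick how long the coordinator waits before its next tick.

    Hypothetical policy: if the total partition count is well below the
    concurrent-recovery limit (here: under low_fraction of it), every
    recovery can start at once, so a long wait buys nothing.
    """
    if total_partitions < concurrency_limit * low_fraction:
        return short_wait
    return normal_wait
```

With a policy like this, a small test cluster would take the short path, while a node with partition counts near or above the limit would keep the existing wait.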

jcsp avatar Nov 28 '22 22:11 jcsp