redpanda
r/consensus: do not require learner promotion to leave joint consensus
Cover letter
In the Redpanda Raft implementation, a reconfiguration is cancelled by reversing the direction of the configuration change. If the change is still in its first phase, i.e. the new nodes have only been added as learners to the current configuration, they are simply removed. Cancelling a change after the reconfiguration has entered the joint state requires swapping the old and new configurations in the joint Raft group configuration. In that case cancellation may never finish, even if only one node is unavailable, because that node may be a voter that was demoted to a learner in the last step before its removal, and reversing the change would require promoting it back to a voter.
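The two cancellation cases described above can be sketched with a deliberately simplified model. This is a hypothetical illustration, not Redpanda code: `Members`, `GroupConfig` and `cancel_reconfiguration` are invented names, the model assumes every learner in the first phase is a node being added, and the real reconfiguration code goes through more intermediate steps than this.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Members:
    """Voters and learners of one Raft configuration (hypothetical model)."""
    voters: FrozenSet[int]
    learners: FrozenSet[int] = frozenset()


@dataclass(frozen=True)
class GroupConfig:
    """`old` is set only while the group is in joint consensus."""
    current: Members
    old: Optional[Members] = None


def cancel_reconfiguration(cfg: GroupConfig) -> GroupConfig:
    """Reverse the direction of an in-flight change (simplified model)."""
    if cfg.old is None:
        # First phase: the nodes being added are still learners, so drop them.
        return GroupConfig(Members(cfg.current.voters))
    # Joint phase: swap old and new so the group moves back toward the old set.
    return GroupConfig(current=cfg.old, old=cfg.current)
```

For example, cancelling `GroupConfig(Members(frozenset({1, 2, 3}), frozenset({4})))` yields a simple configuration with voters (1,2,3) and no learners, while cancelling a joint configuration just exchanges its two halves.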
To allow the configuration change to finish, we let Raft leave joint consensus before all learners are promoted to voters. This change is safe because learners do not vote and are not counted in any quorum, so they do not affect Raft's safety guarantees, and it lets us reliably cancel a partition movement when one of the nodes is down.
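A minimal sketch of the relaxed exit condition, assuming the same kind of simplified in-memory model of a group configuration. `pending_learners`, `can_leave_joint_before` and `can_leave_joint_after` are hypothetical names chosen for this illustration, not the Redpanda consensus API, and the "before" predicate only models the behaviour as described in this cover letter.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Members:
    voters: FrozenSet[int]
    learners: FrozenSet[int] = frozenset()


@dataclass(frozen=True)
class GroupConfig:
    current: Members
    old: Optional[Members] = None  # present only in joint consensus


def pending_learners(cfg: GroupConfig) -> FrozenSet[int]:
    """All learners in either half of the configuration."""
    learners = cfg.current.learners
    if cfg.old is not None:
        learners = learners | cfg.old.learners
    return learners


def can_leave_joint_before(cfg: GroupConfig, joint_committed: bool) -> bool:
    # Previous behaviour (as modelled here): wait until no learners remain.
    return cfg.old is not None and joint_committed and not pending_learners(cfg)


def can_leave_joint_after(cfg: GroupConfig, joint_committed: bool) -> bool:
    # Relaxed rule: learners never count toward a quorum, so they are ignored.
    return cfg.old is not None and joint_committed
```

With a joint configuration whose old half still holds learner 4 (e.g. current voters (1,2,3), old voters (1,2,3) and learner (4)), the old predicate stays false for as long as node 4 cannot be promoted, whereas the relaxed one only waits for the joint configuration to be committed.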
Fixes #ISSUE-NUMBER, Fixes #ISSUE-NUMBER, ...
Backport Required
- [ ] not a bug fix
- [ ] issue does not exist in previous branches
- [ ] papercut/not impactful enough to backport
- [ ] v22.2.x
- [ ] v22.1.x
- [ ] v21.11.x
UX changes
Describe in plain language how this PR affects an end-user. What topic flags, configuration flags, command line flags, deprecation policies etc are added/changed.
Release notes
/ci-repeat 5 debug skip-units dt-repeat=10 tests/rptest/tests/partition_balancer_test.py tests/rptest/tests/partition_move_interruption_test.py
/ci-repeat 5 debug skip-units dt-repeat=5 tests/rptest/tests/partition_balancer_test.py tests/rptest/tests/partition_move_interruption_test.py
/ci-repeat 1
Cancellation of change when reconfiguration entered a Joint state requires swapping old and new configurations in Joint raft group configuration.
@mmaslankaprv can you add the full list of configuration transitions that led to the bug? I'm looking at configuration_change_strategy_v4::cancel_update_in_joint_state and it appears to leave the joint state, no? (i.e. _cfg._old = nullptr after it is done)
We established that this is not required. Thank you @ztlpn
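The example that we discussed, for posterity (C: current configuration, o: old configuration, present only in the joint state, v: voters, l: learners):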
(1,2,3) -> (1,2,4)
init: C: v:(1,2,3), l:() | o: -
1. C: v:(1,2,3), l:(4) | o: - transitional
2. C: v:(1,2,3,4), l:() | o: - transitional
3. C: v:(1,2,4), l:() | o: v:(1,2,3,4), l:() - joint
4. C: v:(1,2,4), l:() | o: v:(1,2,4), l:(3) - joint
cancel @ 1.
2' C: v:(1,2,3), l:() | o: - simple
cancel @ 2.
3' C: v:(1,2,3), l:() | o: v:(1,2,3,4), l:() - joint
4' C: v:(1,2,3), l:() | o: v:(1,2,3), l:(4) - joint
cancel @ 3.
4' C: v:(1,2,3,4), l:() | o: - transitional
5' C: v:(1,2,3), l:() | o: v:(1,2,3,4), l:() - joint
6' C: v:(1,2,3), l:() | o: v:(1,2,3), l:(4) - joint
7. C: v:(1,2,3), l:() | o: - simple
cancel @ 4.
4' C: v:(1,2,4), l:(3) | o: - transitional
5' C: v:(1,2,3,4), l:() | o: - transitional
6' C: v:(1,2,3), l:() | o: v:(1,2,3,4), l:() - joint
7' C: v:(1,2,3), l:() | o: v:(1,2,3), l:(4) - joint
8. C: v:(1,2,3), l:() | o: - simple
The problem is with the case cancel @ 4. If node 3 is unavailable at step 4', we are stuck. But that is a transitional configuration, not a joint one, so the change in this PR doesn't really help.
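To make the cancel @ 4 argument concrete, the sketch below encodes that sequence as data (labels and kinds copied from the list above) and looks for the first step whose successor needs an unavailable learner promoted to voter in the current configuration. `Members`, `Config`, `CANCEL_AT_4`, `promoted` and `first_blocked_step` are hypothetical names for this illustration only; the model ignores log catch-up, commit and leadership.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Members:
    voters: FrozenSet[int]
    learners: FrozenSet[int] = frozenset()


@dataclass(frozen=True)
class Config:
    current: Members
    old: Optional[Members] = None  # set only in the joint state


def m(voters: Iterable[int], learners: Iterable[int] = ()) -> Members:
    return Members(frozenset(voters), frozenset(learners))


# The "cancel @ 4" sequence from the thread: (label, kind, configuration).
Step = Tuple[str, str, Config]
CANCEL_AT_4: List[Step] = [
    ("4", "joint", Config(m((1, 2, 4)), m((1, 2, 4), (3,)))),
    ("4'", "transitional", Config(m((1, 2, 4), (3,)))),
    ("5'", "transitional", Config(m((1, 2, 3, 4)))),
    ("6'", "joint", Config(m((1, 2, 3)), m((1, 2, 3, 4)))),
    ("7'", "joint", Config(m((1, 2, 3)), m((1, 2, 3), (4,)))),
    ("8", "simple", Config(m((1, 2, 3)))),
]


def promoted(before: Members, after: Members) -> FrozenSet[int]:
    """Nodes that are learners in `before` and voters in `after`."""
    return before.learners & after.voters


def first_blocked_step(
    steps: List[Step], unavailable: FrozenSet[int]
) -> Optional[Tuple[str, str, FrozenSet[int]]]:
    """Return (label, kind, nodes) of the first step that needs an
    unavailable learner promoted, or None if every step can proceed."""
    for (label, kind, cfg), (_, _, nxt) in zip(steps, steps[1:]):
        blocked = promoted(cfg.current, nxt.current) & unavailable
        if blocked:
            return label, kind, blocked
    return None
```

In this model, `first_blocked_step(CANCEL_AT_4, frozenset({3}))` reports step 4' with kind "transitional", so relaxing the rule for leaving joint consensus does not unblock it; with only node 4 down, no step in this sequence needs node 4 promoted in the current configuration.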