r/consensus: do not require learner promotion to leave joint consensus

Open mmaslankaprv opened this issue 2 years ago • 4 comments

Cover letter

In Redpanda's Raft implementation, a reconfiguration is cancelled by reversing the direction of the configuration change. When the change is still in its first phase, i.e. new nodes have been added as learners to the current configuration, those nodes are simply removed. Cancelling a change after the reconfiguration has entered the joint state requires swapping the old and new configurations in the joint Raft group configuration. In that case the cancellation may never finish even if only one node is unavailable, because that node may be a voter that was demoted to learner in the last step before its removal.

To allow the configuration change to finish, we let Raft leave joint consensus before all learners are promoted to voters. This change is safe because learners do not affect the safety guarantees, and it lets us reliably cancel a partition movement when one of the nodes is down.
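As a rough illustration of why learners do not affect safety, the sketch below models a quorum check under joint consensus. It is not Redpanda's code: `GroupConfig`, `majority` and `has_quorum` are hypothetical names, and the model assumes that a decision needs a majority of the current voters and, while in the joint state, also a majority of the old voters. Learners are carried in the configuration but never counted.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class GroupConfig:
    """Hypothetical model of a Raft group configuration (not Redpanda's types)."""
    voters: FrozenSet[int]
    learners: FrozenSet[int] = frozenset()
    # Set only while the group is in the joint state.
    old_voters: Optional[FrozenSet[int]] = None


def majority(voters: FrozenSet[int], acked: FrozenSet[int]) -> bool:
    """True if strictly more than half of `voters` have acknowledged."""
    return len(voters & acked) > len(voters) // 2


def has_quorum(cfg: GroupConfig, acked: FrozenSet[int]) -> bool:
    """Quorum check: learners are deliberately ignored."""
    ok = majority(cfg.voters, acked)
    if cfg.old_voters is not None:
        # Joint state: the old voter set must agree as well.
        ok = ok and majority(cfg.old_voters, acked)
    return ok


# Joint configuration with node 3 demoted to learner and unreachable.
joint = GroupConfig(
    voters=frozenset({1, 2, 4}),
    learners=frozenset({3}),
    old_voters=frozenset({1, 2, 4}),
)
print(has_quorum(joint, frozenset({1, 2, 4})))
```

In this model the result is the same whether or not node 3 appears in `learners`, which is the intuition behind treating pending learner promotion as irrelevant to safety.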

Fixes #ISSUE-NUMBER, Fixes #ISSUE-NUMBER, ...

Backport Required

  • [ ] not a bug fix
  • [ ] issue does not exist in previous branches
  • [ ] papercut/not impactful enough to backport
  • [ ] v22.2.x
  • [ ] v22.1.x
  • [ ] v21.11.x

UX changes

Describe in plain language how this PR affects an end-user. What topic flags, configuration flags, command line flags, deprecation policies etc are added/changed.

Release notes

mmaslankaprv avatar Oct 17 '22 18:10 mmaslankaprv

/ci-repeat 5 debug skip-units dt-repeat=10 tests/rptest/tests/partition_balancer_test.py tests/rptest/tests/partition_move_interruption_test.py

mmaslankaprv avatar Oct 18 '22 06:10 mmaslankaprv

/ci-repeat 5 debug skip-units dt-repeat=5 tests/rptest/tests/partition_balancer_test.py tests/rptest/tests/partition_move_interruption_test.py

mmaslankaprv avatar Oct 18 '22 09:10 mmaslankaprv

/ci-repeat 5 debug skip-units dt-repeat=5 tests/rptest/tests/partition_balancer_test.py tests/rptest/tests/partition_move_interruption_test.py

mmaslankaprv avatar Oct 18 '22 17:10 mmaslankaprv

/ci-repeat 1

mmaslankaprv avatar Oct 19 '22 09:10 mmaslankaprv

Cancellation of change when reconfiguration entered a Joint state requires swapping old and new configurations in Joint raft group configuration.

@mmaslankaprv can you add a full list of the configuration transitions that led to the bug? I'm looking at configuration_change_strategy_v4::cancel_update_in_joint_state and it appears to leave the joint state, no? (i.e. _cfg._old = nullptr after it is done)

ztlpn avatar Nov 25 '22 13:11 ztlpn

We established that this is not required. Thank you @ztlpn

mmaslankaprv avatar Nov 25 '22 16:11 mmaslankaprv

The example that we discussed, for posterity:

(1,2,3)->(1,2,4)

init: c: v:(1,2,3), l:() | o: - 

1. C: v: (1,2,3), l:(4) | o:   - transitional
2. C: v: (1,2,3,4) l: () | o:  - transitional
3. C: v: (1,2,4) l: () | o: v:(1,2,3,4), l: () - joint
4. C: v: (1,2,4) l: () | o: v:(1,2,4), l: (3) - joint

cancel @ 1.

2': C: v:(1,2,3), l:() | o: - 

cancel @ 2.

3' C: v: (1,2,3) l: () | o: v:(1,2,3,4), l: () - joint
4' C: v: (1,2,3) l: () | o: v:(1,2,3), l: (4) - joint

cancel @ 3.

4' C: v:(1,2,3,4), l: () | o: - 
5' C: v: (1,2,3) l: () | o: v:(1,2,3,4), l: () - joint
6' C: v: (1,2,3) l: () | o: v:(1,2,3), l: (4) - joint
7. C: v: (1,2,3) l: () | o:  - simple

cancel @ 4.

4' C: v:(1,2,4), l: (3) | o: - transitional
5' C: v: (1,2,3,4) l: () | o: - transitional
6' C: v: (1,2,3) l: () | o: v:(1,2,3,4), l: () - joint
7' C: v: (1,2,3) l: () | o: v:(1,2,3), l: (4) - joint
8. C: v: (1,2,3) l: () | o:  - simple

The problem is with the "cancel @ 4." case. If node 3 is unavailable at step 4', we are stuck. But 4' is a transitional configuration, not a joint one, so the change in this PR doesn't really help.
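The stuck state at step 4' can be sketched as follows. This is a toy model, not Redpanda's logic: the function names are hypothetical, and it assumes that a learner has to be reachable in order to catch up and be promoted to voter.

```python
def voters_have_majority(voters: set, available: set) -> bool:
    """True if strictly more than half of the voters are reachable."""
    return len(voters & available) > len(voters) // 2


def can_promote_learners(learners: set, available: set) -> bool:
    """Assumption: every learner must be reachable before it can be promoted."""
    return learners <= available


# Step 4' of "cancel @ 4.": v:(1,2,4), l:(3), with node 3 down.
voters, learners = {1, 2, 4}, {3}
available = {1, 2, 4}

print(voters_have_majority(voters, available))   # True
print(can_promote_learners(learners, available))  # False
```

The voters still form a majority, so the group keeps working, but the move from 4' to 5' depends on promoting node 3 and therefore cannot proceed while it is down.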

ztlpn avatar Nov 25 '22 18:11 ztlpn