Leader steps down when removing itself
Opening a PR since it touches the election part a bit.
The idea here is to revisit the election implementation when a member is removed. If the current leader is removing itself, it steps down locally to stop accepting subsequent requests. The old leader then starts the voting thread to collect information from the other Raft members. More details are listed in the issue tracker, but the idea borrows a bit from the leader transfer mechanism in the Raft dissertation.
A leader needs an up-to-date log. Once the membership change is committed, a majority of nodes have a recent log, so we can safely start an election round with the remaining members.
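Roughly, the flow on the leader looks like the sketch below. This is only an illustration of the idea: the names (`stepDown`, `startVotingThread`, `raftMembers`, ...) are placeholders, not the actual jgroups-raft API.

```java
import java.util.HashSet;
import java.util.Set;

import org.jgroups.Address;

// Sketch only: what the leader could do when it applies the command removing itself.
class LeaderRemovalSketch {
    final Set<Address> raftMembers = new HashSet<>();
    Address localAddress;
    boolean leader;

    void applyRemoveServer(Address removed) {
        raftMembers.remove(removed);                  // adopt the new membership
        if (removed.equals(localAddress) && leader) {
            stepDown();                               // stop accepting subsequent client requests
            startVotingThread(raftMembers);           // election among the remaining Raft members
        }
    }

    void stepDown()                           { leader = false; /* stop heartbeats, reject new requests */ }
    void startVotingThread(Set<Address> mbrs) { /* collect vote information from mbrs */ }
}
```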
The change in the election is to request votes from the Raft members instead of the view members (roughly as in the sketch below). Running the test and Jepsen suites doesn't show any problem with this restriction. We could still enhance the test coverage here; one interesting scenario would enqueue a few requests and then remove the leader. I plan to extend the test suite in future changes, so I'll make a note of that.
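For illustration, the restriction amounts to something like this; `raftMembers()`, `localAddress` and `sendVoteRequest()` are stand-ins for whatever the election code actually uses:

```java
// Illustrative only: the voting thread asks the configured Raft members for their
// last log information, instead of every node in the current JGroups view.
void runVotingRound() {
    Collection<Address> targets = raftMembers();      // previously: channel.getView().getMembers()
    for (Address member : targets) {
        if (!member.equals(localAddress))
            sendVoteRequest(member);                  // collect term / last log index from each member
    }
}
```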
Closes #246.
I think we might have a bug with this alone. After the leader steps down and starts the voting thread, the remote nodes might not have committed the change yet, as the message could still be in flight.
I think the node should start the voting thread only after the commit index of the other nodes advances. Also, the dissertation mentions that a reconfiguration is applied when received, even before it is committed. That might be something at play here.
I'll need to dig a little more.
This is quite tricky. Imagine a scenario with nodes A and B, where A is the leader. If A removes itself and shuts down before B applies the command, the system becomes unavailable, because node B still thinks the members are {A, B} and needs a majority of 2, which it can no longer reach.
In one approach, we could delay the command completion until the leader sees that a majority of followers have applied the command locally. Note that this only delays the CompletableFuture completion and the return to the client; the command is still applied in the same order. The leader would know it was removed but keep replicating until a majority applies the command to their local state machines. Only then would the node step down, trigger the election round, and complete the CF. This would eliminate any possibility of unavailability.
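Something along these lines (purely a sketch of this approach; `log`, `onMajorityApplied()`, `stepDown()`, `startVotingThread()` and `RemoveServerCommand` are hypothetical helpers, not existing code):

```java
// Sketch: append and replicate as usual, but only complete the client's
// CompletableFuture once a majority of nodes has applied the entry.
CompletableFuture<Void> removeServer(Address leaver) {
    CompletableFuture<Void> done = new CompletableFuture<>();
    long index = log.append(new RemoveServerCommand(leaver));  // replicated in the usual order
    onMajorityApplied(index, () -> {                           // fires once a majority has applied the entry
        if (leaver.equals(localAddress)) {
            stepDown();                                        // the leader now knows it was removed
            startVotingThread();                               // trigger the election round
        }
        done.complete(null);                                   // only now return to the client
    });
    return done;
}
```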
Another approach follows the dissertation: the nodes adopt the new configuration as soon as they know about it. This would still require more changes. There is a bug in this approach [1], corrected by making a new leader commit an internal no-op entry before applying any membership change. But honestly, I can't see how that would help in a scenario like the example I gave.
The last approach I see takes advantage of JGroups. The operation follows the usual path. After the operation is committed, the leader steps down (which stops the multicast of AppendEntries to the followers) and starts the voting thread. Once a new leader is elected, it resumes the multicast of AppendEntries, which includes the operation removing the previous leader that the followers have not yet applied. Since we utilize the multicast from JGroups, the previous leader still receives the request. We can change the leader's handling of append responses to ignore non-members when counting. This means the previous leader would still reply to requests, even though it is outside the Raft member list, until the membership operation is applied on the new leader. However, the system becomes unavailable if the old leader shuts down before the new leader applies the membership operation!
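Roughly, the response handling would become something like this; `raftMembers`, `matchIndexes` and `majorityMatchIndex()` are stand-ins for the real structures:

```java
// Sketch: discard append responses from nodes outside the Raft member list, so the
// removed leader can keep answering over the JGroups multicast without ever
// counting towards the majority.
void handleAppendResponse(Address sender, long matchIndex) {
    if (!raftMembers.contains(sender))
        return;                                          // e.g. the old leader after its removal
    matchIndexes.put(sender, matchIndex);
    long newCommit = majorityMatchIndex(matchIndexes);   // highest index replicated on a majority
    if (newCommit > commitIndex)
        commitIndex = newCommit;                         // advance the commit index as usual
}
```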
I would argue that a reconfiguration command is only complete after the followers apply it to their state machines. I still need to think about it. Do you have any ideas? Hopefully, I am just overthinking. Reconfiguration is a PITA :face_with_head_bandage:.
[1] https://groups.google.com/g/raft-dev/c/t4xj6dJTP6E/m/d2D9LrWRza8J
Applied changes to implement the last option, taking advantage of the multicast. The leader now discards responses from non-members, so they don't count towards the majority. My contrived scenario above would fail for any operation, not only membership changes, and since the entry was committed, the failure does not mean data loss.
In addition to the changes, I added some warnings to the documentation. This should be ready now.
I'm pretty swamped with work and won't have time to look into this in the near future. Ask yourself a few questions:
- Is this a real bug that affects correctness? Or is it just a performance fix?
- How often does this occur? Pretty regularly or only in edge cases?
- You mentioned 'tricky' and 'complicated': is the fix itself going to complicate code, a.k.a. introduce new bugs / impact legibility and maintenance of the code?
If you introduce complexity for no obvious gain (correctness! we don't care that much about performance), then I'd not do it. Otherwise, go for it!
Thanks for raising these points, Bela; they gave me some ideas. Unfortunately, this is both a liveness and a safety issue. In the current main version, removing the leader doesn't trigger a new leader election, and the (removed) leader can keep replicating entries (even though it is no longer a Raft member at that point!). However, this is a rare enough event: it happens only when an operator executes a reconfiguration that removes the current leader.
I'll try adding a few more tests and keep the changes as minimal as possible to reduce the surface for new bugs.