jgroups-raft Handle leader leaving during election procedure

Borrowing ideas from #284. We check whether the elected leader is still present in the current view. If the leader has left, the voting thread continues running, collecting proposals from the new view.

The first commit includes a test for the case the leader left before the coordinator applied the result locally. The test reproduces the issue described in #280.

The third commit contains a check when handling the LeaderElected message. The node verifies if the leader is present in the current view.

Jun 21 '24 01:06 jabolina

Multi-threads problems are very hard to explain, please see my comment, I don't know if it is clear.

protected void runVotingProcess() {
        if (!isMajorityAvailable()) { // TODO access volatile view separately
            if (log.isDebugEnabled())
                log.debug("%s: majority (%d) not available anymore (%s), stopping thread", local_addr, raft.majority(), view);

            // TODO if reach majority at this point, and startVotingThread won't start because it's running.

            stopVotingThread();
            return;
        }

        View electionView = this.view;
        long new_term=raft.createNewTerm();
        votes.reset(electionView.getMembersRaw());
        num_voting_rounds++;
        long start=System.currentTimeMillis();
        sendVoteRequest(new_term);

        votes.waitForAllResponses(vote_timeout);
        long time=System.currentTimeMillis()-start;

        int majority=raft.majority();
        if(votes.numberOfValidResponses() >= majority) {

            // TODO At this point this.view could be any thing even a single node view, the elected leader could be unqualified(Don't have the longest log for example)
            Address leader=determineLeader(); // TODO access volatile view separately

            log.trace("%s: collected votes from %s in %d ms (majority=%d) -> leader is %s (new_term=%d)",
                    local_addr, votes.getResults(), time, majority, leader, new_term);

            raft.setLeaderAndTerm(leader, new_term);
            sendLeaderElectedMessage(leader, new_term);

            synchronized (this) {
                if (isLeaderInView(leader, view)) // TODO without checking the majority if reached
                    stopVotingThread();
                else if (log.isTraceEnabled())
                    log.trace("%s: leader (%s) not in view anymore, retrying", local_addr, leader);
            }
        }
        else
            log.trace("%s: collected votes from %s in %d ms (majority=%d); starting another voting round",
                    local_addr, votes.getValidResults(), time, majority);
    }

    protected void handleView(View v) {
        // TODO get the leader before assign the new view for the potential running voting thread
        // TODO assume there is running voting thread got the elected leader (not in the new view) and haven't applied it to raft yet
        Majority result=Utils.computeMajority(view, v, raft().majority(), raft.leader());

        log.debug("%s: existing view: %s, new view: %s, result: %s", local_addr, this.view, v, result);
        List<Address> joiners=View.newMembers(this.view, v);
        boolean has_new_members=joiners != null && !joiners.isEmpty();
        boolean coordinatorChanged = Utils.viewCoordinatorChanged(this.view, v);

        // TODO before assign the new view, voting thread didn't detect the leader has left, in the meantime the result of computeMajority is "no_change"
        this.view=v;

        switch(result) {
            case no_change:
                // the leader resends its term/address for new members to set the term/leader.
                if(raft.isLeader() && has_new_members)
                    sendLeaderElectedMessage(raft.leader(), raft.currentTerm());

                // Handle cases where the previous coordinator left *before* a leader was elected.
                // See: https://github.com/jgroups-extras/jgroups-raft/issues/259
                else if (coordinatorChanged && isViewCoordinator() && isMajorityAvailable() && raft.leader() == null)
                    startVotingThread();
                break;
            case reached:
            case leader_lost:
                // In case the leader is lost, we stop everything *before* starting again.
                // This avoids cases where the leader is lost before the voting mechanism has stopped.
                // See: https://github.com/jgroups-extras/jgroups-raft/issues/259
                if(isViewCoordinator()) {
                    log.trace("%s: starting voting process (reason: %s, view: %s)", local_addr, result, view);
                    startVotingThread();
                }
                break;
            case lost:
                stopVotingThread(); // TODO unstable
                raft.setLeaderAndTerm(null);
                break;
        }
    }

Jun 21 '24 08:06 yfei-z

About unset the leader under lost majority case, please check my comments: https://github.com/jgroups-extras/jgroups-raft/issues/280#issuecomment-2177641988 and https://github.com/jgroups-extras/jgroups-raft/pull/284#issuecomment-2180004605

Jun 21 '24 08:06 yfei-z

@yfei-z, amazing work! I'll try to get some tests for the comments this weekend. If the work becomes "patching holes", I believe we're better off serializing the handling of view updates. But I have a few ideas I want to try once the tests are in place.

Jun 21 '24 16:06 jabolina

I think my pull request has resolved them all

Jun 24 '24 02:06 yfei-z

I've pushed with some more tests and covered the comments.

Instead of checking for the majority at the beginning of the voting thread, we let the concrete implementation handle it, and check if the thread is interrupted to exit earlier;
After determining the leader from the collected votes, we acquire the intrinsic lock (other threads can not start/stop the voting thread). Holding the lock, we check if the view still has a majority (if not stop, and set leader null) and if the leader is still in view (if so, stop the voting thread). Otherwise, the voting thread keeps running, until the leader is elected or the thread is interrupted.
Check the view before accepting the LeaderElected message.

In point 1, I added that check thinking it would be a safeguard, but it brought more difficulty. We could even remove the interrupt check. In point 2, as long as we have the majority, we'll elect a qualified leader. The majority guarantees there is an overlap with committed entries.

Let me know what you think. After this is done, we can look into the other issue with LevelDB.

Jun 30 '24 21:06 jabolina

@jabolina

    protected void runVotingProcess() {
        if (Thread.interrupted()) return; // TODO Right after the running status has be checked, not so useful.

        View electionView = this.view;

        // TODO No majority check, may initiate an unnecessary invalid voting round

        long new_term=raft.createNewTerm();
        votes.reset(electionView.getMembersRaw());
        num_voting_rounds++;
        long start=System.currentTimeMillis();
        sendVoteRequest(new_term);

        votes.waitForAllResponses(vote_timeout); // TODO It can't be interrupted before timeout. votes.reset() could end the waiting but can't ensure the thread is terminated.
        long time=System.currentTimeMillis()-start;

        int majority=raft.majority();
        if(votes.numberOfValidResponses() >= majority) {
            Address leader=determineLeader(); // TODO Access view separately, may choose an unqualified leader
            log.trace("%s: collected votes from %s in %d ms (majority=%d) -> leader is %s (new_term=%d)",
                    local_addr, votes.getResults(), time, majority, leader, new_term);

            raft.setLeaderAndTerm(leader, new_term);
            sendLeaderElectedMessage(leader, new_term); // send to all - self

            synchronized (this) {
                if (!isMajorityAvailable()) {
                    log.trace("%s: majority lost (%s) before elected (%s)", local_addr, view, leader);
                    stopVotingThread(); // TODO Self stop and join, always wait until timeout.
                    raft.setLeaderAndTerm(null);
                    return;
                }

                if (isLeaderInView(leader, view)) {
                    stopVotingThread();
                    return;
                }

                if (log.isTraceEnabled())
                    log.trace("%s: leader (%s) not in view anymore, retrying", local_addr, leader);
            }
        } else if (log.isTraceEnabled())
            log.trace("%s: collected votes from %s in %d ms (majority=%d); starting another voting round",
                    local_addr, votes.getValidResults(), time, majority);
    }


    protected void handleView(View v) {
        View previousView = this.view;
        this.view = v;
        Majority result=Utils.computeMajority(previousView, v, raft().majority(), raft.leader());
        log.debug("%s: existing view: %s, new view: %s, result: %s", local_addr, previousView, v, result);
        List<Address> joiners=View.newMembers(previousView, v);
        boolean has_new_members=joiners != null && !joiners.isEmpty();
        boolean coordinatorChanged = Utils.viewCoordinatorChanged(previousView, v);
        switch(result) {
            case no_change:
                if(raft.isLeader() && has_new_members)
                    sendLeaderElectedMessage(raft.leader(), raft.currentTerm());

                else if (coordinatorChanged && isViewCoordinator() && isMajorityAvailable() && raft.leader() == null)
                    startVotingThread();
                break;
            case reached:
            case leader_lost:
                if(isViewCoordinator()) {
                    log.trace("%s: starting voting process (reason: %s, view: %s)", local_addr, result, view);
                    startVotingThread();
                }
                break;
            case lost:
                // TODO The VIEW message will be sent after this in coordinator, and the VIEW message will trigger the
                // TODO view change event and come to here again in participants to unset the leader.
                // TODO Assuming the stopVotingThread is timeout, then the VIEW message is sent, and the running voting thread
                // TODO send the leader after the VIEW message, the participant will unset the leader first then set the false leader by leader message
                stopVotingThread(); // TODO Need to make sure it is stopped (leader message won't be sent after this)
                raft.setLeaderAndTerm(null);
                break;
        }
    }

Jul 05 '24 10:07 yfei-z

Sorry for the delay, I was a bit sidetracked. I've pushed one last change with some tests. This should be handling all of the problems above.

The Thread.interrupted() test is just a best-effort optimization to identify when the thread is interrupted somewhere. In safety terms, there are no problems in case of unnecessary terms and the coordinator (or followers) have a higher term. Participants might be completely unaware of terms in the cluster.
The BaseElection#determineLeader method utilizing the class view instead of the election view also causes no problem. As long the coordinator receives a majority of responses, it should not be possible to elect an unqualified leader violating the election restriction. If the cluster loses the majority of members, we verify it while holding the intrinsic lock. We stop the thread and clear everything if we lose the majority, and the thread continues running if only the leader has left.
The LeaderElected message should be handled. We utilize the view to verify if the installation is permitted. If the node accepts the message before a new view, it should be OK due to the handleView method. If a new view is installed, and later the node receives the message, we verify if the leader is still in the view with a majority in place. This should avoid installing the leader of a previous term later in time.

This should be handling all the issues mentioned.

Jul 28 '24 15:07 jabolina

Verifying the view and leader when participant nodes received the LeaderElected message seems to have solved all problems, participant nodes could handle the election result correctly no matter LeaderElected and VIEW message who arrive first. At first I was a little concerned that the verification in handleLeaderElected was not atomic, it could goes wrong if VIEW message and LeaderElected message are processed concurrently when the TP.message_processing_policy is "submit", but seems NAKACK2 could detect the existing processing thread and make sure those messages will be processed serially.

Aug 08 '24 04:08 yfei-z

Yeah, I counted on the message and view not being delivered concurrently. IIRC, it comes from virtual synchrony. Glad you're able to test it, many thanks! I have some work to finish on my side, once that's done I'll start looking into #281 before a new release.

Aug 08 '24 13:08 jabolina