async-raft

Removing Voter and NonVoter nodes

Open battila7 opened this issue 3 years ago • 9 comments

Hi,

I'm in the process of implementing an application based on async-raft, and I ran into a problem regarding cluster membership.

Adding NonVoters and then promoting them to Voters works without any hiccups. Unfortunately, however, I fail to understand how node removal works.

When removing a Voter node via change_membership, the removed node's state is changed to NonVoter, and a replication stream from the cluster Leader still remains. Therefore, when I shut down the removed node (which is, at that point, in the NonVoter state), the Leader floods the console with errors about timed-out RPCs. I tried changing the replication_lag_threshold and the max_payload_entries in the hope of the Leader giving up, but I had no luck. A related problem is the removal of a NonVoter node that was never promoted to Voter: I simply don't know how to shut down the replication stream from the Leader to the NonVoter.

Can you please explain the correct way of completely removing Voters or NonVoters?

battila7 avatar Apr 17 '21 19:04 battila7

Hello @battila7, thanks for the report. Here is the code responsible for dropping the replication streams: https://github.com/async-raft/async-raft/blob/master/async-raft/src/core/replication.rs#L117-L136

It is certainly possible that there is a bug in this code, though we do have tests covering these cases. To me, this sounds like a timing issue where perhaps the node being removed is shut down too early, before the leader is able to detect that the config change has been committed. To codify the requirement:

  • The node being removed from the cluster should not be shut down until the Raft leader has committed the configuration change which removes that node from the cluster.

If the node shuts down as soon as it goes into the NonVoter state, then the behavior you've described is exactly what I would expect to see. As I said, there certainly could be a bug ... so definitely let me know what you think about the above statements, and I'm happy to dig in a bit deeper.
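
To make the sequencing concrete, here is a minimal sketch of what I mean (D, R, N and S stand for your application's type parameters; the function itself is illustrative, not something shipped by this crate):

use anyhow::Result;
use async_raft::{AppData, AppDataResponse, Raft, RaftNetwork, RaftStorage};
use maplit::hashset;

// Sketch only: removes node 2 from a {0, 1, 2} cluster and marks the point at
// which shutting node 2 down becomes safe.
async fn remove_node_2<D, R, N, S>(leader: &Raft<D, R, N, S>) -> Result<()>
where
    D: AppData,
    R: AppDataResponse,
    N: RaftNetwork<D>,
    S: RaftStorage<D, R>,
{
    // change_membership resolves only once the leader has committed the final
    // (uniform) configuration that no longer contains node 2 ...
    leader.change_membership(hashset![0, 1]).await?;
    // ... so only after this await is it safe for the application to stop
    // node 2's process. Stopping it earlier leaves the leader retrying RPCs
    // against a peer that is gone.
    Ok(())
}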

Also, take a look at #101, specifically the bit on updating to the latest spec's configuration change protocol. Long term, I do want to update the protocol to use only the latest version of the spec, as it is conceptually quite a bit simpler. Thoughts?


As an aside, I would like to revisit that error logging statement in the replication stream, because I've seen plenty of times where the console gets flooded with error messages due to similar issues.

thedodd avatar Apr 19 '21 18:04 thedodd

Hi, @thedodd , thanks for the quick reply!

Let me walk you through what I do.

First, I spin up Node 0 and create a single-node cluster by calling initialize on it (term 1, last_log_index 1). Next, I fire up two additional nodes, Node 1 and Node 2, and add both of them as non-voters. After I get the signal that they're all synced up, I add both of them as cluster members, one by one, using change_membership (see the sketch after the metrics below). The metrics are now as follows:

  • Node 0:
    • State: Leader,
    • Term: 1,
    • Last log index: 5.
  • Node 1:
    • State: Follower,
    • Term: 1,
    • Last log index: 5.
  • Node 2:
    • State: Follower,
    • Term: 1,
    • Last log index: 5.
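
For reference, the setup boils down to roughly the following sketch (AppRaft is a placeholder alias for our concrete async_raft::Raft type; network and storage wiring are omitted):

use anyhow::Result;
use maplit::hashset;

// Sketch of the formation sequence described above; AppRaft is a placeholder
// for the application's concrete Raft type, not a real name in async-raft.
async fn form_cluster(node0: &AppRaft) -> Result<()> {
    // 1. Single-node cluster on Node 0 (term 1, last log index 1).
    node0.initialize(hashset![0]).await?;
    // 2. Add Nodes 1 and 2 as non-voters and wait for them to sync up.
    node0.add_non_voter(1).await?;
    node0.add_non_voter(2).await?;
    // 3. Promote them to voting members one by one. Each change_membership
    //    appends a joint and a uniform config entry, which is why the last
    //    log index ends up at 5.
    node0.change_membership(hashset![0, 1]).await?;
    node0.change_membership(hashset![0, 1, 2]).await?;
    Ok(())
}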

Everything's fine so far. Now let's say I perform a few operations, the same way it would happen in reality. All three logs get updated, and everyone's on the same page at last log index 10.

Now let's say I want to perform an update on Node 2 and thus want to take it out of the cluster. I call change_membership with nodes 0 and 1. The metrics are now as follows:

  • Node 0:
    • State: Leader,
    • Term: 1,
    • Last log index: 12.
  • Node 1:
    • State: Follower,
    • Term: 1,
    • Last log index: 12.
  • Node 2:
    • State: NonVoter,
    • Term: 1,
    • Last log index: 12.

So far, so good, it seems. Now, the code snippet you linked (in replication.rs) assumes that state.remove_after_commit holds a Some value. If my understanding is correct, state.remove_after_commit should be set here: https://github.com/async-raft/async-raft/blob/master/async-raft/src/core/admin.rs#L254

However, if I debug/log the enclosing handle_uniform_consensus_committed function, I find that self.nodes is empty. Thus, state.remove_after_commit stays None, and we're toast: no matter how many operations I perform later, the replication stream will never be closed, because it never learns when to close. I was able to reproduce this case consistently.

When this method works correctly, i.e. we actually enter the closure passed to filter_map, removal does happen.
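
To spell out my reading of that flow, here is a condensed paraphrase of the linked function (heavily simplified, so the details may not match the source exactly):

// Condensed paraphrase of handle_uniform_consensus_committed as I read it; the
// real function does more, this keeps only the part relevant to this issue.
pub(super) async fn handle_uniform_consensus_committed(&mut self, index: u64) -> Result<(), RaftError> {
    // The candidates for removal are derived from self.nodes only ...
    let nodes_to_remove: Vec<NodeId> = self
        .nodes
        .keys()
        .copied()
        .filter(|id| !self.core.membership.members.contains(id))
        .collect();
    for node in nodes_to_remove {
        if let Some(state) = self.nodes.get_mut(&node) {
            // ... and this is where remove_after_commit gets its Some value,
            // which the code in replication.rs later uses to drop the stream.
            state.remove_after_commit = Some(index);
        }
    }
    // If self.nodes is empty, the loop never runs, remove_after_commit stays
    // None, and the stream to the removed node is never closed.
    Ok(())
}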

Please let me know if I'm missing something, which can definitely happen :)


Regarding #101: I'm a big fan of the updates related to membership changes. The AddServer and RemoveServer RPCs might offer an easier way of changing cluster members. Also, a more streamlined API for Non-Voters would go a long way. In our use case, we would most likely want to introduce Non-Voter nodes just for the sake of replication (some sort of change data capture solution). Therefore, easy addition/removal would help a lot.


Sure, that error logging statement is quite brutal, so turning it down a bit somehow would definitely make sense.


Please let me know if I can help you with any additional debugging or even contributions. As we're planning to go into production with Raft, ensuring its correctness and reliability is key for us!

battila7 avatar Apr 20 '21 11:04 battila7

However, if I debug/log the enclosing handle_uniform_consensus_committed function, I get that self.nodes is empty.

Hmmm. That sounds rather problematic. self.nodes should not be empty. We have tests covering dynamic cluster membership. Are you able to reproduce this same behavior in a test case?

As for removing a NonVoter, it should work exactly the same way as removing any other node. Perhaps there is an issue in your network layer implementation where the configuration is not being transferred properly. Thoughts?

@battila7 I definitely want to help, I just need to gather a little more info. If we can reproduce the error, perhaps using one of our test cases in this repo, that would be a great help!

thedodd avatar May 19 '21 04:05 thedodd

@thedodd

Thanks for looking into this one again!

@drmingdrmer Sorry for dragging you into this issue, but have you ever experienced something like this?

Here is an ugly test case demonstrating the issue (I modified the dynamic membership test a little):

mod fixtures;

use std::sync::Arc;
use std::time::Duration;

use anyhow::Result;
use async_raft::Config;
use futures::stream::StreamExt;
use maplit::hashset;
use tokio::time::sleep;

use fixtures::RaftRouter;

#[tokio::test(flavor = "multi_thread", worker_threads = 6)]
async fn should_not_replicate_to_removed_nodes() -> Result<()> {
    fixtures::init_tracing();

    // Setup test dependencies.
    let config = Arc::new(Config::build("test".into()).validate().expect("failed to build Raft config"));
    let router = Arc::new(RaftRouter::new(config.clone()));
    router.new_raft_node(0).await;

    // Assert all nodes are in non-voter state & have no entries.
    sleep(Duration::from_secs(3)).await;
    router.assert_pristine_cluster().await;

    // Initialize the cluster, then assert that a stable cluster was formed & held.
    tracing::info!("--- initializing cluster");
    router.initialize_from_single_node(0).await?;
    sleep(Duration::from_secs(3)).await;
    router.assert_stable_cluster(Some(1), Some(1)).await;

    // Sync some new nodes.
    router.new_raft_node(1).await;
    router.new_raft_node(2).await;
    router.new_raft_node(3).await;
    router.new_raft_node(4).await;
    tracing::info!("--- adding new nodes to cluster");
    let mut new_nodes = futures::stream::FuturesUnordered::new();
    new_nodes.push(router.add_non_voter(0, 1));
    new_nodes.push(router.add_non_voter(0, 2));
    new_nodes.push(router.add_non_voter(0, 3));
    new_nodes.push(router.add_non_voter(0, 4));
    while let Some(inner) = new_nodes.next().await {
        inner?;
    }
    tracing::info!("--- changing cluster config");
    router.change_membership(0, hashset![0, 1, 2, 3, 4]).await?;
    sleep(Duration::from_secs(5)).await;
    router.assert_stable_cluster(Some(1), Some(3)).await; // Still in term 1, so leader is still node 0.

    // Send some requests
    router.client_request_many(0, "client", 100).await;
    // Chill out
    sleep(Duration::from_secs(5)).await;

    // Remove Node 4
    router.change_membership(0, hashset![0, 1, 2, 3]).await?;
    let removed_metrics = router.latest_metrics().await
        .into_iter()
        .find(|e| e.id == 4)
        .unwrap();
    assert_eq!(removed_metrics.state, async_raft::State::NonVoter, "The removed node should be a Non-Voter after being removed.");
    // Send some requests
    router.client_request_many(0, "client", 100).await;
    // Chill out
    sleep(Duration::from_secs(5)).await;

    let removed_metrics = router.latest_metrics().await
        .into_iter()
        .find(|e| e.id == 4)
        .unwrap();

    let leader_metrics = router.latest_metrics().await
        .into_iter()
        .find(|e| e.id == 0)
        .unwrap();

    assert_ne!(removed_metrics.last_applied, leader_metrics.last_applied, "At some point, replication should have stopped.");

    Ok(())
}

After node 4 is removed, the replication stream should abort, right? However, node 4 still receives all the changes, despite being out of the cluster. Consequently, if we shut down node 4, the console explodes with error messages.

battila7 avatar May 19 '21 15:05 battila7

@battila7 I cleaned up the snippet and made it a test case that should not pass: #119

@thedodd #119 would be a handy snippet for you to examine what actually happened.

drmingdrmer avatar May 20 '21 16:05 drmingdrmer

@drmingdrmer Thanks for putting in the effort and creating a cleaned up version! It displays the issue quite well.

battila7 avatar May 20 '21 17:05 battila7

@battila7

  • When a leader is established, the replication state is added to LeaderState.nodes.
  • When a NonVoter is added, the replication state is added to LeaderState.non_voters.

When a membership change is committed, only the replication states in nodes are removed. But if leadership has never been transferred to another node since the NonVoter was added, nodes is always empty. Thus the replication to the removed node does not stop.
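
A toy model of the structure involved (heavily simplified; ToyLeaderState and ReplicationStream are stand-ins, only meant to show why the removal pass misses the stream):

use std::collections::{BTreeMap, HashSet};

// Stand-in for the real per-peer replication handle.
struct ReplicationStream;

// Toy model of LeaderState: two separate maps of replication streams.
struct ToyLeaderState {
    nodes: BTreeMap<u64, ReplicationStream>,      // filled when this node becomes leader
    non_voters: BTreeMap<u64, ReplicationStream>, // filled by add_non_voter
}

impl ToyLeaderState {
    fn handle_membership_committed(&mut self, members: &HashSet<u64>) {
        // Only `nodes` is scanned for peers that have left the membership ...
        let to_remove: Vec<u64> = self
            .nodes
            .keys()
            .copied()
            .filter(|id| !members.contains(id))
            .collect();
        for id in to_remove {
            self.nodes.remove(&id); // ... and only their streams are dropped.
        }
        // A peer whose stream still sits in `non_voters` (because this node was
        // already leader when the peer was added, and leadership never changed)
        // is never touched, so its replication stream keeps running.
    }
}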

This is the logging I added and what I got when running the remove_voter_non_voter case:

impl<'a, D: AppData, R: AppDataResponse, N: RaftNetwork<D>, S: RaftStorage<D, R>> LeaderState<'a, D, R, N, S> {

...

pub(super) async fn handle_uniform_consensus_committed(&mut self, index: u64) -> Result<(), RaftError> {

...

+
+        let ids:Vec<u64> = self.nodes.keys().cloned().collect();
+        tracing::info!("nodes: {:?}", ids);
+        tracing::info!("membership: {:?}", self.core.membership);
+        tracing::info!("nodes_to_remove: {:?}", nodes_to_remove);
         for node in nodes_to_remove {
May 21 12:49:11.981  INFO main{id=0 cluster=test}:run{id=0 raft_state="leader"}:handle_uniform_consensus_committed{index=3}: async_raft::core::admin: membership: MembershipConfig { members: {1, 2, 3, 4, 0}, members_after_consensus: None }
May 21 12:49:11.982  INFO main{id=0 cluster=test}:run{id=0 raft_state="leader"}:handle_uniform_consensus_committed{index=3}: async_raft::core::admin: nodes_to_remove: []
May 21 12:49:11.982 TRACE main{id=0 cluster=test}:run{id=0 raft_state="leader"}:handle_uniform_consensus_committed{index=3}:report_metrics: async_raft::core: new
May 21 12:49:11.982 TRACE main{id=0 cluster=test}:run{id=0 raft_state="leader"}:handle_uniform_consensus_committed{index=3}:report_metrics: async_raft::core: enter
May 21 12:49:11.982 TRACE main{id=0 cluster=test}:run{id=0 raft_state="leader"}:handle_uniform_consensus_committed{index=3}:report_metrics: async_raft::core: exit
May 21 12:49:11.982 TRACE main{id=0 cluster=test}:run{id=0 raft_state="leader"}:handle_uniform_consensus_committed{index=3}:report_metrics: async_raft::core: close time.busy=178µs time.idle=362µs
May 21 12:49:11.983 TRACE main{id=0 cluster=test}:run{id=0 raft_state="leader"}:handle_uniform_consensus_committed{index=3}: async_raft::core::admin: exit
May 21 12:49:11.983 TRACE main{id=0 cluster=test}:run{id=0 raft_state="leader"}:handle_uniform_consensus_committed{index=3}: async_raft::core::admin: close time.busy=1.35ms time.idle=240µs
May 21 12:49:12.542 TRACE main{id=0 cluster=test}:run{id=0 raft_state="leader"}:handle_uniform_consensus_committed{index=105}: async_raft::core::admin: new
May 21 12:49:12.542 TRACE main{id=0 cluster=test}:run{id=0 raft_state="leader"}:handle_uniform_consensus_committed{index=105}: async_raft::core::admin: enter
May 21 12:49:12.543  INFO main{id=0 cluster=test}:run{id=0 raft_state="leader"}:handle_uniform_consensus_committed{index=105}: async_raft::core::admin: nodes: []
May 21 12:49:12.543  INFO main{id=0 cluster=test}:run{id=0 raft_state="leader"}:handle_uniform_consensus_committed{index=105}: async_raft::core::admin: membership: MembershipConfig { members: {0, 3, 2, 1}, members_after_consensus: None }
May 21 12:49:12.543  INFO main{id=0 cluster=test}:run{id=0 raft_state="leader"}:handle_uniform_consensus_committed{index=105}: async_raft::core::admin: nodes_to_remove: []

drmingdrmer avatar May 21 '21 05:05 drmingdrmer

@drmingdrmer

Exactly, this is the behavior I referred to when I wrote that self.nodes was empty. Correctly handling these two collections would solve the issue.

Unfortunately, I solved this issue alongside the single-node consensus algorithm, so I don't have an isolated fix. However, it's not a big effort, IIRC.

battila7 avatar May 21 '21 11:05 battila7

@battila7 && @drmingdrmer thanks for all of the great discussion and info here. I should have time to review this weekend.

thedodd avatar May 21 '21 13:05 thedodd