async-raft
Removing Voter and NonVoter nodes
Hi,
I'm in the process of implementing an application based on async-raft, and I ran into a problem regarding cluster membership.
Adding NonVoters and then promoting them to Voters works without any hiccups. Unfortunately, however, I fail to understand how node removal works.
When removing a Voter node via `change_membership`, the removed node's state is changed to NonVoter, and a replication stream from the cluster Leader still remains. Therefore, when I shut down the removed node (which is, at that point, in the NonVoter state), the Leader floods the console with errors about timed-out RPCs. I tried changing `replication_lag_threshold` and `max_payload_entries` in the hope of the Leader giving up, but I had no luck.
A related problem is the removal of a NonVoter node which has not been promoted to Voter: I simply don't know how to shut down the replication stream from the Leader to the NonVoter.
Can you please explain what is the correct way of completely removing Voters or NonVoters?
Hello @battila7, thanks for the report. Here is the code responsible for dropping the replication streams: https://github.com/async-raft/async-raft/blob/master/async-raft/src/core/replication.rs#L117-L136
It is certainly possible that there is a bug in this code, though we do have tests covering these cases. To me, this sounds like a timing issue where perhaps the node being removed is shut down too early, before the leader is able to detect that the config change has been committed. To codify the requirement:
- The node being removed from the cluster should not be shut down until the Raft leader has committed the configuration change which removes that node from the cluster.
If the node shuts down as soon as it goes into NonVoter, then the behavior you've described is exactly what I would expect to see. As I said, there certainly could be a bug ... so definitely let me know what you think about the above statements and I'm happy to dig in a bit deeper.
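To illustrate the ordering requirement with a minimal, self-contained sketch (the `Metrics` struct and `safe_to_shutdown` helper here are hypothetical; in async-raft you would observe the committed membership through the node's metrics): a node may only be shut down once the committed membership config no longer contains it.

```rust
use std::collections::HashSet;

// Hypothetical snapshot of a node's observed state; in async-raft this
// information would come from the node's reported metrics.
struct Metrics {
    committed_membership: HashSet<u64>,
}

// A node may only be shut down once the leader has committed a membership
// config that no longer contains it.
fn safe_to_shutdown(metrics: &Metrics, node_id: u64) -> bool {
    !metrics.committed_membership.contains(&node_id)
}

fn main() {
    let before = Metrics { committed_membership: [0u64, 1, 2].into_iter().collect() };
    let after = Metrics { committed_membership: [0u64, 1].into_iter().collect() };

    // Node 2 must keep running while the change is still uncommitted...
    assert!(!safe_to_shutdown(&before, 2));
    // ...and may stop once the committed config excludes it.
    assert!(safe_to_shutdown(&after, 2));
}
```

Shutting down at any earlier point leaves the leader replicating to a dead peer, which matches the flood of timed-out RPC errors described above.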
Also, take a look at #101, specifically the bit on updating to the latest spec's configuration-change protocol. Long-term, I do want to update the protocol to use only the latest version of the spec, as it is conceptually quite a bit simpler. Thoughts?
As an aside, I would like to revisit that error logging statement in the replication stream, because I've seen plenty of times where the console gets flooded with error messages due to similar issues.
Hi, @thedodd , thanks for the quick reply!
Let me walk you through what I do.
First, I spin up Node 0 and create a single-node cluster by calling `initialize` on it (term 1, last_log_index 1). Next, I fire up two additional nodes, Node 1 and Node 2, and add both of them as non-voters. After I get the signal that they're all synced up, I add both of them as cluster members, one by one, using `change_membership`. The metrics are now as follows:
- Node 0:
  - State: Leader
  - Term: 1
  - Last log index: 5
- Node 1:
  - State: Follower
  - Term: 1
  - Last log index: 5
- Node 2:
  - State: Follower
  - Term: 1
  - Last log index: 5
Everything's fine so far. Now let's say that I do a few operations, the same way it would happen in reality. All three logs get updated, everyone's on the same page, which is last log index 10.
Now, let's say that I want to perform an update on Node 2 and thus want to take it out. I call `change_membership` with nodes 0 and 1. The metrics are now as follows:
- Node 0:
  - State: Leader
  - Term: 1
  - Last log index: 12
- Node 1:
  - State: Follower
  - Term: 1
  - Last log index: 12
- Node 2:
  - State: NonVoter
  - Term: 1
  - Last log index: 12
Great, it seems. Now, the code snippet you've linked (in `replication.rs`) assumes that `state.remove_after_commit` is a Some value. If my understanding is correct, `state.remove_after_commit` should be set here: https://github.com/async-raft/async-raft/blob/master/async-raft/src/core/admin.rs#L254
However, if I debug/log the enclosing `handle_uniform_consensus_committed` function, I find that `self.nodes` is empty. Thus, `state.remove_after_commit` is None, and we're toast: no matter how many operations I perform later, the replication stream will never be closed, as it does not know when exactly to close. I was able to reproduce this case consistently.
When this method works correctly, i.e. we actually enter the closure passed to `filter_map`, removal does happen.
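For context, the handoff in `replication.rs` behaves roughly like this (a simplified model with hypothetical names, not the actual async-raft types): the replication state records the log index past which the target may be dropped, and the stream is torn down once the leader's commit index reaches it. If the flag is never set, the stream lives forever.

```rust
// Simplified model of the remove-after-commit handoff.
struct ReplicationState {
    // Set when a committed membership change removes this target node:
    // the replication stream should be dropped once the leader's commit
    // index reaches this value.
    remove_after_commit: Option<u64>,
}

fn should_drop_stream(state: &ReplicationState, leader_commit: u64) -> bool {
    match state.remove_after_commit {
        Some(index) => leader_commit >= index,
        // If the flag was never set (the situation described above), no
        // amount of later log activity will ever close the stream.
        None => false,
    }
}

fn main() {
    let flagged = ReplicationState { remove_after_commit: Some(12) };
    assert!(!should_drop_stream(&flagged, 11));
    assert!(should_drop_stream(&flagged, 12));

    let never_flagged = ReplicationState { remove_after_commit: None };
    assert!(!should_drop_stream(&never_flagged, u64::MAX));
}
```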
Please let me know if I'm missing something, which can definitely happen :)
Regarding #101: I'm a big fan of the updates related to membership changes. The `AddServer` and `RemoveServer` RPCs might offer an easier way of changing cluster members. Also, a more streamlined API for Non-Voters would go a long way.
In our use case, we would most likely want to introduce Non-Voter nodes just for the sake of replication (some sort of change data capture solution). Therefore, easy addition/removal would help a lot.
Sure, that error logging statement is quite brutal, so turning it down a bit somehow would definitely make sense.
Please let me know if I can help you with any additional debugging or even contributions. As we're planning to go into production with Raft, ensuring its correctness and reliability is key for us!
> However, if I debug/log the enclosing `handle_uniform_consensus_committed` function, I get that `self.nodes` is empty.
Hmmm. That sounds rather problematic. `self.nodes` should not be empty. We have tests covering dynamic cluster membership. Are you able to reproduce this same behavior in a test case?
As far as removing a NonVoter, it should work exactly the same way that removing any other node would work. Perhaps there is an issue in your network layer implementation where the configuration is not being transferred properly. Thoughts?
@battila7 I definitely want to help, just need to gather a little more info. If we can reproduce the error, perhaps using one of our test cases in this repo, that would be a great help!
@thedodd
Thanks for looking into this one again!
@drmingdrmer Sorry for dragging you into this issue, but have you ever experienced something like this?
Here is an ugly test case demonstrating the issue (modified the dynamic membership test a little):
```rust
mod fixtures;

use std::sync::Arc;
use std::time::Duration;

use anyhow::Result;
use async_raft::Config;
use futures::stream::StreamExt;
use maplit::hashset;
use tokio::time::sleep;

use fixtures::RaftRouter;

#[tokio::test(flavor = "multi_thread", worker_threads = 6)]
async fn should_not_replicate_to_removed_nodes() -> Result<()> {
    fixtures::init_tracing();

    // Setup test dependencies.
    let config = Arc::new(Config::build("test".into()).validate().expect("failed to build Raft config"));
    let router = Arc::new(RaftRouter::new(config.clone()));
    router.new_raft_node(0).await;

    // Assert all nodes are in non-voter state & have no entries.
    sleep(Duration::from_secs(3)).await;
    router.assert_pristine_cluster().await;

    // Initialize the cluster, then assert that a stable cluster was formed & held.
    tracing::info!("--- initializing cluster");
    router.initialize_from_single_node(0).await?;
    sleep(Duration::from_secs(3)).await;
    router.assert_stable_cluster(Some(1), Some(1)).await;

    // Sync some new nodes.
    router.new_raft_node(1).await;
    router.new_raft_node(2).await;
    router.new_raft_node(3).await;
    router.new_raft_node(4).await;

    tracing::info!("--- adding new nodes to cluster");
    let mut new_nodes = futures::stream::FuturesUnordered::new();
    new_nodes.push(router.add_non_voter(0, 1));
    new_nodes.push(router.add_non_voter(0, 2));
    new_nodes.push(router.add_non_voter(0, 3));
    new_nodes.push(router.add_non_voter(0, 4));
    while let Some(inner) = new_nodes.next().await {
        inner?;
    }

    tracing::info!("--- changing cluster config");
    router.change_membership(0, hashset![0, 1, 2, 3, 4]).await?;
    sleep(Duration::from_secs(5)).await;
    router.assert_stable_cluster(Some(1), Some(3)).await; // Still in term 1, so leader is still node 0.

    // Send some requests.
    router.client_request_many(0, "client", 100).await;

    // Chill out.
    sleep(Duration::from_secs(5)).await;

    // Remove node 4.
    router.change_membership(0, hashset![0, 1, 2, 3]).await?;
    let removed_metrics = router
        .latest_metrics()
        .await
        .into_iter()
        .find(|e| e.id == 4)
        .unwrap();
    assert_eq!(
        removed_metrics.state,
        async_raft::State::NonVoter,
        "The removed node should be a Non-Voter after being removed."
    );

    // Send some requests.
    router.client_request_many(0, "client", 100).await;

    // Chill out.
    sleep(Duration::from_secs(5)).await;

    let removed_metrics = router
        .latest_metrics()
        .await
        .into_iter()
        .find(|e| e.id == 4)
        .unwrap();
    let leader_metrics = router
        .latest_metrics()
        .await
        .into_iter()
        .find(|e| e.id == 0)
        .unwrap();
    assert_ne!(
        removed_metrics.last_applied,
        leader_metrics.last_applied,
        "At some point, replication should have stopped."
    );

    Ok(())
}
```
After `4` is removed, the replication stream should abort, right? However, `4` still got all the changes, despite being out of the cluster. Consequently, if we shut down `4`, the console will explode with error messages.
@battila7 I cleaned up the snippet and made it a test case that should not pass: #119
@thedodd #119 would be a handy snippet for you to examine what actually happened.
@drmingdrmer Thanks for putting in the effort and creating a cleaned up version! It displays the issue quite well.
@battila7
- When a leader is established, the replication state is added to `LeaderState.nodes`.
- When a NonVoter is added, the replication state is added to `LeaderState.non_voters`.

When a membership change is committed, only the replication states in `nodes` are removed.
But if leadership has never been transferred to another node since the NonVoter was added, `nodes` is always empty.
Thus the replication to the removed node does not stop.
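The bookkeeping issue above can be sketched in a self-contained way (the `Leader` struct and both helper functions are hypothetical stand-ins for `LeaderState`'s real fields and removal logic): removal only scans `nodes`, so a replica still tracked in `non_voters` is never selected for removal.

```rust
use std::collections::{HashMap, HashSet};

// Minimal stand-in for LeaderState's replication bookkeeping.
struct Leader {
    nodes: HashMap<u64, ()>,      // replication states for voters
    non_voters: HashMap<u64, ()>, // replication states for non-voters
}

impl Leader {
    // Mirrors the buggy selection: only `nodes` is consulted, so a replica
    // that still lives in `non_voters` can never be picked for removal.
    fn nodes_to_remove(&self, membership: &HashSet<u64>) -> Vec<u64> {
        self.nodes.keys().copied().filter(|id| !membership.contains(id)).collect()
    }

    // One possible fix: consult both maps when a membership change commits.
    fn nodes_to_remove_fixed(&self, membership: &HashSet<u64>) -> Vec<u64> {
        self.nodes
            .keys()
            .chain(self.non_voters.keys())
            .copied()
            .filter(|id| !membership.contains(id))
            .collect()
    }
}

fn main() {
    // Node 4 was added as a non-voter and the leader never changed, so its
    // replication state is still tracked under `non_voters`.
    let leader = Leader {
        nodes: HashMap::new(),
        non_voters: HashMap::from([(4, ())]),
    };
    let membership: HashSet<u64> = [0, 1, 2, 3].into_iter().collect();

    assert!(leader.nodes_to_remove(&membership).is_empty()); // bug: 4 is never removed
    assert_eq!(leader.nodes_to_remove_fixed(&membership), vec![4]);
}
```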
This is the logging I added and what I got when running the `remove_voter_non_voter` case:

```diff
impl<'a, D: AppData, R: AppDataResponse, N: RaftNetwork<D>, S: RaftStorage<D, R>> LeaderState<'a, D, R, N, S> {
    ...
    pub(super) async fn handle_uniform_consensus_committed(&mut self, index: u64) -> Result<(), RaftError> {
        ...
+
+       let ids: Vec<u64> = self.nodes.keys().cloned().collect();
+       tracing::info!("nodes: {:?}", ids);
+       tracing::info!("membership: {:?}", self.core.membership);
+       tracing::info!("nodes_to_remove: {:?}", nodes_to_remove);
        for node in nodes_to_remove {
```

```
May 21 12:49:11.981 INFO main{id=0 cluster=test}:run{id=0 raft_state="leader"}:handle_uniform_consensus_committed{index=3}: async_raft::core::admin: membership: MembershipConfig { members: {1, 2, 3, 4, 0}, members_after_consensus: None }
May 21 12:49:11.982 INFO main{id=0 cluster=test}:run{id=0 raft_state="leader"}:handle_uniform_consensus_committed{index=3}: async_raft::core::admin: nodes_to_remove: []
May 21 12:49:11.982 TRACE main{id=0 cluster=test}:run{id=0 raft_state="leader"}:handle_uniform_consensus_committed{index=3}:report_metrics: async_raft::core: new
May 21 12:49:11.982 TRACE main{id=0 cluster=test}:run{id=0 raft_state="leader"}:handle_uniform_consensus_committed{index=3}:report_metrics: async_raft::core: enter
May 21 12:49:11.982 TRACE main{id=0 cluster=test}:run{id=0 raft_state="leader"}:handle_uniform_consensus_committed{index=3}:report_metrics: async_raft::core: exit
May 21 12:49:11.982 TRACE main{id=0 cluster=test}:run{id=0 raft_state="leader"}:handle_uniform_consensus_committed{index=3}:report_metrics: async_raft::core: close time.busy=178µs time.idle=362µs
May 21 12:49:11.983 TRACE main{id=0 cluster=test}:run{id=0 raft_state="leader"}:handle_uniform_consensus_committed{index=3}: async_raft::core::admin: exit
May 21 12:49:11.983 TRACE main{id=0 cluster=test}:run{id=0 raft_state="leader"}:handle_uniform_consensus_committed{index=3}: async_raft::core::admin: close time.busy=1.35ms time.idle=240µs
May 21 12:49:12.542 TRACE main{id=0 cluster=test}:run{id=0 raft_state="leader"}:handle_uniform_consensus_committed{index=105}: async_raft::core::admin: new
May 21 12:49:12.542 TRACE main{id=0 cluster=test}:run{id=0 raft_state="leader"}:handle_uniform_consensus_committed{index=105}: async_raft::core::admin: enter
May 21 12:49:12.543 INFO main{id=0 cluster=test}:run{id=0 raft_state="leader"}:handle_uniform_consensus_committed{index=105}: async_raft::core::admin: nodes: []
May 21 12:49:12.543 INFO main{id=0 cluster=test}:run{id=0 raft_state="leader"}:handle_uniform_consensus_committed{index=105}: async_raft::core::admin: membership: MembershipConfig { members: {0, 3, 2, 1}, members_after_consensus: None }
May 21 12:49:12.543 INFO main{id=0 cluster=test}:run{id=0 raft_state="leader"}:handle_uniform_consensus_committed{index=105}: async_raft::core::admin: nodes_to_remove: []
```
@drmingdrmer
Exactly, this is the behavior I referred to when I wrote that `self.nodes` was empty. Correctly handling these two collections would solve the issue.
Unfortunately, I solved this issue alongside the single-node consensus algorithm, so I don't have an isolated fix. However, it's not a big effort, IIRC.
@battila7 && @drmingdrmer thanks for all of the great discussion and info here. I should have time to review this weekend.