
Troubleshoot a Higher Vote Error

fredfortier opened this issue 3 years ago · 22 comments

I recently upgraded from an earlier alpha to v0.6.4. I worked out most of the kinks, but I'm seeing an error like this in my integration tests after a new node joined the cluster.

 116.603607178s ERROR ThreadId(03) Node{id=0}:spawn{service_name=MembershipManager}:apply_membership: openraft::replication: error replication to target=2 error=seen a higher vote: vote:2-2 GT mine: vote:1-0
 116.603669397s  WARN ThreadId(03) Node{id=0}:spawn{service_name=MembershipManager}:apply_membership: openraft::replication: error replication to target=2 error=seen a higher vote: vote:2-2 GT mine: vote:1-0
 116.603761513s  INFO ThreadId(03) Node{id=0}:spawn{service_name=MembershipManager}:apply_membership: ddx_core::membership: close time.busy=28.8ms time.idle=2.34s
 116.603844639s  INFO ThreadId(03) Node{id=0}:run_raft{node_id=0}: openraft::core: leader recv from replication_rx: "RevertToFollower: target: 2, vote: vote:2-2"

Do you have a rough idea of what could be causing this, to help me troubleshoot? I want to configure the cluster to be as stable as possible (err on the side of no re-election unless the leader goes offline for longer than the timeout).
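
For context, this is roughly the tuning I have in mind, just a sketch, with field names based on my reading of openraft::Config (please correct me if they're off):

// Hypothetical tuning sketch (field names are my reading of openraft::Config
// around v0.6.x; values in milliseconds): keep heartbeats much more frequent
// than the election timeout so followers only start an election when the
// leader has been silent for well over a heartbeat period.
let config = openraft::Config {
    heartbeat_interval: 150,
    election_timeout_min: 1500,
    election_timeout_max: 3000,
    ..Default::default()
};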

fredfortier avatar Feb 19 '22 00:02 fredfortier

Hi @fredfortier. I'll try to help, though I may not be the best person for this. Could you list the steps you took to configure the cluster, and how long the cluster stays stable? Also, what is the size of the logs, and what server/client are you using?

ppamorim avatar Feb 19 '22 02:02 ppamorim

It seems like node-2 entered the Candidate state and raised its term to elect itself.

Debug-level logs will show more detail about what happened, e.g., something like timeout to recv a event, change to CandidateState. Do you have debug-level logs that show why this happened?

@ppamorim had a similar issue to yours: replication to one node paused for some reason (too large a payload and skewed clocks on different nodes), on the latest main branch.
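
If you use tracing-subscriber for logging, something like this should surface those debug events (a sketch; it assumes the env-filter feature is enabled, and your logging setup may differ):

fn main() {
    // Sketch: turn on debug-level output for openraft only.
    // Assumes tracing-subscriber with the "env-filter" feature enabled.
    tracing_subscriber::fmt()
        .with_env_filter("openraft=debug")
        .init();

    // ... start the nodes as usual ...
}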

drmingdrmer avatar Feb 19 '22 04:02 drmingdrmer

@fredfortier What system(s) are you testing the cluster on, and could you try opening https://time.is on all machines to check the time difference? It should not be 1s or more.

ppamorim avatar Feb 19 '22 04:02 ppamorim

@ppamorim This is in the context of an integration test where the nodes share the same Tokio runtime. I'll grab some debug logs. I haven't seen this yet when testing with a Docker-based cluster with independent nodes. Thanks for the pointers!

fredfortier avatar Feb 19 '22 18:02 fredfortier

@fredfortier Possibly it's the payload size causing the node to stall and the leader to get voted out.

ppamorim avatar Feb 19 '22 18:02 ppamorim

Do you mean the snapshot payload? My snapshots are large, and I did have problems in earlier versions with the streaming apparatus (mismatched-offset type errors).

fredfortier avatar Feb 19 '22 18:02 fredfortier

@drmingdrmer I see the message you're talking about. I believe all nodes are using the same clock unless there's a Tokio issue I'm not aware of.

 117.522918280s DEBUG ThreadId(03) Node{id=1}:run_raft{node_id=1}:new{id=1 cluster=test}:run{id=1 raft_state="follower"}: openraft::core: timeout to recv a event, change to CandidateState
 117.522971974s DEBUG ThreadId(03) Node{id=1}:run_raft{node_id=1}:new{id=1 cluster=test}:run{id=1 raft_state="follower"}: openraft::core: set_target_state id=1 target_state=Candidate
 117.523014127s DEBUG ThreadId(03) Node{id=1}:run_raft{node_id=1}:new{id=1 cluster=test}:run{id=1 raft_state="follower"}: openraft::core: close time.busy=175ms time.idle=11.5s
 117.523112594s DEBUG ThreadId(03) Node{id=1}:run_raft{node_id=1}:new{id=1 cluster=test}:run{id=1 raft_state="candidate"}: openraft::core: update election timeout after: 1.164s
 117.524571392s  INFO ThreadId(02) send_next_chunk: local_network: close time.busy=355ms time.idle=1.07s
 117.525535330s DEBUG ThreadId(05) Node{id=1}:run_raft{node_id=1}:new{id=1 cluster=test}:run{id=1 raft_state="candidate"}:current_leader: openraft::core: close time.busy=6.95µs time.idle=12.2µs
 117.525627465s DEBUG ThreadId(05) Node{id=1}:run_raft{node_id=1}:new{id=1 cluster=test}:run{id=1 raft_state="candidate"}: openraft::core: report_metrics: Metrics{id:1,Candidate, term:2, last_log:Some(261), last_applied:Some(LogId { leader_id: LeaderId { term: 1, node_id: 0 }, index: 261 }), leader:None, membership:{log_id:1-0-253 membership:members:[{0, 1, 2}],learners:[]}, snapshot:Some(LogId { leader_id: LeaderId { term: 1, node_id: 0 }, index: 199 }), replication:
 117.525728429s DEBUG ThreadId(05) Node{id=1}:run_raft{node_id=1}:new{id=1 cluster=test}:run{id=1 raft_state="candidate"}:handle_vote_response{target=1}: openraft::core::vote: recv vote response res=VoteResponse { vote: Vote { term: 2, node_id: 1, committed: false }, vote_granted: true, last_log_id: Some(LogId { leader_id: LeaderId { term: 1, node_id: 0 }, index: 261 }) } target=1

This happens after node2 joined the cluster, but it looks like it's node1 that proposes itself as a candidate. Could the leader become unresponsive while installing a snapshot? Or could log compaction cause nodes to become unresponsive? What would be the best way to rule out the payload-too-large issue?

fredfortier avatar Feb 19 '22 19:02 fredfortier

@fredfortier No, I mean the payload of AppendEntriesRequest. You can check the payload size in the send_append_entries(...) function. Could you please test the configuration below?

let config = openraft::Config {
    election_timeout_max: 1000,
    max_payload_entries: 1,
    ..Default::default()
};

Please do not use this configuration in production, it's just to test your use case.

ppamorim avatar Feb 19 '22 19:02 ppamorim

@ppamorim I tried using the config above but got the same error:

 117.934701344s DEBUG ThreadId(05) Node{id=0}:spawn{service_name=MembershipManager}:apply_membership:add_learner{blocking=true target=2}:call_core:handle_msg{state="leader" id=0}:add_learner{target=2 blocking=true}:spawn_replication_stream{target=2}:line_rate_loop:send_append_entries: openraft::replication: append_entries resp: AppendEntriesResponse { vote: Vote { term: 2, node_id: 2, committed: false }, success: false, conflict: false }
 117.934758794s DEBUG ThreadId(05) Node{id=0}:spawn{service_name=MembershipManager}:apply_membership:add_learner{blocking=true target=2}:call_core:handle_msg{state="leader" id=0}:add_learner{target=2 blocking=true}:spawn_replication_stream{target=2}:line_rate_loop:send_append_entries: openraft::replication: append entries failed, reverting to follower append_resp.vote=vote:2-2
 117.934811256s DEBUG ThreadId(05) Node{id=0}:spawn{service_name=MembershipManager}:apply_membership:add_learner{blocking=true target=2}:call_core:handle_msg{state="leader" id=0}:add_learner{target=2 blocking=true}:spawn_replication_stream{target=2}:line_rate_loop:send_append_entries: openraft::replication: close time.busy=1.32ms time.idle=9.04ms
 117.934859480s ERROR ThreadId(05) Node{id=0}:spawn{service_name=MembershipManager}:apply_membership:add_learner{blocking=true target=2}:call_core:handle_msg{state="leader" id=0}:add_learner{target=2 blocking=true}:spawn_replication_stream{target=2}:line_rate_loop: openraft::replication: error replication to target=2 error=seen a higher vote: vote:2-2 GT mine: vote:1-0
 117.934912659s  WARN ThreadId(05) Node{id=0}:spawn{service_name=MembershipManager}:apply_membership:add_learner{blocking=true target=2}:call_core:handle_msg{state="leader" id=0}:add_learner{target=2 blocking=true}:spawn_replication_stream{target=2}: openraft::replication: error replication to target=2 error=seen a higher vote: vote:2-2 GT mine: vote:1-0
 117.935075693s  INFO ThreadId(05) Node{id=0}:run_raft{node_id=0}:new{id=0 cluster=foo}:run{id=0 raft_state="leader"}:leader_loop{id=0}: openraft::core: leader recv from replication_rx: "RevertToFollower: target: 2, vote: vote:2-2"
 117.936250829s  WARN ThreadId(06) Node{id=0}:spawn{service_name=MembershipManager}:apply_membership:add_learner{blocking=true target=1}:call_core:handle_msg{state="leader" id=0}:add_learner{target=1 blocking=true}:spawn_replication_stream{target=1}:line_rate_loop:CHrx:LineRate: ddx_core::api::routes::raft: Received vote node_id=1
 117.936640226s DEBUG ThreadId(06) Node{id=0}:spawn{service_name=MembershipManager}:apply_membership:add_learner{blocking=true target=1}:call_core:handle_msg{state="leader" id=0}:add_learner{target=1 blocking=true}:spawn_replication_stream{target=1}:line_rate_loop:CHrx:LineRate:vote{rpc=vote:2-2, last_log:Some(LogId { leader_id: LeaderId { term: 1, node_id: 0 }, index: 266 })}:call_core:handle_msg{state="follower" id=1}: openraft::core: recv from rx_api: RequestVote: vote:2-2, last_log:Some(LogId { leader_id: LeaderId { term: 1, node_id: 0 }, index: 266 })

I can confirm that the payload size is always either zero or one.

 117.923934566s  INFO ThreadId(06) Node{id=0}:spawn{service_name=MembershipManager}:apply_membership:add_learner{blocking=true target=2}:call_core:handle_msg{state="leader" id=0}:add_learner{target=2 blocking=true}:spawn_replication_stream{target=2}:line_rate_loop:send_append_entries:do_send_append_entries{target=2 payloads=1}: ddx_core::raft::network: close time.busy=576µs time.idle=50.7ms
 117.923991687s DEBUG ThreadId(04) Node{id=0}:run_raft{node_id=0}:new{id=0 cluster=foo}:run{id=0 raft_state="leader"}:leader_loop{id=0}:CHrx:LeaderState:call_core:handle_msg{state="candidate" id=2}: openraft::core: close time.busy=332µs time.idle=18.6µs
 117.924009767s  WARN ThreadId(06) Node{id=0}:spawn{service_name=MembershipManager}:apply_membership:add_learner{blocking=true target=2}:call_core:handle_msg{state="leader" id=0}:add_learner{target=2 blocking=true}:spawn_replication_stream{target=2}:line_rate_loop:send_append_entries: openraft::replication: timeout while sending AppendEntries RPC to target error=deadline has elapsed
 117.924060708s DEBUG ThreadId(04) Node{id=0}:run_raft{node_id=0}:new{id=0 cluster=foo}:run{id=0 raft_state="leader"}:leader_loop{id=0}:CHrx:LeaderState:call_core: openraft::raft: close time.busy=502µs time.idle=777ms
 117.924089421s DEBUG ThreadId(06) Node{id=0}:spawn{service_name=MembershipManager}:apply_membership:add_learner{blocking=true target=2}:call_core:handle_msg{state="leader" id=0}:add_learner{target=2 blocking=true}:spawn_replication_stream{target=2}:line_rate_loop:send_append_entries: openraft::replication: close time.busy=1.12ms time.idle=50.6ms
 117.924123245s DEBUG ThreadId(04) Node{id=2}:run_raft{node_id=2}:new{id=2 cluster=foo}:run{id=2 raft_state="candidate"}:CHrx:CandidateState: openraft::core: close time.busy=538µs time.idle=17.0µs

Do you see anything strange or out of order in the logs above? It's adding node2 as a learner but timing out after sending it an append-entries request. Node2 is joining with a pristine state, like node1 did successfully before. Should the leader be appending entries before installing the snapshot?

fredfortier avatar Feb 19 '22 19:02 fredfortier

@fredfortier Would it be possible to create a small demo that reproduces the issue?

ppamorim avatar Feb 19 '22 20:02 ppamorim

Yes, I'll keep on troubleshooting and do that if I can't resolve this issue. Thanks for all the pointers!

fredfortier avatar Feb 19 '22 20:02 fredfortier

Assuming this is a user/implementation issue. I'll reopen if I can reproduce it in isolation.

fredfortier avatar Feb 19 '22 20:02 fredfortier

I think I figured it out. Basically, after a node joins the cluster, there's a warm-up period during which the state is locked. During this time, the follower node waits for the state to unlock before responding to append_entries heartbeats. If this takes longer than the election timeout, a re-election is triggered.

Assuming this is expected behavior, what's the correct way to temporarily bring a node offline without triggering a re-election? Just respond to heartbeats with errors before the election timeout kicks in?

fredfortier avatar Feb 20 '22 00:02 fredfortier

Could the leader become unresponsive while installing a snapshot? Or could log compaction cause nodes to become unresponsive? What would be the best way to rule out the payload-too-large issue?

@fredfortier Yes, installing a snapshot blocks RaftCore, which is single-threaded.

Log compaction is done in another tokio task, so it won't block RaftCore unless the implementation of RaftStorage blocks, in which case RaftStorage may block RaftCore.
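
If your RaftStorage implementation ever has to do heavy synchronous work (compaction, large serialization), one way to keep it from stalling RaftCore is to push that work onto a blocking thread. A rough application-side sketch, where MyStore and compact_logs_sync are hypothetical, not openraft API:

use tokio::task;

// Sketch: run potentially blocking store work off the async runtime so the
// task driving RaftCore keeps making progress while the work happens.
// `MyStore` and `compact_logs_sync` are hypothetical application-side items.
struct MyStore;

impl MyStore {
    fn compact_logs_sync(&self) {
        // heavy synchronous work goes here
    }
}

async fn compact_in_background(store: MyStore) -> Result<(), task::JoinError> {
    task::spawn_blocking(move || store.compact_logs_sync()).await
}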

I'm gonna add a config to limit the size of a single payload.

drmingdrmer avatar Feb 20 '22 02:02 drmingdrmer

I think I figured it out. Basically, after a node joins the cluster, there's a warm-up period during which the state is locked. During this time, the follower node waits for the state to unlock before responding to append_entries heartbeats. If this takes longer than the election timeout, a re-election is triggered.

@fredfortier What's the state you mentioned? Do you mean the state machine?

Assuming this is expected behavior, what's the correct way to temporarily bring a node offline without triggering a re-election? Just respond to heartbeats with errors before the election timeout kicks in?

@fredfortier If RaftCore starts to run, there is no way to prevent it from entering the Candidate state.

If the warm-up is to warm up the store, is it possible to warm up the store before using it?

drmingdrmer avatar Feb 20 '22 02:02 drmingdrmer

If the warm-up is to warm up the store, is it possible to warm up the store before using it?

The reason for locking the state machine is at the application layer, outside of the RaftStorage implementation. Basically, the node goes offline while replaying transactions. I can implement this differently to avoid the offline period, but more generally, a follower may go offline for other reasons (e.g. no network access). In that case, the append_entries heartbeats sent by the leader will fail. I expect the leader to catch the follower up when it comes back online, but not for the other followers to trigger a re-election. Please let me know if I'm misunderstanding the basic premise.

Yes, installing a snapshot blocks RaftCore, which is single-threaded.

Do you mean RaftCore on the learner/follower node that is catching up, or on the leader node serving the snapshot? I assume the former, but the latter would definitely be an issue.

It won't block RaftCore unless the implementation of RaftStorage blocks, in which case RaftStorage may block RaftCore.

I don't have any blocking code in the RaftStorage implementation so it shouldn't be an issue.

fredfortier avatar Feb 20 '22 22:02 fredfortier

In that case, the append_entries heartbeats sent by the leader will fail. I expect the leader to catch the follower up when it comes back online, but not for the other followers to trigger a re-election. Please let me know if I'm misunderstanding the basic premise.

@fredfortier A crashed follower coming back online is totally OK. But a blocking RaftStorage is a different story: the follower creates a timer when it starts running, then blocks on RaftStorage; by the time it sees that the timer has fired, the follower believes the leader is gone.

Maybe there is a solution to this problem: when an election-timeout event and an append-entries event are both triggered, a follower should only deal with the latter. Let me see.
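
Roughly the shape of that idea, as an illustrative sketch only (this is not openraft's actual event loop): if an append-entries message is already pending when the election timer fires, handle the message and skip the election for that round.

use tokio::sync::mpsc;
use tokio::time::{sleep, Duration};

// Illustrative only: returns true when the follower should become a candidate.
// A `biased` select checks branches in order, so a pending append-entries
// message wins over an election timer that fired at the same time.
async fn follower_tick(append_rx: &mut mpsc::Receiver<()>, election_timeout: Duration) -> bool {
    tokio::select! {
        biased;
        Some(_) = append_rx.recv() => {
            // Leader traffic arrived: stay a follower.
            false
        }
        _ = sleep(election_timeout) => {
            // No leader traffic within the timeout: start an election.
            true
        }
    }
}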

Do you mean RaftCore on the learner/follower node that is catching up, or on the leader node serving the snapshot? I assume the former, but the latter would definitely be an issue.

Yes, I mean when installing the snapshot on the follower/learner side. The leader will send the snapshot data in another tokio task and will never block RaftCore.

drmingdrmer avatar Feb 21 '22 01:02 drmingdrmer

What I did intuitively was just raise an error from the follower.

pub(super) async fn append_handler(
    &self,
    rpc: AppendEntriesRequest<SequencedRequest>,
) -> Result<impl Reply, Rejection> {
    // Refuse the RPC while the application-level state is locked (warming up).
    if self.is_locked() {
        return Err(Rejection::msg("I'm too busy to respond now - keep sending heartbeats"));
    }
    self.raft.append_entries(rpc).await.map(Reply::from).map_err(Rejection::msg)
}

On the leader side, I assumed this is equivalent to the follower losing its network connection.

async fn send_append_entries(
    &self,
    target: u64,
    rpc: AppendEntriesRequest<SequencedRequest>,
) -> Result<AppendEntriesResponse, RPCError<AppendEntriesError>> {
    self.do_send_append_entries(target, rpc).await.map_err(|e| {
        RemoteError::new(
            target,
            // NOTE: Flagging as write error because it pertains to the remote node.
            StorageError::from(StorageIOError::new(
                ErrorSubject::Logs,
                ErrorVerb::Write,
                e.into(),
            ))
            .into(),
        )
        .into()
    })
}

The do_send_append_entries method sends the RPC over HTTP to the replication endpoint. As a minimal implementation, I just remap any HTTP error, whether a 404 (server offline) or a 500 (busy), to this particular StorageIOError type. I just guessed this error type, and it may well be what's causing the re-election behavior. Is it just a matter of raising the correct error type?

fredfortier avatar Feb 21 '22 17:02 fredfortier

I looked into the example-raft-kv and adjusted my RaftNetwork impl to the following:

    async fn send_append_entries(
        &self,
        target: u64,
        rpc: AppendEntriesRequest<SequencedRequest>,
    ) -> Result<AppendEntriesResponse, RPCError<AppendEntriesError>> {
        self.do_send_append_entries(target, rpc)
            .await
            // See `openraft/example-raft-kv` for usage example.
            .map_err(|e| {
                RPCError::Network(NetworkError::new(&RemoteError::new(
                    target,
                    StorageError::from(StorageIOError::new(
                        ErrorSubject::StateMachine,
                        ErrorVerb::Write,
                        e.into(),
                    )),
                )))
            })
    }

I assume that NetworkError produces a different behavior than StorageError. Please let me know if I'm misunderstanding the intent of these error types. This is a naive implementation, but since StorageError is serializable, I assume the intended implementation would do RemoteError::new(target, response.err.deserialize()?) to echo the internal follower error.

fredfortier avatar Feb 21 '22 21:02 fredfortier

A StorageError means the server has to shut down at once, to prevent further data damage.

And mapping a remote StorageError to a NetworkError will confuse the caller: an error that happened on a remote peer and an error that happened during transmission should not wrap each other.
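
For example, a pure transport failure could be reported as a NetworkError by itself, without fabricating a StorageError. A sketch against your snippet above (treat the exact type paths as my assumption):

async fn send_append_entries(
    &self,
    target: u64,
    rpc: AppendEntriesRequest<SequencedRequest>,
) -> Result<AppendEntriesResponse, RPCError<AppendEntriesError>> {
    self.do_send_append_entries(target, rpc)
        .await
        // The transport failed; no storage error happened on the remote peer,
        // so report it as a plain network error.
        .map_err(|e| RPCError::Network(NetworkError::new(&e)))
}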

drmingdrmer avatar Feb 22 '22 12:02 drmingdrmer

@fredfortier Assuming openraft should allow RaftStorage to block for a long time, then it is openraft's fault: it should update the next election timeout after calling RaftStorage instead of before it. That way the delay won't cause a timeout in the next loop: https://github.com/datafuselabs/openraft/blob/2055a236682f1456f453677676e32f78a43f1cbe/openraft/src/core/mod.rs#L1048-L1069

drmingdrmer avatar Feb 22 '22 12:02 drmingdrmer