There is a possibility that the leader is never elected even if the majority of the members are alive
Describe the bug
Let me report an issue we encountered while operating our service that uses `ra`.
We were operating a 7-node cluster and stopped 3 of them for maintenance. After stopping the 3 nodes, our service became unavailable due to the absence of the Raft leader. It seemed that leader elections were executed periodically, but a new leader was never elected until we restarted member nodes.
I think that, as shown in the following reproduction steps section, this is a subtle bug relating to the `pre_vote` state (a state specific to `ra` that is not defined in the Raft paper), and how to fix it is not immediately obvious. Therefore, I think it would be better to leave the resolution of this issue to the `ra` dev team. However, since this is a critical issue for us, I am willing to create a PR if the `ra` team does not have enough resources to address this issue.
Reproduction steps
Simplified scenario where this issue could occur
I guess a scenario like the following occurred:
1. There is a `ra` cluster consisting of 3 members named `a`, `b`, and `c`:
    1. `c` is the leader with term `N` and log index `M` (where `N` and `M` are arbitrary integers)
    2. `a` and `b` are in `follower` state
2. For some reason, `a` transitions to `pre_vote` state:
    1. `a` broadcasts `#pre_vote_rpc{ term = N }`
    2. `b` replies `#pre_vote_result{ term = N, vote_granted = true }` to `a`
    3. `a` transitions to `candidate` state with term `N + 1`
    4. `a` broadcasts `#request_vote_rpc{ term = N + 1 }`
3. `c` processes a command:
    1. `c` increases its local log index to `M + 1`, and broadcasts `#append_entries_rpc{ term = N }`
    2. `b` increases its local log index to `M + 1`, and replies `#append_entries_reply{ term = N, success = true }` to `c`
    3. `a` rejects the RPC as `a` has a greater term than `c` (i.e., the local log index of `a` does not increase here)
4. `c` and `b` receive `#request_vote_rpc{ term = N + 1 }` from `a` (this message was sent during step 2-4):
    1. `c` transitions to `follower` state (as `c` has a smaller term)
    2. `c` replies `#request_vote_result{ vote_granted = false }` as the local log index of `c` is higher than that of `a`
    3. `b` replies `#request_vote_result{ vote_granted = false }` as the local log index of `b` is higher than that of `a`
    4. => Repeatedly, `a` initiates new elections but is never chosen as the next leader because `a` has a smaller log index
5. For some reason, `b` is stopped
6. By election timeout, `c` transitions to `pre_vote` state:
    1. `c` broadcasts `#pre_vote_rpc{ term = N_ }`
        - Where `N_` is an integer larger than `N`
        - `N_` is incremented by `a` each time `a` initiates a new election
    2. `a` ignores `#pre_vote_rpc{ term = N_ }` as `a` is in `candidate` state and `a`'s term is always equal to or larger than `N_`
    3. `c` cannot transition to `candidate` state as it does not receive votes from a majority
    4. After the election timeout period has elapsed, `c` repeats step 6.
7. There is no leader until humans take action (e.g., node restart):
    1. `a` remains in `candidate` state (with a smaller log index than `c`)
    2. `c` alternates between `follower` and `pre_vote` states (with a term equal to or smaller than `a`'s term)
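The stuck state in steps 4 to 7 can be sketched as a small standalone Erlang model. This is not `ra` code: the module name `prevote_deadlock_sketch`, the map-based member representation, and both predicates are assumptions made for illustration. Logs are compared by index only, assuming all entries share the same last term, and `b` is treated as stopped.

```erlang
-module(prevote_deadlock_sketch).
-export([grants_vote/2, answers_pre_vote/2, quorum/1, demo/0]).

%% Number of votes needed in a cluster of ClusterSize members.
quorum(ClusterSize) ->
    ClusterSize div 2 + 1.

%% Simplified Raft vote rule: the candidate's term must not be stale and
%% its log must be at least as long as the voter's log.
grants_vote(#{term := CTerm, last_index := CIdx},
            #{term := VTerm, last_index := VIdx}) ->
    CTerm >= VTerm andalso CIdx >= VIdx.

%% Assumption taken from step 6 of the scenario: a member in `candidate`
%% state ignores a pre-vote request whose term is not greater than its own.
answers_pre_vote(#{term := ReqTerm}, #{state := candidate, term := Term})
  when ReqTerm =< Term ->
    false;
answers_pre_vote(Requester, Voter) ->
    grants_vote(Requester, Voter).

%% `a` and `c` after `b` has been stopped. Each member counts its own
%% vote plus whatever the single remaining peer grants.
demo() ->
    A = #{state => candidate, term => 8, last_index => 10},
    C = #{state => pre_vote, term => 8, last_index => 11},
    Quorum = quorum(3),
    AVotes = 1 + count(grants_vote(A, C)),
    CPreVotes = 1 + count(answers_pre_vote(C, A)),
    #{a_elected => AVotes >= Quorum,
      c_passes_pre_vote => CPreVotes >= Quorum}.

count(true) -> 1;
count(false) -> 0.
```

In this model, `prevote_deadlock_sketch:demo()` reports that `a` cannot be elected (its log is behind `c`'s) and that `c` cannot pass the pre-vote phase (its only live peer ignores the request), which corresponds to the state described in step 7.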
Commands and a patch for reproduction
Please execute the following commands to reproduce the scenario described above. (The reproduction rate is not 100%, but it is high in my environment.)
// Clone ra and checkout v2.10.1
$ git clone https://github.com/rabbitmq/ra.git
$ cd ra/
$ git checkout v2.10.1
// Apply the patch shown below
//
// [NOTE]
// This patch modifies the ra code, but only adjusts the execution and communication timing
// to make it easier to reproduce the issue.
// For example, it introduces a communication delay between two members.
$ git apply /path/to/ra.patch
// Start Erlang shell
$ rebar3 shell --sname foo@localhost
// Run a function to reproduce the above scenario
//
// [NOTE]
// The member names are `repro_a`, `repro_b`, and `repro_c`.
// They are respectively associated with `a`, `b`, and `c` in the scenario.
(foo@localhost)1> repro:run().
# create cluster
* [repro_c] init
* [repro_b] init
* [repro_a] init
* [repro_c] state_enter: recover
* [repro_a] state_enter: recover
* [repro_b] state_enter: recover
* [repro_c] state_enter: recovered
* [repro_a] state_enter: recovered
* [repro_b] state_enter: recovered
* [repro_c] state_enter: follower
* [repro_a] state_enter: follower
* [repro_b] state_enter: follower
* [repro_c] state_enter: pre_vote
* [repro_c] state_enter: candidate
* [repro_c] state_enter: leader
# Please wait 5 seconds...
# trigger election
ok
* [repro_a] state_enter: pre_vote
* [repro_a] state_enter: candidate
* [repro_c] state_enter: follower
* [repro_c] state_enter: pre_vote
* [repro_c] state_enter: follower
* [repro_c] state_enter: pre_vote
* [repro_c] state_enter: follower
... repeat forever ...
ra.patch
diff --git a/src/ra_server.erl b/src/ra_server.erl
index 7fb5931..1d0fd57 100644
--- a/src/ra_server.erl
+++ b/src/ra_server.erl
@@ -984,7 +984,7 @@ handle_pre_vote(#pre_vote_result{term = Term, vote_granted = true,
token = Token},
#{current_term := Term,
votes := Votes,
- cfg := #cfg{log_id = LogId},
+ cfg := #cfg{id = Id, log_id = LogId},
pre_vote_token := Token,
cluster := Nodes} = State0) ->
?DEBUG("~ts: pre_vote granted ~w for term ~b votes ~b",
@@ -993,6 +993,24 @@ handle_pre_vote(#pre_vote_result{term = Term, vote_granted = true,
State = update_term(Term, State0),
case required_quorum(Nodes) of
NewVotes ->
+ case Id of
+ {repro_a, _} ->
+ %% Ensure that the log lengths of `repro_c` and `repro_b` are greater than that of `repro_a`
+ %% to prevent `repro_a` from becoming the new leader.
+ %% (NOTE: `#append_entries_rpc{}` from `repro_c` to `repro_a` is delayed.)
+ {ok, ok, _} = ra:process_command({repro_c, node()}, hello),
+
+ %% NOTE:
+ %% `repro_c` will transition from leader to follower at some point later
+ %% because `repro_a` will become `candidate` and increment the term.
+
+ %% Stop `repro_b`.
+ %% This ensures that it's mandatory for `repro_c` to gain a vote from
+ %% `repro_a` to be re-elected as the leader.
+ ok = ra:stop_server(default, {repro_b, node()});
+ _ ->
+ ok
+ end,
call_for_election(candidate, State);
_ ->
{pre_vote, State#{votes => NewVotes}, []}
diff --git a/src/ra_server_proc.erl b/src/ra_server_proc.erl
index 789d3cf..75394fc 100644
--- a/src/ra_server_proc.erl
+++ b/src/ra_server_proc.erl
@@ -342,11 +342,31 @@ do_init(#{id := Id,
low_priority_commands = ra_ets_queue:new(),
server_state = ServerState},
ok = net_kernel:monitor_nodes(true, [nodedown_reason]),
+ put(delayed_sender, spawn_link(fun() -> delayed_send(queue:new()) end)),
State.
%% callback mode
callback_mode() -> [state_functions, state_enter].
+delayed_send(Queue0) ->
+ Now = erlang:monotonic_time(millisecond),
+ Timeout =
+ case queue:peek(Queue0) of
+ empty ->
+ infinity;
+ {value, {SendTime, _, _}} ->
+ max(0, SendTime - Now)
+ end,
+ receive
+ {send, Delay, To, Msg} ->
+ Queue1 = queue:in({Now + Delay, To, Msg}, Queue0),
+ delayed_send(Queue1)
+ after Timeout ->
+ {{value, {_, To, Msg}}, Queue1} = queue:out(Queue0),
+ To ! Msg,
+ delayed_send(Queue1)
+ end.
+
%%%===================================================================
%%% State functions
%%%===================================================================
@@ -1630,6 +1650,20 @@ reject_command(Pid, Corr, #state{leader_monitor = _Mon} = State) ->
maybe_persist_last_applied(#state{server_state = NS} = State) ->
State#state{server_state = ra_server:persist_last_applied(NS)}.
+send({repro_a, _} = To, Msg, _Conf) ->
+ %% Add a delay when sending messages to `repro_a` server.
+ %%
+ %% Without this delay, the leader will prevent `repro_a` from transitioning to `candidate` state
+ %% by promptly sending an empty `#append_entries_rpc{}` upon receiving `#pre_vote_rpc{}` from `repro_a`.
+ Delay =
+ case get(ra_state) of
+ leader ->
+ 1000;
+ _ ->
+ 0
+ end,
+ get(delayed_sender) ! {send, Delay, To, Msg},
+ ok;
send(To, Msg, Conf) ->
% we do not want to block the ra server whilst attempting to set up
% a TCP connection to a potentially down node or when the distribution
diff --git a/src/repro.erl b/src/repro.erl
new file mode 100644
index 0000000..b1bda56
--- /dev/null
+++ b/src/repro.erl
@@ -0,0 +1,51 @@
+-module(repro).
+
+-behaviour(ra_machine).
+
+-export([run/0]).
+
+-export([init/1, apply/3, state_enter/2]).
+
+
+run() ->
+ _ = file:del_dir_r("foo@localhost"),
+ ok = ra:start(),
+
+ %% Create a cluster with 3 members.
+ io:format("# create cluster~n"),
+ Module = ?MODULE,
+ Machine = {module, ?MODULE, #{}},
+ Node = node(),
+ ServerIds = [{repro_a, Node}, {repro_b, Node}, {repro_c, Node}],
+ {ok, _ServersStarted, []} = ra:start_cluster(default, Module, Machine, ServerIds),
+
+ io:format("# Please wait 5 seconds...~n"),
+ ok = timer:sleep(5000),
+
+ %% Assumes repro_c is the leader.
+ {repro_c, Node} = maps:get(leader_id, element(2, ra:member_overview(repro_a))),
+
+ %% Trigger an election that will cause the problem described in this issue.
+ io:format("# trigger election~n"),
+ ok = ra:trigger_election({repro_a, node()}),
+
+ ok.
+
+
+init(_) ->
+ io:format("* [~p] init~n", [name()]),
+ #{}.
+
+
+apply(_Metadata, _Command, State) ->
+ {State, ok}.
+
+
+state_enter(RaState, _State) ->
+ io:format("* [~p] state_enter: ~p~n", [name(), RaState]),
+ put(ra_state, RaState),
+ [].
+
+
+name() ->
+ element(2, erlang:process_info(self(), registered_name)).
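The `delayed_send/1` helper in the patch can also be tried outside `ra` as a standalone process. The following is a sketch under assumptions: the module name `delayed_sender_sketch` and the functions `start/0`, `send_after/4`, and `demo/0` are hypothetical. Unlike the patch, it timestamps each request when it arrives rather than at the top of the loop, so the delay is measured from the moment the request is received.

```erlang
-module(delayed_sender_sketch).
-export([start/0, send_after/4, demo/0]).

%% Start a process that forwards messages after a per-message delay.
start() ->
    spawn_link(fun() -> loop(queue:new()) end).

%% Ask Sender to deliver Msg to To after Delay milliseconds.
send_after(Sender, Delay, To, Msg) ->
    Sender ! {send, Delay, To, Msg},
    ok.

loop(Queue0) ->
    Now = erlang:monotonic_time(millisecond),
    %% Wait until the head of the queue is due, or forever if it is empty.
    Timeout =
        case queue:peek(Queue0) of
            empty ->
                infinity;
            {value, {SendTime, _, _}} ->
                max(0, SendTime - Now)
        end,
    receive
        {send, Delay, To, Msg} ->
            %% Timestamp on arrival so that idle time is not counted.
            Due = erlang:monotonic_time(millisecond) + Delay,
            loop(queue:in({Due, To, Msg}, Queue0))
    after Timeout ->
        %% The head of the queue is due: deliver it and continue.
        {{value, {_, To, Msg}}, Queue1} = queue:out(Queue0),
        To ! Msg,
        loop(Queue1)
    end.

%% Send `hello` to ourselves with a 100 ms delay and measure the wait.
demo() ->
    Sender = start(),
    T0 = erlang:monotonic_time(millisecond),
    ok = send_after(Sender, 100, self(), hello),
    receive
        hello ->
            erlang:monotonic_time(millisecond) - T0
    after 1000 ->
        timeout
    end.
```

`delayed_sender_sketch:demo()` returns the number of milliseconds that passed before `hello` arrived, or `timeout` if nothing arrived within one second. Messages are released in FIFO order, so a message never overtakes one that was queued before it.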
Expected behavior
A leader should eventually be elected if the majority of members are alive.
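This expectation can be phrased as a bounded wait for a leader, which might serve as a starting point for a regression test. The sketch below is an illustration under assumptions: `leader_liveness_sketch`, `await_leader/3`, and the `{ok, Leader} | none` contract of the caller-supplied fun are all hypothetical. In a real test the fun would have to query the cluster (for example via `ra:members/1`), which is not shown here.

```erlang
-module(leader_liveness_sketch).
-export([await_leader/3, demo/0]).

%% Poll LeaderFun up to Attempts times, sleeping IntervalMs between
%% polls. LeaderFun must return {ok, Leader} or the atom `none`.
await_leader(_LeaderFun, 0, _IntervalMs) ->
    {error, no_leader};
await_leader(LeaderFun, Attempts, IntervalMs) when Attempts > 0 ->
    case LeaderFun() of
        {ok, Leader} ->
            {ok, Leader};
        none ->
            timer:sleep(IntervalMs),
            await_leader(LeaderFun, Attempts - 1, IntervalMs)
    end.

%% Two stand-in probes: one that never sees a leader (the state
%% described in this issue) and one that sees `repro_c` immediately.
demo() ->
    Stuck = fun() -> none end,
    Healthy = fun() -> {ok, {repro_c, node()}} end,
    {await_leader(Stuck, 3, 10), await_leader(Healthy, 3, 10)}.
```

With the stand-in probes, the first call gives up with `{error, no_leader}` after three polls, while the second returns the leader on the first poll. A test built on this idea would fail in the stuck scenario instead of hanging.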
Additional context
No response
@sile thank you for the detailed report.
Please open a pull request with your patch. If your `repro.erl` code can be turned into a test for this issue, that would be great. Thank you.
@lukebakken Thank you for your response.
I will work on fixing this issue and submit a pull request. Writing a unit test seems challenging, but I will give it a try as well.
By the way, if there is a design document about the `pre_vote` state, please let me know. I am interested in understanding why this new state needed to be introduced into `ra`, what properties or invariants this state should maintain, and any other relevant details. This information would be very helpful as I consider the best approach to address this issue.