Cluster does not achieve leader election with 2 of 3 servers up after reboot
Overview of the Issue
A kernel bug crashed the 3 Consul servers in my production cluster. After 2 of the 3 servers rebooted, they failed to achieve leader election.
Only when the 3rd server came back online 20 minutes later did the Consul cluster elect a leader. This goes against the HA promise of Consul that 2-out-of-3 consensus should be enough: my cluster did not tolerate a single one of its 3 nodes being down, and the Key-Value store did not work during this time, which is problematic since I use it for orchestrating all other parts of the cluster.
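To spell out the quorum arithmetic that this expectation is based on (a sketch of the numbers only, not output from any Consul command):

```sh
# Raft needs floor(N/2) + 1 voters reachable to elect a leader:
#   N=3 -> quorum of 2 -> should tolerate 1 server being down
#   N=5 -> quorum of 3 -> should tolerate 2 servers being down
echo $(( 3 / 2 + 1 ))   # prints 2: the two surviving servers should suffice
```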
The journalctl -u consul output on the 2 machines that were up showed them cycling through the following Raft log messages (see the Log Fragments section further down for timestamps):
[ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 0b794b23-bb5d-6e76-7bf8-42b7df2fbea7 10.0.0.6:8300}" error="dial tcp 10.0.0.5:0->10.0.0.6:8300: connect: network is unreachable"
[INFO] agent.server.raft: duplicate requestVote for same term: term=1931652
[WARN] agent.server.raft: Election timeout reached, restarting election
[INFO] agent.server.raft: entering candidate state: node="Node at 10.0.0.5:8300 [Candidate]" term=1931653
[WARN] agent.server.raft: unable to get address for server, using fallback address: id=0b794b23-bb5d-6e76-7bf8-42b7df2fbea7 fallback=10.0.0.6:8300 error="Could not find address for server id 0b794b23-bb5d-6e76-7bf8-42b7df2fbea7"
[ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 0b794b23-bb5d-6e76-7bf8-42b7df2fbea7 10.0.0.6:8300}" error="dial tcp 10.0.0.5:0->10.0.0.6:8300: connect: network is unreachable"
The two available nodes were 10.0.0.4 and 10.0.0.5. The down node was 10.0.0.6.
I interpret the [ERROR] agent.server.raft: failed to make requestVote RPC: ... dial tcp 10.0.0.5:0->10.0.0.6:8300: connect: network is unreachable as meaning that the absence of this single node made consensus fail, but maybe that is a misinterpretation of the log?
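For reference, the Raft state on the surviving servers can be inspected like this (a sketch; consul operator raft list-peers needs a leader to answer, so it may itself error out while the cluster is stuck):

```sh
# On each surviving server (10.0.0.4 and 10.0.0.5):

# Raft section of the agent self-report: term, log indices, peer configuration
consul info | grep -A 20 '^raft:'

# Peer list as Raft sees it (requires a working leader, so it may fail here)
consul operator raft list-peers

# Serf-level membership, which works even without a leader
consul members
```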
Reproduction Steps
Steps to reproduce this issue, e.g.:
- Create a cluster with an arbitrary number of client nodes and 3 server nodes.
- Crash all 3 server nodes simultaneously (in my case, a kernel panic in an unrelated Ceph kernel module).
- Turn 2 of the 3 servers back on. Observe the behaviour described above.
- Turn the 3rd server back on. Observe that the cluster recovers.
I do not know if this is a reliable reproducer for this issue.
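For what it's worth, a local approximation of the setup can be built from three server agents that are then SIGKILLed to simulate the crash. This is only a sketch (node names, loopback addresses and data dirs are made up) and I have not verified that it reproduces the election loop:

```sh
# Start 3 local server agents on separate loopback addresses (illustrative only).
for i in 1 2 3; do
  consul agent -server -bootstrap-expect=3 \
    -node="srv$i" -bind="127.0.0.$i" -client="127.0.0.$i" \
    -data-dir="/tmp/consul-srv$i" \
    -retry-join=127.0.0.1 -retry-join=127.0.0.2 -retry-join=127.0.0.3 &
done

# Simulate the simultaneous kernel panic: kill all three without cleanup.
pkill -KILL -f 'consul agent -server'

# Bring back only 2 of the 3 servers and watch whether they elect a leader.
for i in 1 2; do
  consul agent -server -bootstrap-expect=3 \
    -node="srv$i" -bind="127.0.0.$i" -client="127.0.0.$i" \
    -data-dir="/tmp/consul-srv$i" \
    -retry-join=127.0.0.1 -retry-join=127.0.0.2 -retry-join=127.0.0.3 &
done
```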
Consul info for both Client and Server
Consul v1.10.3 on NixOS 21.11 on all machines.
Server info (after full recovery)
# consul info
agent:
check_monitors = 1
check_ttls = 1
checks = 6
services = 4
build:
prerelease =
revision =
version = 1.10.3
consul:
acl = disabled
bootstrap = false
known_datacenters = 1
leader = false
leader_addr = 10.0.0.5:8300
server = true
raft:
applied_index = 1297977634
commit_index = 1297977634
fsm_pending = 0
last_contact = 35.713276ms
last_log_index = 1297977634
last_log_term = 1932011
last_snapshot_index = 1297965123
last_snapshot_term = 1932011
latest_configuration = [{Suffrage:Voter ID:faa1b93a-3e3a-7608-42fb-8de8a7b62c73 Address:10.0.0.5:8300} {Suffrage:Voter ID:652b2790-dcac-e5a2-b2a7-a183b36d0d91 Address:10.0.0.4:8300} {Suffrage:Voter ID:0b794b23-bb5d-6e76-7bf8-42b7df2fbea7 Address:10.0.0.6:8300}]
latest_configuration_index = 0
num_peers = 2
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Follower
term = 1932011
runtime:
arch = amd64
cpu_count = 16
goroutines = 199
max_procs = 16
os = linux
version = go1.16.9
serf_lan:
coordinate_resets = 0
encrypted = false
event_queue = 0
event_time = 5685
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 3246
members = 11
query_queue = 0
query_time = 1
serf_wan:
coordinate_resets = 0
encrypted = false
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 2422
members = 3
query_queue = 0
query_time = 1
`consul members` (while the `node-6` server was down)
root@node-5 ~ # consul members
Node Address Status Type Build Protocol DC Segment
node-4 10.0.0.4:8301 alive server 1.10.3 2 dc1 <all>
node-5 10.0.0.5:8301 alive server 1.10.3 2 dc1 <all>
othernode-1 10.0.1.3:8301 alive client 1.10.3 2 dc1 <default>
othernode-2 10.0.1.1:8301 alive client 1.10.3 2 dc1 <default>
othernode-3 10.0.3.1:8301 alive client 1.10.3 2 dc1 <default>
othernode-4 10.0.0.1:8301 alive client 1.10.3 2 dc1 <default>
othernode-5 10.0.0.2:8301 alive client 1.10.3 2 dc1 <default>
othernode-6 10.0.0.3:8301 alive client 1.10.3 2 dc1 <default>
othernode-7 10.0.2.1:8301 alive client 1.10.3 2 dc1 <default>
othernode-8 10.0.2.2:8301 alive client 1.10.3 2 dc1 <default>
Operating system and Environment details
NixOS Linux 21.11, x86_64
Log Fragments
Logs of node-5 with timestamps:
Jan 10 16:49:40 node-5 consul[1294]: 2022-01-10T16:49:40.945Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 0b794b23-bb5d-6e76-7bf8-42b7df2fbea7 10.0.0.6:8300}" error="dial tcp 10.0.0.5:0->10.0.0.6:8300: connect: network is unreachable"
Jan 10 16:49:41 node-5 consul[1294]: 2022-01-10T16:49:41.619Z [INFO] agent.server.raft: duplicate requestVote for same term: term=1931652
Jan 10 16:49:42 node-5 consul[1294]: 2022-01-10T16:49:42.493Z [WARN] agent.server.raft: Election timeout reached, restarting election
Jan 10 16:49:42 node-5 consul[1294]: 2022-01-10T16:49:42.493Z [INFO] agent.server.raft: entering candidate state: node="Node at 10.0.0.5:8300 [Candidate]" term=1931653
Jan 10 16:49:42 node-5 consul[1294]: 2022-01-10T16:49:42.493Z [WARN] agent.server.raft: unable to get address for server, using fallback address: id=0b794b23-bb5d-6e76-7bf8-42b7df2fbea7 fallback=10.0.0.6:8300 error="Could not find address for server id 0b794b23-bb5d-6e76-7bf8-42b7df2fbea7"
Jan 10 16:49:42 node-5 consul[1294]: 2022-01-10T16:49:42.493Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 0b794b23-bb5d-6e76-7bf8-42b7df2fbea7 10.0.0.6:8300}" error="dial tcp 10.0.0.5:0->10.0.0.6:8300: connect: network is unreachable"
Hey @nh2
Couple questions:
- What was the third unreachable node doing during this time?
- Did the address of the machines change on restart?
Loss of a single node shouldn't result in consensus failing under normal conditions, but it's hard to tell since the initial outage was of multiple nodes (which could cause some data loss).
The Outage Recovery docs recommend some extra steps in the case of multiple-server failures, so I'm curious to know whether you'd still experience this issue after following that process as well.
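For reference, the peers.json step from that guide looks roughly like the sketch below for raft protocol 3 (which your consul info shows). The data-dir path is just an example, and the IDs/addresses are copied from your latest_configuration above; please follow the linked guide rather than this sketch for the authoritative steps:

```sh
# With the consul agent stopped on every surviving server:
systemctl stop consul

# Write <data-dir>/raft/peers.json listing the voters to keep
# (data-dir path is an example; adjust to your configuration).
cat > /var/lib/consul/raft/peers.json <<'EOF'
[
  { "id": "faa1b93a-3e3a-7608-42fb-8de8a7b62c73", "address": "10.0.0.5:8300", "non_voter": false },
  { "id": "652b2790-dcac-e5a2-b2a7-a183b36d0d91", "address": "10.0.0.4:8300", "non_voter": false }
]
EOF

systemctl start consul   # the agents ingest the file on startup and then delete it
```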
Hey @Amier3, thanks for looking at this.
What was the third unreachable node doing during this time?
It was powered off.
Did the address of the machines change on restart?
No, static IPs are configured.
since the initial outage was of multiple nodes (which could cause some data loss)
Should there be any data loss? As far as I understand, Consul's replication ensures synchronous durable writes to disk using fsync(); that means that all data Consul acknowledged as written to disk will still be there after the reboot.
There is only one scenario I can think of that might explain this (in absence of bugs):
- Servers A, B, C are operating normally.
- Now A, B, C power off almost simultaneously, but A powers off 100 ms earlier, and {B, C} still serve a Consul write in that time.
- Now {A,B} come back, but they cannot achieve quorum because A is behind in the Raft log.
Would the consul logs be able to tell me whether this scenario really happened (e.g. by telling me that B is refusing to work with A because it is behind in the Raft log)?
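If it helps: this is how I would compare the Raft log position of the two surviving nodes to test that hypothesis (a sketch; I unfortunately did not capture these values during the outage):

```sh
# Run on each surviving server and compare the values; if one node's
# last_log_index is ahead, it holds entries the other node never received.
consul info | grep -E 'last_log_index|last_log_term|commit_index'
```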
@nh2
Apologies for the delayed response. We'll need to see the Raft configuration in order to figure out why the leader vote failed. You can get that with consul operator raft list-peers. It'll also be helpful to see a longer log from journalctl -u consul.
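Something along these lines should capture both (a sketch; adjust the time range to your outage window):

```sh
# Raft peer configuration as the servers currently see it
consul operator raft list-peers

# Consul logs covering the outage window, from each surviving server
journalctl -u consul --since "2022-01-10 16:00" --until "2022-01-10 17:30" > consul-node-5.log
```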
A minimum of 3 nodes is required to make a majority for the leader election, right? With only 2 nodes a tie-break is required.
Dropping a manually created peer-list JSON file into a stuck production system is not really the most "confidence-inspiring" recovery process. Are you working on an automated fix for this, so that nodes don't get stuck?
I experience the same issue on a staging cluster. consul operator raft list-peers responds with Error getting peers: Failed to retrieve raft configuration: Unexpected response code: 500 (No cluster leader), so it's not really helpful. Both nodes show as alive in consul members:
Node Address Status Type Build Protocol DC Partition Segment
consul-consul-server-0 100.92.229.24:8301 alive server 1.20.2 2 dc1 default <all>
consul-consul-server-2 100.85.51.103:8301 alive server 1.20.2 2 dc1 default <all>
The issue can be easily reproduced by losing quorum for a little while and getting it back (3->2->1->2 members).
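One way to simulate that 3 -> 2 -> 1 -> 2 sequence on a Helm-based install like mine is sketched below (the statefulset name is taken from my pod names above; this is only an illustration):

```sh
# Simulate the 3 -> 2 -> 1 -> 2 member sequence by scaling the server statefulset.
kubectl scale statefulset consul-consul-server --replicas=2
kubectl scale statefulset consul-consul-server --replicas=1   # quorum lost
kubectl scale statefulset consul-consul-server --replicas=2   # bring one server back
# With 2 of the expected 3 servers running, they keep cycling elections
# and never elect a leader, as in the logs above.
```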
I found that the way to recover from this state is to use this manual DR guide: https://developer.hashicorp.com/consul/tutorials/operate-consul/recovery-outage#manual-recovery-using-peers-json (refer to it for the details).
- Identified the node with the latest Raft state using `consul info | grep applied_index` (to minimize data loss).
- On that node, created `peers.json` with ONLY ONE MEMBER.
- Killed the pod (restarted the Consul agent). After the restart it came back online.
- It seems that after the restart it recovered as Leader (finally!), then distributed the state to the other members.
- And after the other members joined, it removed `peers.json`.
- All pods (services) came back to the Running state.
- I haven't identified data loss in K/V after this operation.
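For anyone wanting to double-check the result, something like the following should do (a sketch):

```sh
# Confirm a leader exists and all servers are voters again
consul operator raft list-peers

# Confirm the Raft log advanced and the node sees a leader
consul info | grep -E 'leader|applied_index|last_log_index'

# Snapshot the whole K/V tree to compare against pre-outage data
consul kv export > kv-after-recovery.json
```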
I hope this helps people who experience the same issue. It would be awesome if the developers could confirm that this is a safe procedure, because I cannot guarantee it.