
Cluster does not achieve leader election with 2 of 3 servers up after reboot

Open · nh2 opened this issue 3 years ago · 7 comments

Overview of the Issue

A kernel bug crashed the 3 Consul servers in my production cluster. After 2 of the 3 servers rebooted, they failed to achieve leader election.

Only when the 3rd server came back online 20 minutes later did the Consul cluster elect a leader. This goes against Consul's HA promise that 2-out-of-3 consensus should be enough (the Raft quorum for 3 servers is floor(3/2) + 1 = 2). My cluster did not tolerate a single one of its 3 nodes being down, and the key-value store did not work during this time, which is problematic since I use it for orchestrating all other parts of the cluster.

The journalctl -u consul output of the 2 machines that were up showed them cycling through the following Raft log messages (see the section further down for timestamps):

[ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 0b794b23-bb5d-6e76-7bf8-42b7df2fbea7 10.0.0.6:8300}" error="dial tcp 10.0.0.5:0->10.0.0.6:8300: connect: network is unreachable"
[INFO]  agent.server.raft: duplicate requestVote for same term: term=1931652
[WARN]  agent.server.raft: Election timeout reached, restarting election
[INFO]  agent.server.raft: entering candidate state: node="Node at 10.0.0.5:8300 [Candidate]" term=1931653
[WARN]  agent.server.raft: unable to get address for server, using fallback address: id=0b794b23-bb5d-6e76-7bf8-42b7df2fbea7 fallback=10.0.0.6:8300 error="Could not find address for server id 0b794b23-bb5d-6e76-7bf8-42b7df2fbea7"
[ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 0b794b23-bb5d-6e76-7bf8-42b7df2fbea7 10.0.0.6:8300}" error="dial tcp 10.0.0.5:0->10.0.0.6:8300: connect: network is unreachable"

The two available nodes were 10.0.0.4 and 10.0.0.5. The down node was 10.0.0.6.

I interpret the [ERROR] agent.server.raft: failed to make requestVote RPC: ... dial tcp 10.0.0.5:0->10.0.0.6:8300: connect: network is unreachable as meaning that the absence of this single node made consensus fail, but maybe that is a misinterpretation of the log?
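As a side note on reading that error: "connect: network is unreachable" is the ENETUNREACH errno, i.e. node-5 had no route to 10.0.0.6's network at that moment, as opposed to the host being up but refusing or timing out. A quick way to double-check this the next time it happens, using standard Linux tools:

ip route get 10.0.0.6   # does node-5 currently have a route to the down server?
ping -c 1 10.0.0.6      # distinguishes "no route" from "host down / timing out"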

Reproduction Steps

Steps to reproduce this issue, e.g.:

  1. Create a cluster with arbitrary client nodes and 3 server nodes
  2. Crash all 3 server nodes simultaneously (in my case, a kernel panic in an unrelated Ceph kernel module)
  3. Turn 2 of the 3 servers back on. Observe the behaviour above.
  4. Turn the 3rd server back on. Observe cluster recovery.

I do not know whether this is a reliable reproducer for this issue; a rough local sketch of these steps follows below.
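For illustration only, here is a minimal single-machine sketch of steps 1-3, with three agents bound to separate loopback addresses (all node names, addresses, and paths are placeholders, not values from my cluster):

# in three separate terminals, start a 3-server cluster
consul agent -server -bootstrap-expect=3 -node=srv1 -bind=127.0.0.11 -client=127.0.0.11 -data-dir=/tmp/consul-srv1 -retry-join=127.0.0.12 -retry-join=127.0.0.13
consul agent -server -bootstrap-expect=3 -node=srv2 -bind=127.0.0.12 -client=127.0.0.12 -data-dir=/tmp/consul-srv2 -retry-join=127.0.0.11 -retry-join=127.0.0.13
consul agent -server -bootstrap-expect=3 -node=srv3 -bind=127.0.0.13 -client=127.0.0.13 -data-dir=/tmp/consul-srv3 -retry-join=127.0.0.11 -retry-join=127.0.0.12

# simulate the crash: kill -9 all three agents, then restart only srv1 and srv2
# and watch their logs for the requestVote errors shown above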

Consul info for both Client and Server

Consul v1.10.3 on NixOS 21.11 on all machines.

Server info (after full recovery)
# consul info
agent:
  check_monitors = 1
  check_ttls = 1
  checks = 6
  services = 4
build:
  prerelease = 
  revision = 
  version = 1.10.3
consul:
  acl = disabled
  bootstrap = false
  known_datacenters = 1
  leader = false
  leader_addr = 10.0.0.5:8300
  server = true
raft:
  applied_index = 1297977634
  commit_index = 1297977634
  fsm_pending = 0
  last_contact = 35.713276ms
  last_log_index = 1297977634
  last_log_term = 1932011
  last_snapshot_index = 1297965123
  last_snapshot_term = 1932011
  latest_configuration = [{Suffrage:Voter ID:faa1b93a-3e3a-7608-42fb-8de8a7b62c73 Address:10.0.0.5:8300} {Suffrage:Voter ID:652b2790-dcac-e5a2-b2a7-a183b36d0d91 Address:10.0.0.4:8300} {Suffrage:Voter ID:0b794b23-bb5d-6e76-7bf8-42b7df2fbea7 Address:10.0.0.6:8300}]
  latest_configuration_index = 0
  num_peers = 2
  protocol_version = 3
  protocol_version_max = 3
  protocol_version_min = 0
  snapshot_version_max = 1
  snapshot_version_min = 0
  state = Follower
  term = 1932011
runtime:
  arch = amd64
  cpu_count = 16
  goroutines = 199
  max_procs = 16
  os = linux
  version = go1.16.9
serf_lan:
  coordinate_resets = 0
  encrypted = false
  event_queue = 0
  event_time = 5685
  failed = 0
  health_score = 0
  intent_queue = 0
  left = 0
  member_time = 3246
  members = 11
  query_queue = 0
  query_time = 1
serf_wan:
  coordinate_resets = 0
  encrypted = false
  event_queue = 0
  event_time = 1
  failed = 0
  health_score = 0
  intent_queue = 0
  left = 0
  member_time = 2422
  members = 3
  query_queue = 0
  query_time = 1
`consul members` (while the node-6 server was down)
root@node-5 ~ # consul members
Node          Address        Status  Type    Build   Protocol  DC   Segment
node-4        10.0.0.4:8301  alive   server  1.10.3  2         dc1  <all>
node-5        10.0.0.5:8301  alive   server  1.10.3  2         dc1  <all>
othernode-1   10.0.1.3:8301  alive   client  1.10.3  2         dc1  <default>
othernode-2   10.0.1.1:8301  alive   client  1.10.3  2         dc1  <default>
othernode-3   10.0.3.1:8301  alive   client  1.10.3  2         dc1  <default>
othernode-4   10.0.0.1:8301  alive   client  1.10.3  2         dc1  <default>
othernode-5   10.0.0.2:8301  alive   client  1.10.3  2         dc1  <default>
othernode-6   10.0.0.3:8301  alive   client  1.10.3  2         dc1  <default>
othernode-7   10.0.2.1:8301  alive   client  1.10.3  2         dc1  <default>
othernode-8   10.0.2.2:8301  alive   client  1.10.3  2         dc1  <default>

Operating system and Environment details

NixOS Linux 21.11, x86_64

Log Fragments

Logs of node-5 with timestamps:

Jan 10 16:49:40 node-5 consul[1294]: 2022-01-10T16:49:40.945Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 0b794b23-bb5d-6e76-7bf8-42b7df2fbea7 10.0.0.6:8300}" error="dial tcp 10.0.0.5:0->10.0.0.6:8300: connect: network is unreachable"
Jan 10 16:49:41 node-5 consul[1294]: 2022-01-10T16:49:41.619Z [INFO]  agent.server.raft: duplicate requestVote for same term: term=1931652
Jan 10 16:49:42 node-5 consul[1294]: 2022-01-10T16:49:42.493Z [WARN]  agent.server.raft: Election timeout reached, restarting election
Jan 10 16:49:42 node-5 consul[1294]: 2022-01-10T16:49:42.493Z [INFO]  agent.server.raft: entering candidate state: node="Node at 10.0.0.5:8300 [Candidate]" term=1931653
Jan 10 16:49:42 node-5 consul[1294]: 2022-01-10T16:49:42.493Z [WARN]  agent.server.raft: unable to get address for server, using fallback address: id=0b794b23-bb5d-6e76-7bf8-42b7df2fbea7 fallback=10.0.0.6:8300 error="Could not find address for server id 0b794b23-bb5d-6e76-7bf8-42b7df2fbea7"
Jan 10 16:49:42 node-5 consul[1294]: 2022-01-10T16:49:42.493Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 0b794b23-bb5d-6e76-7bf8-42b7df2fbea7 10.0.0.6:8300}" error="dial tcp 10.0.0.5:0->10.0.0.6:8300: connect: network is unreachable"

nh2 · Jan 10 '22 17:01

Hey @nh2

Couple questions:

  • What was the third unreachable node doing during this time?
  • Did the address of the machines change on restart?

Loss of a single node shouldn't cause consensus to fail under normal conditions, but it's hard to tell here since the initial outage involved multiple nodes (which could cause some data loss).

The Outage Recovery docs recommend some extra steps in the case of multiple-server failures, so I'm curious to know whether you'd still experience this issue after following that process.

Amier3 · Jan 18 '22 15:01

Hey @Amier3, thanks for looking at this.

What was the third unreachable node doing during this time?

It was powered off.

Did the address of the machines change on restart?

No, static IPs are configured.

since the initial outage involved multiple nodes (which could cause some data loss)

Should there be any data loss? As far as I understand, Consul's replication ensures synchronous durable writes to disk using fsync(); that means that all data Consul acknowledged as written to disk will still be there after the reboot.


There is only one scenario I can think of that might explain this (in the absence of bugs):

  1. Servers A, B, C are operating normally.
  2. Now A, B, and C power off almost simultaneously, but A powers off 100 ms earlier, and {B, C} still serve a Consul write in that window.
  3. Now {A,B} come back, but they cannot achieve quorum because A is behind in the Raft log.

Would the consul logs be able to tell me whether this scenario really happened (e.g. by telling me that B is refusing to work with A because it is behind in the Raft log)?
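One rough way I imagine checking for this while the cluster is stuck would be to compare the Raft progress fields that consul info already reports (as in the output above) on both surviving servers; this is just a guess at a diagnostic, not something from the docs:

# run on each surviving server and compare the values
consul info | grep -E 'last_log_index|last_log_term|commit_index'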

nh2 · Jan 18 '22 17:01

@nh2

Apologies for the delayed response. We'll need to see the Raft configuration in order to figure out why the leader vote failed. You can get that with consul operator raft list-peers. It'll also be helpful to see a longer log from journalctl -u consul.
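For reference, something like this would collect both (the --since/--until window is only an example around the timestamps you posted):

consul operator raft list-peers
journalctl -u consul --since "2022-01-10 16:30" --until "2022-01-10 17:10" > consul-node5.log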

Amier3 · Jan 24 '22 15:01

A minimum of 3 nodes is required to form a majority for leader election. With 2 nodes, a tie-break is required.

max19931 · May 11 '22 05:05

Dropping a manually created peer-list JSON file into a stuck production system is not really the most "confidence inspiring" recovery process. Are you working on an automated fix for this, so nodes don't get stuck?

runesl · Nov 24 '22 13:11

I'm experiencing the same issue on a staging cluster. consul operator raft list-peers responds with Error getting peers: Failed to retrieve raft configuration: Unexpected response code: 500 (No cluster leader), so it's not really helpful. Both remaining nodes show up in consul members:

Node                    Address             Status  Type    Build   Protocol  DC   Partition  Segment
consul-consul-server-0  100.92.229.24:8301  alive   server  1.20.2  2         dc1  default    <all>
consul-consul-server-2  100.85.51.103:8301  alive   server  1.20.2  2         dc1  default    <all>

The issue can be easily reproduced by losing quorum for a little while and then regaining it (3 -> 2 -> 1 -> 2 members).
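On Kubernetes the sequence looks roughly like this, assuming the Helm chart's StatefulSet is named consul-consul-server to match the pod names above (a sketch of the idea, not the exact commands I ran):

kubectl scale statefulset consul-consul-server --replicas=2   # lose one server
kubectl scale statefulset consul-consul-server --replicas=1   # lose quorum entirely
kubectl scale statefulset consul-consul-server --replicas=2   # bring a second server back: still no leader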

MrEcco · Aug 06 '25 11:08

I found a way to recover from this state using the manual DR guide: https://developer.hashicorp.com/consul/tutorials/operate-consul/recovery-outage#manual-recovery-using-peers-json (refer to it for details).

  1. Identified the node with the latest Raft state using consul info | grep applied_index (to minimize data loss).
  2. On that node, created peers.json with ONLY ONE MEMBER (see the sketch after this list).
  3. Killed the pod (restarting the Consul agent). After the restart it came back online.
  4. After the restart it apparently recovered as leader (finally!) and then replicated its state to the other members.
  5. Once the other members had joined, it removed peers.json.
  6. All pods (services) went back to the Running state.
  7. I have not found any data loss in the K/V store after this operation.
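As a sketch of step 2, assuming Raft protocol 3 and the Helm chart's data dir /consul/data (both assumptions; the node's own ID can be read from the node-id file in the data dir):

# hypothetical example; substitute the real node ID and IP of the chosen server
cat > /consul/data/raft/peers.json <<'EOF'
[
  {
    "id": "REPLACE-WITH-THIS-NODE-ID",
    "address": "REPLACE-WITH-THIS-NODE-IP:8300",
    "non_voter": false
  }
]
EOF
# then restart the agent (here: delete the pod) and let the other servers rejoin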

Hope this helps people who experience the same issue. It would be great if the developers could confirm that this is a safe procedure, because I cannot guarantee it.

MrEcco · Aug 06 '25 12:08