consul-k8s

Failed leadership election with three node cluster in GKE (Consul v1.5.2)

Open leetreveil opened this issue 2 years ago • 6 comments

Cross posting from https://discuss.hashicorp.com/t/failed-leadership-election-with-three-node-cluster-in-gke-consul-v1-5-2/50207/4


Yesterday we experienced failed cluster leadership election during a routine node pool upgrade in GKE. This left the cluster without a leader for 30+ minutes.

The timeline of events is as follows:

  • 2023-02-14 07:40:12.638 consul-consul-server-2 2023/02/14 07:40:12 [INFO] agent: Caught signal: terminated — consul-server-2 (leader) goes down
  • 2023-02-14 07:40:15.218 consul-consul-server-1 2023/02/14 07:40:15 [INFO] consul: New leader elected: consul-consul-server-1 — consul-server-1 becomes leader
  • 2023-02-14 07:46:59.824 consul-consul-server-2 2023/02/14 07:46:59 [INFO] agent: (LAN) joined: 3 — consul-server-2 comes back online
  • 2023-02-14 07:47:01.233 consul-consul-server-1 2023/02/14 07:47:01 [INFO] agent: Caught signal: terminated — consul-server-1 (leader) goes down
  • 2023-02-14 07:47:09.709 consul-consul-server-0 2023/02/14 07:47:09 [WARN] raft: Election timeout reached, restarting election — consul-server-0 is unable to elect a new leader as it is not talking to consul-server-2

From the logs, it looks like consul-server-0 and consul-server-2 were unable to elect a leader after consul-server-1 went down.

What we don’t understand at this point is how the cluster got itself into this state.

Autopilot config:

CleanupDeadServers = true
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 10s
RedundancyZoneTag = ""
DisableUpgradeMigration = false
UpgradeVersionTag = ""

We are using the stock helm chart with only a couple of tweaks:

{
  "performance": {
    "raft_multiplier": 1
  },
  "disable_update_check": true,
  "telemetry": {
    "prometheus_retention_time": "20s"
  }
}
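
For reference, overrides like these are normally fed to the server agents through the chart's server.extraConfig value, which takes a JSON string. A minimal values.yaml sketch, assuming that is how they were applied:

server:
  replicas: 3
  extraConfig: |
    {
      "performance": {
        "raft_multiplier": 1
      },
      "disable_update_check": true,
      "telemetry": {
        "prometheus_retention_time": "20s"
      }
    }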

Logs

"consul-consul-server-0","    2023/02/14 07:47:08 [WARN]  raft: Election timeout reached, restarting election"
"consul-consul-server-1","    2023/02/14 07:47:07 [ERR] agent: Coordinate update error: No cluster leader"
"consul-consul-server-1","    2023/02/14 07:47:07 [WARN] consul: error getting server health from ""consul-consul-server-2"": rpc error getting client: failed to get conn: dial tcp <nil>->10.239.114.198:8300: i/o timeout"
"consul-consul-server-0","    2023/02/14 07:47:06 [INFO]  raft: Node at 10.239.113.198:8300 [Candidate] entering Candidate state in term 1621"
"consul-consul-server-0","    2023/02/14 07:47:06 [WARN]  raft: Election timeout reached, restarting election"
"consul-consul-server-2","    2023/02/14 07:47:05 [INFO] memberlist: Suspect consul-consul-server-1 has failed, no acks received"
"consul-consul-server-0","    2023/02/14 07:47:05 [INFO]  raft: Node at 10.239.113.198:8300 [Candidate] entering Candidate state in term 1620"
"consul-consul-server-0","    2023/02/14 07:47:05 [WARN]  raft: Election timeout reached, restarting election"
"consul-consul-server-0","    2023/02/14 07:47:04 [INFO]  raft: Node at 10.239.113.198:8300 [Candidate] entering Candidate state in term 1619"
"consul-consul-server-0","    2023/02/14 07:47:04 [WARN]  raft: Election timeout reached, restarting election"
"consul-consul-server-0","    2023/02/14 07:47:02 [ERROR] raft: Failed to make RequestVote RPC to {Voter e6758c6f-e4ca-cb09-62a6-ff6a209747f8 10.239.112.134:8300}: EOF"
"consul-consul-server-1","    2023/02/14 07:47:02 [ERR] raft-net: Failed to decode incoming command: transport shutdown"
"consul-consul-server-0","    2023/02/14 07:47:02 [INFO]  raft: Node at 10.239.113.198:8300 [Candidate] entering Candidate state in term 1618"
"consul-consul-server-0","    2023/02/14 07:47:02 [WARN]  raft: Heartbeat timeout from ""10.239.112.134:8300"" reached, starting election"
"consul-consul-server-1","    2023/02/14 07:47:01 [INFO]  raft: aborting pipeline replication to peer {Nonvoter 795d24ab-3d36-05dd-046c-393be5c94ed3 10.239.114.135:8300}"
"consul-consul-server-1","    2023/02/14 07:47:01 [INFO]  raft: aborting pipeline replication to peer {Voter 4c238024-cf5f-cff9-0e6f-028eba511645 10.239.113.198:8300}"
"consul-consul-server-1","    2023/02/14 07:47:01 [INFO] manager: shutting down"
"consul-consul-server-1","    2023/02/14 07:47:01 [WARN] serf: Shutdown without a Leave"
"consul-consul-server-1","    2023/02/14 07:47:01 [WARN] serf: Shutdown without a Leave"
"consul-consul-server-1","    2023/02/14 07:47:01 [INFO] consul: shutting down server"
"consul-consul-server-1","    2023/02/14 07:47:01 [INFO] agent: Requesting shutdown"
"consul-consul-server-1","    2023/02/14 07:47:01 [INFO] agent: Graceful shutdown disabled. Exiting"
"consul-consul-server-1","    2023/02/14 07:47:01 [INFO] agent: Caught signal:  terminated"
"consul-consul-server-1","    2023/02/14 07:47:00 [INFO]  raft: pipelining replication to peer {Nonvoter 795d24ab-3d36-05dd-046c-393be5c94ed3 10.239.114.135:8300}"
"consul-consul-server-2","    2023/02/14 07:46:59 [INFO] agent: Synced node info"
"consul-consul-server-1","    2023/02/14 07:46:59 [WARN]  raft: AppendEntries to {Nonvoter 795d24ab-3d36-05dd-046c-393be5c94ed3 10.239.114.135:8300} rejected, sending older logs (next: 25953103)"
"consul-consul-server-2","    2023/02/14 07:46:59 [WARN]  raft: Failed to get previous log: 25953359 log not found (last: 25953102)"
"consul-consul-server-1","    2023/02/14 07:46:59 [INFO] consul: member 'consul-consul-server-2' joined, marking health alive"
"consul-consul-server-1","    2023/02/14 07:46:59 [INFO]  raft: Added peer 795d24ab-3d36-05dd-046c-393be5c94ed3, starting replication"
"consul-consul-server-1","    2023/02/14 07:46:59 [INFO]  raft: Updating configuration with AddNonvoter (795d24ab-3d36-05dd-046c-393be5c94ed3, 10.239.114.135:8300) to [{Suffrage:Voter ID:e6758c6f-e4ca-cb09-62a6-ff6a209747f8 Address:10.239.112.134:8300} {Suffrage:Voter ID:4c238024-cf5f-cff9-0e6f-028eba511645 Address:10.239.113.198:8300} {Suffrage:Nonvoter ID:795d24ab-3d36-05dd-046c-393be5c94ed3 Address:10.239.114.135:8300}]"
"consul-consul-server-1","    2023/02/14 07:46:59 [INFO] consul: removed server with duplicate ID: 795d24ab-3d36-05dd-046c-393be5c94ed3"
"consul-consul-server-2","    2023/02/14 07:46:59 [INFO] agent: Join LAN completed. Synced with 3 initial agents"
"consul-consul-server-2","    2023/02/14 07:46:59 [INFO] agent: (LAN) joined: 3"
"consul-consul-server-1","    2023/02/14 07:46:59 [INFO]  raft: Removed peer 795d24ab-3d36-05dd-046c-393be5c94ed3, stopping replication after 25953358"
"consul-consul-server-1","    2023/02/14 07:46:59 [INFO]  raft: Updating configuration with RemoveServer (795d24ab-3d36-05dd-046c-393be5c94ed3, ) to [{Suffrage:Voter ID:e6758c6f-e4ca-cb09-62a6-ff6a209747f8 Address:10.239.112.134:8300} {Suffrage:Voter ID:4c238024-cf5f-cff9-0e6f-028eba511645 Address:10.239.113.198:8300}]"
"consul-consul-server-1","    2023/02/14 07:46:59 [INFO] consul: Adding LAN server consul-consul-server-2 (Addr: tcp/10.239.114.135:8300) (DC: <redacted>)"
"consul-consul-server-1","    2023/02/14 07:46:59 [INFO] serf: EventMemberJoin: consul-consul-server-2 10.239.114.135"
"consul-consul-server-0","    2023/02/14 07:46:59 [INFO] consul: Adding LAN server consul-consul-server-2 (Addr: tcp/10.239.114.135:8300) (DC: <redacted>)"
"consul-consul-server-0","    2023/02/14 07:46:59 [INFO] serf: EventMemberJoin: consul-consul-server-2 10.239.114.135"
"consul-consul-server-2","    2023/02/14 07:46:59 [INFO] consul: Adding LAN server consul-consul-server-1 (Addr: tcp/10.239.112.134:8300) (DC: <redacted>)"
"consul-consul-server-2","    2023/02/14 07:46:59 [INFO] consul: Adding LAN server consul-consul-server-0 (Addr: tcp/10.239.113.198:8300) (DC: <redacted>)"
"consul-consul-server-2","    2023/02/14 07:46:59 [INFO] serf: EventMemberJoin: consul-consul-server-1 10.239.112.134"
"consul-consul-server-2","    2023/02/14 07:46:59 [INFO] agent: (LAN) joining: [consul-consul-server-0.consul-consul-server.consul.svc:8301 consul-consul-server-1.consul-consul-server.consul.svc:8301 consul-consul-server-2.consul-consul-server.consul.svc:8301]"
"consul-consul-server-2","    2023/02/14 07:46:59 [INFO] agent: Joining LAN cluster..."
"consul-consul-server-2","    2023/02/14 07:46:59 [INFO] agent: Retry join LAN is supported for: aliyun aws azure digitalocean gce k8s mdns os packet scaleway softlayer triton vsphere"
"consul-consul-server-2","==> Consul agent running!"

@maxb from the forum posted this very insightful analysis:

I do need to begin by pointing out that Consul 1.5 is an unreasonably ancient version to still be running these days. That aside…

The problem here is that consul-server-1 was taken down at a point where it was critical to maintain the quorum.

Raft incorporates various features to try to keep the cluster healthy. Sometimes these features help… and at other times they backfire horribly, creating the outage they were intended to prevent.

In this case the feature at issue is “server stabilization time” which says that after a server joins the cluster, it must spend some time healthy as a non-voter, i.e. not part of the quorum, before it can be promoted to a voting member.

Here is the timeline:

  • Initial state: Cluster size 3, quorum requires 2 of 3 servers
  • consul-server-2 down: Quorum is maintained by -1 and -0 still being up
  • consul-server-2 up, with a new IP address:
      ◦ The current leader observes a new server with a known ID but a changed address, and removes the version of consul-server-2 with the old address from the cluster. The cluster size is now 2, and quorum requires both of those 2 servers to remain up.
      ◦ The current leader re-adds the new version of consul-server-2 to the cluster, but for now as a non-voter. The voting cluster size is still only 2.
      ◦ After 10 seconds, the non-voter would become eligible for promotion to a voter, but before that can happen…
  • consul-server-1 is taken down for maintenance! QUORUM NOW BROKEN. Only 1 of the 2 voting members is available.

leetreveil avatar Feb 20 '23 11:02 leetreveil

Could you add some information about the process that was controlling the rolling restart of the Consul server nodes?

If it was an external sequencer that was attempting to understand when it was safe to kill another pod using its own logic, that logic may be at fault. Whereas, if the decision on timing when to kill another pod was based on readiness checking and PodDisruptionBudget provided by the consul-k8s project, that would indicate a fault in that.

maxb avatar Feb 20 '23 11:02 maxb

Whereas, if the decision on timing when to kill another pod was based on readiness checking and PodDisruptionBudget provided by the consul-k8s project, that would indicate a fault in that.

This is correct. We operate Consul in GKE, where during a node pool upgrade Kubernetes respects the StatefulSet's PDB of N-1 (a sketch of that budget follows the list below).

When a node is upgraded, the following happens with the Pods:

  • The node is cordoned so that Kubernetes does not schedule new Pods on it.
  • The node is then drained, meaning that the Pods are removed. For surge upgrades, GKE respects the Pod's PodDisruptionBudget and GracefulTerminationPeriod settings for up to one hour. With blue-green upgrades, this can be extended if you configure a longer soaking time.
  • The control plane reschedules Pods managed by controllers onto other nodes. Pods that cannot be rescheduled stay in the Pending phase until they can be rescheduled.
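
For context on the budget in play, here is a rough sketch of a PDB that produces that N-1 behaviour with three replicas. The name, labels and apiVersion are illustrative rather than copied from the chart, which renders its own object (typically via the server.disruptionBudget values):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: consul-consul-server
spec:
  maxUnavailable: 1          # with 3 replicas, at most one server pod may be voluntarily evicted
  selector:
    matchLabels:
      app: consul
      component: server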

With regards to what happened in this particular scenario, as soon as consul-consul-server-2 comes up and reports to k8s as being healthy per the readiness check:

Readiness:  exec [
/bin/sh -ec curl http://127.0.0.1:8500/v1/status/leader 2>/dev/null | grep -E '".+"'
] delay=5s timeout=5s period=3s #success=1 #failure=2

GKE will cordon the next node and begin draining workloads, resulting in consul-consul-server-1 being sent SIGTERM.

leetreveil avatar Feb 22 '23 14:02 leetreveil

Thanks... it seems fairly clear at this point that the default readinessProbe in https://github.com/hashicorp/consul-k8s/blob/main/charts/consul/templates/server-statefulset.yaml is insufficient, and needs to be replaced with a custom command that understands that a server pod cannot be considered ready if it has non-voter status in the Raft configuration.
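
A rough illustration of what such a stricter probe could look like, only reporting ready once the local server shows up as a Raft voter. This is an untested sketch: it assumes the Consul node name equals the pod hostname, that the consul binary can reach the local agent's HTTP API, and that no ACL token is required.

readinessProbe:
  exec:
    command:
      - /bin/sh
      - -ec
      - |
        # Ready only when this server appears as a voter in the Raft configuration.
        # Column 5 of `consul operator raft list-peers` output is the Voter flag.
        consul operator raft list-peers | awk -v n="$(hostname)" '$1 == n { print $5 }' | grep -q true
  initialDelaySeconds: 5
  timeoutSeconds: 5
  periodSeconds: 3
  failureThreshold: 2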

maxb avatar Feb 22 '23 15:02 maxb

To mitigate we increased readinessProbe.initialDelaySeconds to 120.

This gives the Consul server enough time to boot, figure out DNS, and transition to a voter.

This works because the headless server service publishes not-ready addresses (https://github.com/hashicorp/consul-k8s/blob/main/charts/consul/templates/server-service.yaml#L26), which allows the Consul servers to gossip before the pod transitions to ready.

Once the pod transitions to ready, k8s will immediately kill the next pod as per the PDB.
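
As a concrete sketch of how that delay can be applied when the chart does not expose the probe settings: a strategic merge patch over the rendered StatefulSet. The object and container names below assume the default release name and chart layout, and a direct patch is undone by the next helm upgrade, so a kustomize overlay may be preferable.

# Strategic merge patch for the server StatefulSet; apply e.g. with kubectl patch
# or keep it in a kustomize overlay so it survives re-rendering.
spec:
  template:
    spec:
      containers:
        - name: consul                   # server container name assumed from the chart
          readinessProbe:
            initialDelaySeconds: 120     # give the server time to join, sync and become a voter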

leetreveil avatar Mar 18 '23 16:03 leetreveil

Hi @leetreveil, did I understand correctly that changing the readinessProbe.initialDelaySeconds to 120 and increasing the number of replicas to 5 solved the problem?

andreyreshetnikov-zh avatar Jul 13 '23 11:07 andreyreshetnikov-zh

Correct.

Five nodes in combination with a 120-second initial delay significantly decrease the probability of complete leadership loss.

You should measure how long it takes for a node to transition to a voter and adjust the initial delay as appropriate for your environment.
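
For anyone applying the same mitigation, the replica bump itself is just the standard chart value; a minimal sketch:

server:
  replicas: 5        # up from 3; a 5-voter cluster (quorum of 3) tolerates two server losses instead of one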

leetreveil avatar Jul 13 '23 11:07 leetreveil