Consul - leader instability - Raft leader not found in server lookup mapping
Consul cluster got into an unstable state
Our Consul deployment runs in a k8s cluster (installed through Helm) with 3 servers. We saw this error from one of the servers, consul-server-0:
2024-09-16 21:22 agent: Coordinate update error: error="Raft leader not found in server lookup mapping"
and then periodic errors (two or three times a day) afterwards, like these:
### From the consul agents
agent.http: Request error: method=GET url=/v1/catalog/services from=192.168.187.228:59522 error="rpc error making call: Raft leader not found in server lookup mapping"
agent.client: RPC failed to server: method=Catalog.ListServices server=192.168.70.185:8300 error="rpc error making call: Raft leader not found in server lookup mapping"
### From the consul servers consul-server-0 and consul-server-1
agent.server: Raft has a leader but other tracking of the node would indicate that the node is unhealthy or does not exist. The network may be misconfigured.: leader=192.168.x.y:8300 (consul-server-2)
- the leader did not change during the time of the first error
- there was a user impact in that certain requests did fail
There was no network maintenance at the time, and no CPU or memory issues. Any thoughts on why this failed, how to recover now (should we restart the leader, consul-server-2?), and whether any parameters need to be tuned to avoid a recurrence?
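To narrow this down, a small check along these lines could be run against each server's HTTP API. It is only a sketch: it asks an agent who it thinks the Raft leader is (`/v1/status/leader`) and whether that address shows up among the alive servers in its Serf LAN member list (`/v1/agent/members`). This approximates, but is not the same as, Consul's internal server lookup, and the helper names (`server_rpc_addrs`, `leader_is_known`, `check_agent`) are made up for illustration. It assumes servers carry the Serf tags `role=consul` and `port=<RPC port>` and that `Status` 1 means alive.

```python
import json
import urllib.request


def server_rpc_addrs(members):
    """Return the 'ip:port' RPC addresses of alive Consul servers.

    `members` is the parsed JSON list from GET /v1/agent/members.
    Assumption: servers are tagged role=consul, advertise their RPC
    port in the 'port' tag, and Status == 1 means alive.
    """
    addrs = set()
    for m in members:
        tags = m.get("Tags") or {}
        if tags.get("role") == "consul" and m.get("Status") == 1:
            addrs.add(f"{m['Addr']}:{tags.get('port', '8300')}")
    return addrs


def leader_is_known(leader_addr, members):
    """True if the Raft leader address matches an alive server in `members`."""
    return bool(leader_addr) and leader_addr in server_rpc_addrs(members)


def fetch_json(base_url, path, timeout=5):
    """GET base_url + path and decode the JSON body."""
    with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
        return json.load(resp)


def check_agent(base_url):
    """Return (leader_addr, known) as seen by the agent at base_url."""
    leader = fetch_json(base_url, "/v1/status/leader")
    members = fetch_json(base_url, "/v1/agent/members")
    return leader, leader_is_known(leader, members)
```

For example, after port-forwarding a server's HTTP port, `check_agent("http://127.0.0.1:8500")` returns the leader address that agent reports and whether it matched an alive server. A server that reports a leader but returns `False` here would be consistent with the "Raft has a leader but other tracking of the node..." warning. Note that `urlopen` raises `HTTPError` if the agent answers with an error status.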
Consul info for both Client and Server
Client info
```
agent:
    check_monitors = 0
    check_ttls = 39
    checks = 39
    services = 39
build:
    prerelease =
    revision = 2c56447e
    version = 1.11.1
consul:
    acl = disabled
    known_servers = 3
    server = false
runtime:
    arch = amd64
    cpu_count = 80
    goroutines = 125
    max_procs = 80
    os = linux
    version = go1.17.5
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 40
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1020
    members = 7
    query_queue = 0
    query_time = 1
```
Client agent HCL config
Server info
```
/ $ consul info
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 2c56447e
    version = 1.11.1
consul:
    acl = disabled
    bootstrap = false
    known_datacenters = 1
    leader = false
    leader_addr = 192.168.56.152:8300
    server = true
raft:
    applied_index = 28764675
    commit_index = 28764675
    fsm_pending = 0
    last_contact = 4.451653ms
    last_log_index = 28764675
    last_log_term = 172
    last_snapshot_index = 28759968
    last_snapshot_term = 172
    latest_configuration = [{Suffrage:Voter ID:1b113a54-186f-d265-4e56-e6653f61abc6 Address:192.168.187.233:8300} {Suffrage:Voter ID:c266bd28-7448-8a20-18d6-2b075aaf901a Address:192.168.56.152:8300} {Suffrage:Voter ID:71762986-ddd4-e5d5-d147-6175df021bcc Address:192.168.70.185:8300}]
    latest_configuration_index = 0
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 172
runtime:
    arch = amd64
    cpu_count = 80
    goroutines = 200
    max_procs = 80
    os = linux
    version = go1.17.5
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 40
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1020
    members = 7
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 407
    members = 3
    query_queue = 0
    query_time = 1
/ $
```
Server agent HCL config
Operating system and Environment details
This environment runs on a k8s cluster, v1.24. Consul is version 1.11.1.
Log Fragments
I hit the same problem with Consul v1.19.2. On a 5-node cluster, elections started after the connection to 2 nodes in another datacenter was lost. Leadership passed from one Consul node to another one in the same datacenter, the one where 3 of the nodes are installed. (BTW, I don't understand why the leadership changed.) Since the election, the two disconnected nodes have come back and the cluster is almost OK, but the old leader fails queries with the error "Raft leader not found in server lookup mapping". The other nodes work correctly and report the new leader correctly. The old leader also logs the error "Raft has a leader but other tracking of the node would indicate that the node is unhealthy or does not exist. The network may be misconfigured".
Is there any solution for this?