Consul - leader instability - Raft leader not found in server lookup mapping

The Consul cluster got into an unstable state

Our Consul deployment runs in a k8s cluster (installed through Helm) with 3 servers. We saw this error from one of the servers, consul-server-0:

2024-09-16 21:22 agent: Coordinate update error: error="Raft leader not found in server lookup mapping"

and then periodic errors (two or three times a day) afterwards, like these:

### From the consul agents
agent.http: Request error: method=GET url=/v1/catalog/services from=192.168.187.228:59522 error="rpc error making call: Raft leader not found in server lookup mapping"
agent.client: RPC failed to server: method=Catalog.ListServices server=192.168.70.185:8300 error="rpc error making call: Raft leader not found in server lookup mapping"

### From the consul servers consul-server-0 and consul-server-1
agent.server: Raft has a leader but other tracking of the node would indicate that the node is unhealthy or does not exist. The network may be misconfigured.: leader=192.168.x.y:8300 (consul-server-2)

The leader did not change during the time of the first error.
There was user impact: certain requests failed.
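
The server-side message suggests that the leader known to Raft is missing from what the node tracks through gossip. Below is a minimal Python sketch for comparing those two views on a single agent. It assumes the JSON shapes returned by `GET /v1/status/leader` (an `ip:port` string) and `GET /v1/agent/members` (a list with `Addr`, `Status`, and `Tags`), and it assumes that server members carry the `role=consul` and `port` tags. The function names are hypothetical, and this is a diagnostic aid, not a reproduction of Consul's internal lookup.

```python
import json
import urllib.request

ALIVE = 1  # serf member status code for "alive"


def server_rpc_addrs(members):
    """Return the RPC addresses (ip:port) of alive server members.

    `members` is the decoded JSON list from /v1/agent/members.
    Assumption: servers are tagged role=consul and expose their RPC
    port in the "port" tag (8300 is used when the tag is absent).
    """
    addrs = set()
    for member in members:
        tags = member.get("Tags") or {}
        if tags.get("role") == "consul" and member.get("Status") == ALIVE:
            addrs.add(f'{member["Addr"]}:{tags.get("port", "8300")}')
    return addrs


def leader_is_tracked(leader_addr, members):
    """True if the Raft leader address is among the alive gossip servers."""
    return leader_addr in server_rpc_addrs(members)


def fetch_json(base_url, path):
    """GET a Consul HTTP API path and decode the JSON body."""
    with urllib.request.urlopen(base_url + path, timeout=5) as resp:
        return json.load(resp)


def check_agent(base_url):
    """Return (leader_addr, tracked) as seen by the agent at base_url."""
    leader = fetch_json(base_url, "/v1/status/leader")
    members = fetch_json(base_url, "/v1/agent/members")
    return leader, leader_is_tracked(leader, members)
```

Calling `check_agent("http://127.0.0.1:8500")` from inside each server pod (the address is an assumption about the local HTTP listener) would show whether that node's gossip view contains the leader it reports.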

There was no other network maintenance, and there were no issues with CPU, memory, etc. Any thoughts on why this failed, how to recover now (should we restart the leader, consul-server-2?), and whether any parameters need to be tuned to avoid a recurrence?

Consul info for both Client and Server

Client info

agent:
        check_monitors = 0
        check_ttls = 39
        checks = 39
        services = 39
build:
        prerelease =
        revision = 2c56447e
        version = 1.11.1
consul:
        acl = disabled
        known_servers = 3
        server = false
runtime:
        arch = amd64
        cpu_count = 80
        goroutines = 125
        max_procs = 80
        os = linux
        version = go1.17.5
serf_lan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 40
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 1020
        members = 7
        query_queue = 0
        query_time = 1

Client agent HCL config

Server info

/ $ consul info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 0
build:
        prerelease =
        revision = 2c56447e
        version = 1.11.1
consul:
        acl = disabled
        bootstrap = false
        known_datacenters = 1
        leader = false
        leader_addr = 192.168.56.152:8300
        server = true
raft:
        applied_index = 28764675
        commit_index = 28764675
        fsm_pending = 0
        last_contact = 4.451653ms
        last_log_index = 28764675
        last_log_term = 172
        last_snapshot_index = 28759968
        last_snapshot_term = 172
        latest_configuration = [{Suffrage:Voter ID:1b113a54-186f-d265-4e56-e6653f61abc6 Address:192.168.187.233:8300} {Suffrage:Voter ID:c266bd28-7448-8a20-18d6-2b075aaf901a Address:192.168.56.152:8300} {Suffrage:Voter ID:71762986-ddd4-e5d5-d147-6175df021bcc Address:192.168.70.185:8300}]
        latest_configuration_index = 0
        num_peers = 2
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 172
runtime:
        arch = amd64
        cpu_count = 80
        goroutines = 200
        max_procs = 80
        os = linux
        version = go1.17.5
serf_lan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 40
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 1020
        members = 7
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 407
        members = 3
        query_queue = 0
        query_time = 1
/ $
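
As a quick sanity check on output like the above, the following Python sketch parses `consul info` text and verifies that `leader_addr` is one of the addresses listed in `latest_configuration`. The field names come from the output shown here; the function names are hypothetical, and the parsing assumes the layout stays as printed above.

```python
import re


def parse_consul_info(text):
    """Extract (leader_addr, raft_peer_addresses) from `consul info` text."""
    # [ \t]* instead of \s* so a blank value cannot swallow the next line.
    leader = re.search(r"^[ \t]*leader_addr[ \t]*=[ \t]*(\S*)", text, re.MULTILINE)
    config = re.search(
        r"^[ \t]*latest_configuration[ \t]*=[ \t]*(.*)$", text, re.MULTILINE
    )
    peers = re.findall(r"Address:([^\s}]+)", config.group(1)) if config else []
    leader_addr = leader.group(1) if leader and leader.group(1) else None
    return leader_addr, peers


def leader_in_raft_config(text):
    """True if the reported leader is a member of the Raft configuration."""
    leader_addr, peers = parse_consul_info(text)
    return leader_addr is not None and leader_addr in peers
```

For the server output above this check passes, since 192.168.56.152:8300 appears among the three voters. That is consistent with Raft itself being healthy and the mismatch lying elsewhere.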

Server agent HCL config

Operating system and Environment details

This environment runs on a k8s cluster, v1.24. Consul is version 1.11.

Log Fragments

Sep 26 '24 12:09 krishgu

I hit the same problem with Consul v1.19.2. On a cluster with 5 nodes, elections were started after the connection to 2 nodes in another datacenter was lost. Leadership passed from one Consul node to another (both in the datacenter where 3 nodes are installed). (BTW, I don't understand why the leadership changed.) Since the election, the two disconnected nodes have come back and the cluster is almost OK, but the old leader does not respond to queries and returns the error "Raft leader not found in server lookup mapping". The other nodes work correctly and report the new leader correctly. The old leader also logs the error "Raft has a leader but other tracking of the node would indicate that the node is unhealthy or does not exist. The network may be misconfigured".
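
To spot which server is out of step in a situation like this, one option is to ask every server for its leader and compare the answers. A minimal Python sketch follows, assuming the answers have already been collected into a dict (for example from `GET /v1/status/leader` on each server, with `None` recorded when the query fails). The function name and the server names in the usage example are hypothetical.

```python
from collections import Counter


def leader_outliers(views):
    """Find servers whose leader view differs from the majority.

    `views` maps server name -> leader address it reports, or None when
    the query failed. Returns (majority_leader, sorted_outlier_names).
    On a tie, the address counted first wins.
    """
    reported = [addr for addr in views.values() if addr]
    if not reported:
        # Nobody reported a leader, so every server is suspect.
        return None, sorted(views)
    majority, _ = Counter(reported).most_common(1)[0]
    outliers = sorted(name for name, addr in views.items() if addr != majority)
    return majority, outliers
```

For example, `leader_outliers({"node-a": "10.0.0.2:8300", "node-b": "10.0.0.2:8300", "node-c": None})` returns `("10.0.0.2:8300", ["node-c"])`, which points at the node that failed to answer.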

Nov 21 '24 10:11 Peter2121

Is there any solution for this?

Mar 22 '25 11:03 mahendrasomavarapu