consul
consul copied to clipboard
"agent.server.serf.lan: serf: query reply ID mismatch" triggered by widespread keyring queries
Overview of the Issue
When reading LAN and WAN keyrings (operator/keyring) from every node in a cluster (~300 node) which is also federated with several other similar cluster, we start seeing the following messages for the WAN and LAN gossip pools:
May 07 07:53:48 hostXYZ consul[30719]: 2024-05-07T07:53:48.117Z [WARN] agent.server.serf.wan: serf: query reply ID mismatch (Local: 324682337, Response: 1978383755)
May 07 07:53:48 hostXYZ consul[30719]: 2024-05-07T07:53:48.129Z [WARN] agent.server.serf.wan: serf: query reply ID mismatch (Local: 324682337, Response: 1765095158)
May 07 07:53:48 hostXYZ consul[30719]: 2024-05-07T07:53:48.507Z [WARN] agent.server.serf.wan: serf: query reply ID mismatch (Local: 324682337, Response: 1765095158)
May 07 07:53:48 hostXYZ consul[30719]: 2024-05-07T07:53:48.980Z [WARN] agent.server.serf.wan: serf: query reply ID mismatch (Local: 324682337, Response: 1978383755)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN] agent.server.serf.lan: serf: query reply ID mismatch (Local: 441950090, Response: 991518040)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN] agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 2126224088)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN] agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 2033316133)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN] agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 1562544182)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN] agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 1422278651)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN] agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 1742329462)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN] agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 1842479929)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN] agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 406350588)
Reproduction Steps
Consul info for both Client and Server
Server info
agent:
check_monitors = 16
check_ttls = 1
checks = 30
services = 26
build:
prerelease =
revision = 7ba4e1b+
version = 1.13.5+3
version_metadata =
consul:
acl = enabled
bootstrap = false
known_datacenters = 9
leader = false
leader_addr = 10.98.8.4:8498
server = true
raft:
applied_index = 2951624713
commit_index = 2951624713
fsm_pending = 0
last_contact = 1.991887ms
last_log_index = 2951624713
last_log_term = 7256
last_snapshot_index = 2951614510
last_snapshot_term = 7256
latest_configuration = [{Suffrage:Voter ID:9fbe7e25-0371-578d-ac8e-886b4dfa9a5d Address:10.98.8.4:8498} {Suffrage:Voter ID:542c9d7e-732c-5803-8e36-ba11168bdf99 Address:10.98.8.5:8498} {Suffrage:Voter ID:97e86b31-ae3e-52f7-8746-4876adbbf2c4 Address:10.98.9.44:8498} {Suffrage:Voter ID:5f909fab-fe67-5502-bf8a-f561c8c66e08 Address:10.98.10.65:8498} {Suffrage:Voter ID:8ce84c87-1b7c-5836-aa21-24e5aa087d78 Address:10.98.10.213:8498}]
latest_configuration_index = 0
num_peers = 4
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Follower
term = 7256
runtime:
arch = amd64
cpu_count = 128
goroutines = 9258
max_procs = 128
os = linux
version = go1.21.3 X:boringcrypto
serf_lan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 487
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 150611
members = 223
query_queue = 0
query_time = 310159
serf_wan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 144768
members = 33
query_queue = 0
query_time = 231467
too much redaction required for the whole thing config.hcl, some highlights below
bind_addr = "0.0.0.0"
serf_wan = "[::]"
translate_wan_addrs = true
ports = {
server = 8498
serf_lan = 8499
http = 8500
dns = 8600
serf_wan = 8399
grpc = -1
}
Client info
client info and config nearly identical to server above
Operating system and Environment details
Linux, 6.6.23 kernel, Debian bookworm
Log Fragments
May 07 07:53:48 hostXYZ consul[30719]: 2024-05-07T07:53:48.117Z [WARN] agent.server.serf.wan: serf: query reply ID mismatch (Local: 324682337, Response: 1978383755)
May 07 07:53:48 hostXYZ consul[30719]: 2024-05-07T07:53:48.129Z [WARN] agent.server.serf.wan: serf: query reply ID mismatch (Local: 324682337, Response: 1765095158)
May 07 07:53:48 hostXYZ consul[30719]: 2024-05-07T07:53:48.507Z [WARN] agent.server.serf.wan: serf: query reply ID mismatch (Local: 324682337, Response: 1765095158)
May 07 07:53:48 hostXYZ consul[30719]: 2024-05-07T07:53:48.980Z [WARN] agent.server.serf.wan: serf: query reply ID mismatch (Local: 324682337, Response: 1978383755)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN] agent.server.serf.lan: serf: query reply ID mismatch (Local: 441950090, Response: 991518040)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN] agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 2126224088)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN] agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 2033316133)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN] agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 1562544182)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN] agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 1422278651)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN] agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 1742329462)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN] agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 1842479929)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN] agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 406350588)
my guess is that the consul agents are seeing replies to queries they didn't initiate and are confused by that.