consul icon indicating copy to clipboard operation
consul copied to clipboard

"agent.server.serf.lan: serf: query reply ID mismatch" triggered by widespread keyring queries

Open markblackman opened this issue 1 year ago • 1 comments

Overview of the Issue

When reading LAN and WAN keyrings (operator/keyring) from every node in a cluster (~300 node) which is also federated with several other similar cluster, we start seeing the following messages for the WAN and LAN gossip pools:

May 07 07:53:48 hostXYZ consul[30719]: 2024-05-07T07:53:48.117Z [WARN]  agent.server.serf.wan: serf: query reply ID mismatch (Local: 324682337, Response: 1978383755)
May 07 07:53:48 hostXYZ consul[30719]: 2024-05-07T07:53:48.129Z [WARN]  agent.server.serf.wan: serf: query reply ID mismatch (Local: 324682337, Response: 1765095158)
May 07 07:53:48 hostXYZ consul[30719]: 2024-05-07T07:53:48.507Z [WARN]  agent.server.serf.wan: serf: query reply ID mismatch (Local: 324682337, Response: 1765095158)
May 07 07:53:48 hostXYZ consul[30719]: 2024-05-07T07:53:48.980Z [WARN]  agent.server.serf.wan: serf: query reply ID mismatch (Local: 324682337, Response: 1978383755)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 441950090, Response: 991518040)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 2126224088)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 2033316133)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 1562544182)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 1422278651)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 1742329462)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 1842479929)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 406350588)


Reproduction Steps

Consul info for both Client and Server

Server info
agent:
	check_monitors = 16
	check_ttls = 1
	checks = 30
	services = 26
build:
	prerelease = 
	revision = 7ba4e1b+
	version = 1.13.5+3
	version_metadata = 
consul:
	acl = enabled
	bootstrap = false
	known_datacenters = 9
	leader = false
	leader_addr = 10.98.8.4:8498
	server = true
raft:
	applied_index = 2951624713
	commit_index = 2951624713
	fsm_pending = 0
	last_contact = 1.991887ms
	last_log_index = 2951624713
	last_log_term = 7256
	last_snapshot_index = 2951614510
	last_snapshot_term = 7256
	latest_configuration = [{Suffrage:Voter ID:9fbe7e25-0371-578d-ac8e-886b4dfa9a5d Address:10.98.8.4:8498} {Suffrage:Voter ID:542c9d7e-732c-5803-8e36-ba11168bdf99 Address:10.98.8.5:8498} {Suffrage:Voter ID:97e86b31-ae3e-52f7-8746-4876adbbf2c4 Address:10.98.9.44:8498} {Suffrage:Voter ID:5f909fab-fe67-5502-bf8a-f561c8c66e08 Address:10.98.10.65:8498} {Suffrage:Voter ID:8ce84c87-1b7c-5836-aa21-24e5aa087d78 Address:10.98.10.213:8498}]
	latest_configuration_index = 0
	num_peers = 4
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 7256
runtime:
	arch = amd64
	cpu_count = 128
	goroutines = 9258
	max_procs = 128
	os = linux
	version = go1.21.3 X:boringcrypto
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 487
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 150611
	members = 223
	query_queue = 0
	query_time = 310159
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 144768
	members = 33
	query_queue = 0
	query_time = 231467

too much redaction required for the whole thing config.hcl, some highlights below

bind_addr = "0.0.0.0"
serf_wan = "[::]"
translate_wan_addrs = true

ports = {
        server = 8498
        serf_lan = 8499
        http = 8500

        dns = 8600
        serf_wan = 8399
        grpc = -1
}
Client info

client info and config nearly identical to server above

Operating system and Environment details

Linux, 6.6.23 kernel, Debian bookworm

Log Fragments

May 07 07:53:48 hostXYZ consul[30719]: 2024-05-07T07:53:48.117Z [WARN]  agent.server.serf.wan: serf: query reply ID mismatch (Local: 324682337, Response: 1978383755)
May 07 07:53:48 hostXYZ consul[30719]: 2024-05-07T07:53:48.129Z [WARN]  agent.server.serf.wan: serf: query reply ID mismatch (Local: 324682337, Response: 1765095158)
May 07 07:53:48 hostXYZ consul[30719]: 2024-05-07T07:53:48.507Z [WARN]  agent.server.serf.wan: serf: query reply ID mismatch (Local: 324682337, Response: 1765095158)
May 07 07:53:48 hostXYZ consul[30719]: 2024-05-07T07:53:48.980Z [WARN]  agent.server.serf.wan: serf: query reply ID mismatch (Local: 324682337, Response: 1978383755)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 441950090, Response: 991518040)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 2126224088)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 2033316133)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 1562544182)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 1422278651)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 1742329462)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 1842479929)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 406350588)

markblackman avatar May 07 '24 12:05 markblackman

my guess is that the consul agents are seeing replies to queries they didn't initiate and are confused by that.

markblackman avatar May 07 '24 16:05 markblackman