consul /v1/health/service/:service?cached endpoint returns no entries at all even when the uncached endpoint does

Overview of the Issue

The /v1/health/service/:service?cached endpoint sometimes returns an empty list, even though the same query without ?cached returns entries.

Reproduction Steps

On a cluster of any size and any configuration, restart a client-configured node as a server by adding extra JSON config such as:

{"leave_on_terminate": true, "server": true, "skip_leave_on_interrupt": false, "node_meta": {"resilience_node": "true"}}

Then, some of the time, querying a service through the cached endpoint returns an empty list, while the same query without the ?cached parameter returns the full list of nodes.

curl -s "http://127.0.0.1:8500/v1/health/service/service-x?cached"
[]

curl -s "http://127.0.0.1:8500/v1/health/service/service-x"
[{...},{...}]
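The comparison above can be scripted so the mismatch is easy to spot. This is only a minimal sketch: it assumes a local agent on 127.0.0.1:8500 and that curl and jq are installed, and the function names health_count and classify_counts are made up for illustration, not part of Consul.

```bash
#!/usr/bin/env bash
# Sketch only: compare the cached and uncached health endpoints for a service.
# Assumptions: local agent address below, curl and jq available.
# health_count and classify_counts are hypothetical helper names.

CONSUL_ADDR="${CONSUL_ADDR:-http://127.0.0.1:8500}"

# Print the number of entries the health endpoint returns.
# $1 = service name, $2 = optional query string (for example "cached").
health_count() {
  local service="$1" query="${2:-}"
  curl -s "${CONSUL_ADDR}/v1/health/service/${service}${query:+?${query}}" | jq length
}

# Classify a pair of counts.
# $1 = count from the cached endpoint, $2 = count from the uncached endpoint.
classify_counts() {
  local cached="$1" uncached="$2"
  if [ "$cached" -eq 0 ] && [ "$uncached" -gt 0 ]; then
    echo "stale-empty-cache"   # the symptom described in this report
  elif [ "$cached" -eq "$uncached" ]; then
    echo "consistent"
  else
    echo "differs"
  fi
}
```

Possible usage against a live agent: classify_counts "$(health_count service-x cached)" "$(health_count service-x)"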

Consul info for both Client and Server

Client info

x@y~$ consul info
agent:
	check_monitors = 35
	check_ttls = 0
	checks = 86
	services = 72
build:
	prerelease = 
	revision = 895390c+
	version = 1.16.6
	version_metadata = 
consul:
	acl = enabled
	bootstrap = false
	known_datacenters = 2
	leader = true
	leader_addr = 10.131.8.20:8498
	server = true
raft:
	applied_index = 2879871686
	commit_index = 2879871686
	fsm_pending = 0
	last_contact = 0
	last_log_index = 2879871686
	last_log_term = 43501
	last_snapshot_index = 2879866307
	last_snapshot_term = 43501
	latest_configuration = [{Suffrage:Voter ID:9b8e04a0-72cc-5f4b-b3d8-c21e9b05792d Address:10.131.8.20:8498} {Suffrage:Voter ID:be43dee9-05dd-57c4-b334-55e7127c3d18 Address:10.131.8.15:8498} {Suffrage:Voter ID:85a27efe-0d54-5ec9-8c0a-5e3531752d73 Address:10.131.8.14:8498}]
	latest_configuration_index = 0
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 43501
runtime:
	arch = amd64
	cpu_count = 192
	goroutines = 2575
	max_procs = 192
	os = linux
	version = go1.22.3 X:boringcrypto
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 2837
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1031477
	members = 14
	query_queue = 0
	query_time = 96884
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 5
	member_time = 966242
	members = 11
	query_queue = 0
	query_time = 57106

Operating system and Environment details

Linux, amd64 and arm64, Debian bookworm

Log Fragments

No log errors noted.

Jul 15 '24 16:07 markblackman

Still observed with 1.18.2.

Aug 06 '24 15:08 markblackman

A more reliable way to reproduce the effect is with two terminal sessions on a peer server node.

Terminal session A on a server node:

while true; do curl -s 'http://127.0.0.1:8500/v1/health/service/service-x?cached&ua=cli-query' | jq length; sleep 0.25; done

Terminal session B on the same server node:

sudo systemctl stop consul; sudo rm -rf /state/consul/raft; sudo systemctl start consul

Observe that the number of service-x instances reported by the cached endpoint in session A drops from some non-zero number to zero, and stays there until consul is restarted again or the upstream connections to the leader are killed with ss -K 'dst :8498'.
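To capture when the drop happens, the polling loop can print a timestamped line only when the count changes. Again, this is just a sketch under the same assumptions (local agent on 127.0.0.1:8500, curl and jq installed); report_change and watch_cached are hypothetical names.

```bash
#!/usr/bin/env bash
# Sketch only: log transitions in the cached endpoint's entry count.
# report_change and watch_cached are hypothetical helper names.

# Print a UTC-timestamped line when the count differs from the previous one.
# $1 = previous count (may be empty on the first poll), $2 = current count.
report_change() {
  local prev="$1" cur="$2"
  if [ "$prev" != "$cur" ]; then
    printf '%s cached count changed: %s -> %s\n' "$(date -u +%FT%TZ)" "$prev" "$cur"
  fi
}

# Poll the cached endpoint every 0.25 s, as in session A, and report changes.
# $1 = service name. Runs until interrupted.
watch_cached() {
  local service="$1" prev="" cur
  while true; do
    cur="$(curl -s "http://127.0.0.1:8500/v1/health/service/${service}?cached" | jq length)"
    report_change "$prev" "$cur"
    prev="$cur"
    sleep 0.25
  done
}
```

Running watch_cached service-x in session A, then the stop/remove/start sequence in session B, should log the moment the count goes to zero and the moment it recovers.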

Aug 06 '24 16:08 markblackman