/v1/health/service/:service?cached endpoint returns no entries at all even when the uncached endpoint does
Overview of the Issue
The /v1/health/service/:service?cached endpoint returns an empty list even when the uncached endpoint returns entries.
Reproduction Steps
On a cluster of any size and configuration, restart a client-configured node as a server by adding extra JSON config such as:
{"leave_on_terminate": true, "server": true, "skip_leave_on_interrupt": false, "node_meta": {"resilience_node": "true"}}
Afterwards, querying a service through the cached endpoint intermittently returns an empty list, while the same query without the ?cached parameter returns the full list of nodes.
curl -s "http://127.0.0.1:8500/v1/health/service/service-x?cached"
[]
curl -s "http://127.0.0.1:8500/v1/health/service/service-x"
[{...},{...}]
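The two manual queries above can be wrapped in a small helper that compares the counts directly. This is only a sketch: `CONSUL_ADDR`, `count_instances`, `classify_counts`, and `compare_once` are hypothetical names chosen for illustration, and it assumes `curl` and `jq` are installed and the local agent listens on 127.0.0.1:8500.

```shell
#!/usr/bin/env bash
# Sketch: compare instance counts from the cached and uncached health endpoints.
# CONSUL_ADDR and all function names are illustrative, not part of Consul.
CONSUL_ADDR="${CONSUL_ADDR:-http://127.0.0.1:8500}"

# Print the number of entries returned for service $1; $2 is an optional query string.
count_instances() {
  curl -s "${CONSUL_ADDR}/v1/health/service/$1$2" | jq length
}

# Classify a (cached, uncached) pair of counts.
classify_counts() {
  local cached="$1" uncached="$2" n
  # Reject empty or non-numeric input (for example when the request failed).
  for n in "$cached" "$uncached"; do
    case "$n" in
      ''|*[!0-9]*) echo "error"; return 1 ;;
    esac
  done
  if [ "$cached" -eq 0 ] && [ "$uncached" -gt 0 ]; then
    echo "stale-empty"   # the symptom described in this issue
  elif [ "$cached" -ne "$uncached" ]; then
    echo "differs"
  else
    echo "match"
  fi
}

# Query both endpoints once and print the counts with their classification.
compare_once() {
  local service="$1" cached uncached
  cached="$(count_instances "$service" '?cached')"
  uncached="$(count_instances "$service" '')"
  echo "cached=${cached} uncached=${uncached} $(classify_counts "$cached" "$uncached")"
}
```

After sourcing the file, `compare_once service-x` prints one line per invocation. A brief "differs" result can be expected while the cache catches up; "stale-empty" is the case reported here.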
Consul info for both Client and Server
Client info
x@y~$ consul info
agent:
check_monitors = 35
check_ttls = 0
checks = 86
services = 72
build:
prerelease =
revision = 895390c+
version = 1.16.6
version_metadata =
consul:
acl = enabled
bootstrap = false
known_datacenters = 2
leader = true
leader_addr = 10.131.8.20:8498
server = true
raft:
applied_index = 2879871686
commit_index = 2879871686
fsm_pending = 0
last_contact = 0
last_log_index = 2879871686
last_log_term = 43501
last_snapshot_index = 2879866307
last_snapshot_term = 43501
latest_configuration = [{Suffrage:Voter ID:9b8e04a0-72cc-5f4b-b3d8-c21e9b05792d Address:10.131.8.20:8498} {Suffrage:Voter ID:be43dee9-05dd-57c4-b334-55e7127c3d18 Address:10.131.8.15:8498} {Suffrage:Voter ID:85a27efe-0d54-5ec9-8c0a-5e3531752d73 Address:10.131.8.14:8498}]
latest_configuration_index = 0
num_peers = 2
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Leader
term = 43501
runtime:
arch = amd64
cpu_count = 192
goroutines = 2575
max_procs = 192
os = linux
version = go1.22.3 X:boringcrypto
serf_lan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 2837
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 1031477
members = 14
query_queue = 0
query_time = 96884
serf_wan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 5
member_time = 966242
members = 11
query_queue = 0
query_time = 57106
Operating system and Environment details
Linux, amd64 and arm64, Debian bookworm
Log Fragments
No log errors noted.
Still observed with 1.18.2.
A more reliable way to reproduce the effect is to use two terminal sessions on a peer server node.
Terminal session A on a server node:
while true; do curl -s 'http://127.0.0.1:8500/v1/health/service/service-x?cached&ua=cli-query' | jq length; sleep 0.25; done
Terminal session B on the same server node:
sudo systemctl stop consul; sudo rm -rf /state/consul/raft; sudo systemctl start consul
Observe that the number of service-x instances reported by the cached endpoint in session A drops from a non-zero number to zero and stays at zero until Consul is restarted again or the upstream connections to the leader are killed with ss -K 'dst :8498'.
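To make the drop easier to spot than in a scrolling list of numbers, the polling loop from session A could log only the transitions, with timestamps. This is a rough sketch under the same assumptions as the original loop (`curl`, `jq`, agent on 127.0.0.1:8500); `detect_transition` and `watch_cached` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: log when the cached instance count drops to zero or recovers.
# CONSUL_ADDR and the function names are illustrative, not part of Consul.
CONSUL_ADDR="${CONSUL_ADDR:-http://127.0.0.1:8500}"

# Classify the change between two consecutive counts.
detect_transition() {
  local prev="$1" curr="$2"
  if [ "$prev" -gt 0 ] && [ "$curr" -eq 0 ]; then
    echo "dropped"
  elif [ "$prev" -eq 0 ] && [ "$curr" -gt 0 ]; then
    echo "recovered"
  else
    echo "steady"
  fi
}

# Poll the cached endpoint for service $1 and print a timestamped line on each transition.
watch_cached() {
  local service="$1" prev=0 curr event
  while true; do
    curr="$(curl -s "${CONSUL_ADDR}/v1/health/service/${service}?cached" | jq length 2>/dev/null)"
    curr="${curr:-0}"   # treat a failed request (agent stopped) as zero entries
    event="$(detect_transition "$prev" "$curr")"
    if [ "$event" != "steady" ]; then
      echo "$(date -u +%FT%TZ) ${event}: ${prev} -> ${curr}"
    fi
    prev="$curr"
    sleep 0.25
  done
}
```

Running `watch_cached service-x` in session A while performing the restart in session B should log a "dropped" line when the agent stops. With the behaviour described above, no matching "recovered" line appears until Consul is restarted again or the connections to the leader are killed.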