gRPC request rate spikes making service discovery unreliable
Overview of the Issue
To be perfectly honest, I am not sure if this is a bug, but I would like to get some feedback.
We run five Consul servers and 10+ Consul agents. Recently we converted our Prometheus stack from server mode to agent mode, using Consul service discovery. Naturally, with such a change, the request volume on the Consul servers increased several-fold. As a result, Consul service discovery became inconsistent, constantly missing certain services, and Consul started logging errors like this:
[ERROR] agent.rpcclient.health: subscribe call failed: err="stream reset requested" topic=ServiceHealth key=node-exporter failure_count=1
Could someone elaborate on this message 👆?
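For context, this is roughly the kind of watch each discovery client ends up issuing against its local agent; a minimal sketch assuming the official github.com/hashicorp/consul/api Go client, with the "node-exporter" service name taken from the log line above. As far as I can tell, on 1.10 the agent can serve such queries through its streaming health client (agent.rpcclient.health), which is where that subscribe error is logged.

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local Consul agent (default http://127.0.0.1:8500).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	var lastIndex uint64
	for {
		// Blocking query for healthy "node-exporter" instances; on 1.10 the
		// agent may serve this via its streaming health client.
		entries, meta, err := client.Health().Service("node-exporter", "", true, &api.QueryOptions{
			WaitIndex: lastIndex,
		})
		if err != nil {
			log.Printf("health query failed: %v", err)
			time.Sleep(time.Second)
			continue
		}
		lastIndex = meta.LastIndex
		fmt.Printf("node-exporter has %d healthy instances\n", len(entries))
	}
}
```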
I found a correlation between the gRPC request count and the issue described above: every time gRPC requests spike, service discovery starts to flap. Does anyone know why the gRPC request rate could be so erratic, given that none of our stack components generate such bursts? Is gRPC buffered in any way?
I imagine it should look more like this:
Also, exploring telemetry, I can see that every new gRPC request opens a dedicated stream (consul.grpc.server.request.count == consul.grpc.server.stream.count), so does it work in a unary request model?
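In case it helps, this is roughly how I am comparing those two counters; a rough sketch assuming the Go API client and a reachable server HTTP API (the address below is made up), since the consul.grpc.server.* metrics are emitted on the servers.

```go
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Point the client at one of the Consul servers; the address is hypothetical.
	cfg := api.DefaultConfig()
	cfg.Address = "consul-server.example:8500"

	client, err := api.NewClient(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// /v1/agent/metrics returns the agent's in-memory telemetry snapshot.
	metrics, err := client.Agent().Metrics()
	if err != nil {
		log.Fatal(err)
	}

	// Print the gRPC request and stream counters so they can be compared.
	for _, c := range metrics.Counters {
		if strings.HasSuffix(c.Name, "grpc.server.request.count") ||
			strings.HasSuffix(c.Name, "grpc.server.stream.count") {
			fmt.Printf("%s: count=%d sum=%.0f\n", c.Name, c.Count, c.Sum)
		}
	}
}
```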
Reproduction Steps
Steps to reproduce this issue, e.g.:
- Create a cluster with five server nodes
- Have at least 50 nodes registering services and reading the service list via service discovery (a rough sketch of this load follows the list)
- Catch metric gaps as Consul SD starts missing some targets
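A rough sketch of that load pattern, assuming the Go API client; the service name, ID, port, and poll interval here are made up:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Register a dummy service on this agent.
	err = client.Agent().ServiceRegister(&api.AgentServiceRegistration{
		ID:   "demo-exporter-1",
		Name: "demo-exporter",
		Port: 9100,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Poll the catalog the way a discovery client would and watch for
	// services disappearing between iterations.
	for {
		services, _, err := client.Catalog().Services(nil)
		if err != nil {
			log.Printf("catalog read failed: %v", err)
		} else {
			fmt.Printf("catalog currently lists %d services\n", len(services))
		}
		time.Sleep(30 * time.Second)
	}
}
```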
Consul info for Server
Server info
agent:
check_monitors = 0
check_ttls = 0
checks = 0
services = 5
build:
prerelease =
revision = db839f18
version = 1.10.1
consul:
acl = enabled
bootstrap = false
known_datacenters = 1
leader = false
leader_addr = [2a02:4780:8:a::7]:8300
server = true
raft:
applied_index = 166325506
commit_index = 166325506
fsm_pending = 0
last_contact = 66.952093ms
last_log_index = 166325507
last_log_term = 32648
last_snapshot_index = 166316351
last_snapshot_term = 32648
latest_configuration = [<list of hosts>]
latest_configuration_index = 0
num_peers = 5
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Follower
term = 32648
runtime:
arch = amd64
cpu_count = 8
goroutines = 3156
max_procs = 8
os = linux
version = go1.16.6
serf_lan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 1676
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 45671
members = 89
query_queue = 0
query_time = 343
serf_wan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 7590
members = 1
query_queue = 0
query_time = 1
> Also, exploring telemetry, I can see that every new gRPC request opens a dedicated stream (consul.grpc.server.request.count == consul.grpc.server.stream.count), so does it work in a unary request model?
I think the only gRPC connection in the Consul version you are using is the streaming one, so every client gRPC request establishes a stream to the server. Therefore, the two counts are equal.
@huikang, I appreciate you taking the time to answer this. Yes, it seems most of the gRPC connections are coming from our Prometheus stack because of its configured refresh interval. However, I cannot find solid information that would help me decide whether these gRPC spikes are the cause or merely a symptom of my missing Prometheus service targets 😮💨