gRPC request rate spikes making service discovery unreliable
Overview of the Issue
To be perfectly honest, I am not sure if this is a bug, but I would like to get some feedback.
We run five Consul servers and 10+ Consul agents. Recently we converted our Prometheus stack from server mode to agent mode, using Consul service discovery. Naturally, with such a change, the request volume on the Consul servers increased several-fold. As a result, Consul service discovery became inconsistent, constantly missing certain services, and Consul started logging errors like this:
[ERROR] agent.rpcclient.health: subscribe call failed: err="stream reset requested" topic=ServiceHealth key=node-exporter failure_count=1
Could someone elaborate on this message 👆?
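For context, this is roughly the kind of watch each discovery client ends up issuing against its local agent; a minimal sketch assuming the official github.com/hashicorp/consul/api Go client, with the "node-exporter" service name taken from the log line above. As far as I can tell, on 1.10 the agent can serve such queries through its streaming health client (agent.rpcclient.health), which is where that subscribe error is logged.

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local Consul agent (default http://127.0.0.1:8500).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	var lastIndex uint64
	for {
		// Blocking query for healthy "node-exporter" instances; on 1.10 the
		// agent may serve this via its streaming health client.
		entries, meta, err := client.Health().Service("node-exporter", "", true, &api.QueryOptions{
			WaitIndex: lastIndex,
		})
		if err != nil {
			log.Printf("health query failed: %v", err)
			time.Sleep(time.Second)
			continue
		}
		lastIndex = meta.LastIndex
		fmt.Printf("node-exporter has %d healthy instances\n", len(entries))
	}
}
```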
I found a correlation between the gRPC request count and the issue described above: every time gRPC requests spike, service discovery starts to flap. Does anyone know why the gRPC request rate could be so erratic, given that none of our stack components generate such bursts? Is gRPC buffered in any way?
I imagine it should look more like this:
Also, exploring telemetry, I can see that every new gRPC request opens a dedicated stream (consul.grpc.server.request.count == consul.grpc.server.stream.count), so does it work in a unary request model?
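In case it helps, this is roughly how I am comparing those two counters; a rough sketch assuming the Go API client and a reachable server HTTP API (the address below is made up), since the consul.grpc.server.* metrics are emitted on the servers.

```go
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Point the client at one of the Consul servers; the address is hypothetical.
	cfg := api.DefaultConfig()
	cfg.Address = "consul-server.example:8500"

	client, err := api.NewClient(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// /v1/agent/metrics returns the agent's in-memory telemetry snapshot.
	metrics, err := client.Agent().Metrics()
	if err != nil {
		log.Fatal(err)
	}

	// Print the gRPC request and stream counters so they can be compared.
	for _, c := range metrics.Counters {
		if strings.HasSuffix(c.Name, "grpc.server.request.count") ||
			strings.HasSuffix(c.Name, "grpc.server.stream.count") {
			fmt.Printf("%s: count=%d sum=%.0f\n", c.Name, c.Count, c.Sum)
		}
	}
}
```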
Reproduction Steps
Steps to reproduce this issue, e.g.:
- Create a cluster with five server nodes
- Have at least 50 nodes registering services and reading the service list via service discovery (a rough sketch of this load follows the list)
- Catch metric gaps as Consul SD starts missing some targets
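A rough sketch of that load pattern, assuming the Go API client; the service name, ID, port, and poll interval here are made up:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Register a dummy service on this agent.
	err = client.Agent().ServiceRegister(&api.AgentServiceRegistration{
		ID:   "demo-exporter-1",
		Name: "demo-exporter",
		Port: 9100,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Poll the catalog the way a discovery client would and watch for
	// services disappearing between iterations.
	for {
		services, _, err := client.Catalog().Services(nil)
		if err != nil {
			log.Printf("catalog read failed: %v", err)
		} else {
			fmt.Printf("catalog currently lists %d services\n", len(services))
		}
		time.Sleep(30 * time.Second)
	}
}
```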
Consul info for Server
Server info
agent:
check_monitors = 0
check_ttls = 0
checks = 0
services = 5
build:
prerelease =
revision = db839f18
version = 1.10.1
consul:
acl = enabled
bootstrap = false
known_datacenters = 1
leader = false
leader_addr = [2a02:4780:8:a::7]:8300
server = true
raft:
applied_index = 166325506
commit_index = 166325506
fsm_pending = 0
last_contact = 66.952093ms
last_log_index = 166325507
last_log_term = 32648
last_snapshot_index = 166316351
last_snapshot_term = 32648
latest_configuration = [<list of hosts>]
latest_configuration_index = 0
num_peers = 5
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Follower
term = 32648
runtime:
arch = amd64
cpu_count = 8
goroutines = 3156
max_procs = 8
os = linux
version = go1.16.6
serf_lan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 1676
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 45671
members = 89
query_queue = 0
query_time = 343
serf_wan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 7590
members = 1
query_queue = 0
query_time = 1
> Also, exploring telemetry, I can see that every new gRPC request opens a dedicated stream (consul.grpc.server.request.count == consul.grpc.server.stream.count), so does it work in a unary request model?
I think the only gRPC connection in the Consul version you are using is the streaming one, so every client gRPC request establishes a stream to the server. Therefore, the two counts are equal.
@huikang, I appreciate you taking the time to answer this. Yes, it seems most of the gRPC connections are coming from our Prometheus stack because of its configured refresh interval. However, I cannot find solid information that would help me decide whether these gRPC spikes are the cause or merely a symptom of my missing Prometheus service targets 😮💨