consul
consul copied to clipboard
Consul v1.12.2 - rpc error making call: i/o deadline reached
Overview of the Issue
After upgrade to Consul v1.12.2, we start seeing on the clients a lot of these errors:
[ERROR] agent.client: RPC failed to server: method=ConnectCA.Roots server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: rpc error making call: i/o deadline reached" [ERROR] agent.client: RPC failed to server: method=Intention.Match server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: i/o deadline reached" [ERROR] agent.client: RPC failed to server: method=ConfigEntry.ResolveServiceConfig server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: i/o deadline reached" [ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: i/o deadline reached"
Although Consul is throwing the erros, everything seems to work ok. We use Consul with ACL + TLS + Connect enabled + Nomad Downgrading to Consul v1.11.3 (our previous version) the errors stop.
Consul info for both Client and Server
Client info
agent:
check_monitors = 0
check_ttls = 0
checks = 2
services = 2
build:
prerelease =
revision = 19041f20
version = 1.12.2
version_metadata =
consul:
acl = enabled
known_servers = 3
server = false
runtime:
arch = arm64
cpu_count = 4
goroutines = 117
max_procs = 4
os = linux
version = go1.18.1
serf_lan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 21
failed = 0
health_score = 0
intent_queue = 0
left = 7
member_time = 87554
members = 16
query_queue = 0
query_time = 1
Server info
agent:
check_monitors = 0
check_ttls = 1
checks = 4
services = 4
build:
prerelease =
revision = 19041f20
version = 1.12.2
version_metadata =
consul:
acl = enabled
bootstrap = false
known_datacenters = 1
leader = true
leader_addr = xxx.xxx.xxx.xxx:8300
server = true
raft:
applied_index = 8198113
commit_index = 8198113
fsm_pending = 0
last_contact = 0
last_log_index = 8198113
last_log_term = 189
last_snapshot_index = 8188080
last_snapshot_term = 189
latest_configuration = [{Suffrage:Voter ID:XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX Address:xxx.xxx.xxx.xxx:8300} {Suffrage:Voter ID:XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX Address:xxx.xxx.xxx.xxx:8300} {Suffrage:Voter ID:XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX Address:xxx.xxx.xxx.xxx:8300}]
latest_configuration_index = 0
num_peers = 2
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Leader
term = 189
runtime:
arch = arm64
cpu_count = 2
goroutines = 378
max_procs = 2
os = linux
version = go1.18.1
serf_lan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 21
failed = 0
health_score = 0
intent_queue = 0
left = 7
member_time = 87556
members = 16
query_queue = 0
query_time = 1
serf_wan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 123
members = 3
query_queue = 0
query_time = 1
Operating system and Environment details
AWS EC2, Debian11 / Ubuntu Server 20.04 (amd64/amr64)
Log Fragments
[ERROR] agent.client: RPC failed to server: method=ConfigEntry.Get server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: i/o deadline reached" [ERROR] agent.client: RPC failed to server: method=ConnectCA.Roots server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: i/o deadline reached" [ERROR] agent.client: RPC failed to server: method=ConfigEntry.ResolveServiceConfig server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: i/o deadline reached" [ERROR] agent.client: RPC failed to server: method=Intention.Match server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: i/o deadline reached" [ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: i/o deadline reached" [ERROR] agent.client: RPC failed to server: method=ConfigEntry.ResolveServiceConfig server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: i/o deadline reached" [ERROR] agent.client: RPC failed to server: method=ConfigEntry.ResolveServiceConfig server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: i/o deadline reached" [ERROR] agent.client: RPC failed to server: method=Intention.Match server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: i/o deadline reached" [ERROR] agent.client: RPC failed to server: method=ConfigEntry.Get server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: i/o deadline reached" [ERROR] agent.client: RPC failed to server: method=ConfigEntry.ResolveServiceConfig server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: i/o deadline reached" [ERROR] agent.client: RPC failed to server: method=ConnectCA.Roots server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: i/o deadline reached"
Hey @ngcmac
Thanks for reporting this issue! v1.12.2 fixed another reported issue that was emitting this same io deadline reached
message, so i'm wondering if this is another manifestation of the same problem.
Are all your servers/clients upgraded to v1.12.2?
Hi @Amier3 ,
Yes, we did experience the problems reported on v1.12.1, so we waited until v1.12.2 was released .
We currently experience this issue on 2 or our environments:
- CI: 1 single Consul server, a few clients. in-house, amd64, ubuntu server 20.04
- PRE: 3 Consul servers, a few clients. AWS EC2, arm64, Debian11
All the servers and clients are on v1.12.2.
I saw on v1.12.0, GH-12678, could it be related?
Regards.
I'm seeing exactly the same thing on v1.12.2
. Our edge proxy is traefik using its native Consul Connect integration. Regularly in the traefik and consul logs there are io deadline reached
errors.
From the traefik side:
{"@level":"error","@message":"Watch errored","@module":"consulcatalog.watch","@timestamp":"2022-07-11T12:51:14.297545Z","error":"Unexpected response code: 500 (rpc error making call: i/o deadline reached)","retry":5000000000,"type":"connect_roots"}
{"@level":"error","@message":"Watch errored","@module":"consulcatalog.watch","@timestamp":"2022-07-11T13:01:22.656641Z","error":"Unexpected response code: 500 (rpc error making call: rpc error making call: i/o deadline reached)","retry":5000000000,"type":"connect_roots"}
I can't see any actual impact though, everything seems to be working.
@seanamos @ngcmac
It might be helpful to see the full, timestamped logs that capture before and after this error message to understand what might be happening. It would also be helpful to see the agent config for one of the clients.
I suspect there are some internal RPC calls which expect to block indefinitely but are getting timed out and re-established. @seanamos 's log fragments could suggest that our cache layer is timing out every 10 minutes: https://github.com/hashicorp/consul/blob/12acdd74814bef9d8d8bef1c14dd29dc7da64085/agent/cache-types/options.go#L21
I suspect there are some internal RPC calls which expect to block indefinitely but are getting timed out and re-established. @seanamos 's log fragments could suggest that our cache layer is timing out every 10 minutes:
https://github.com/hashicorp/consul/blob/12acdd74814bef9d8d8bef1c14dd29dc7da64085/agent/cache-types/options.go#L21
The errors do seem to come in waves every 10 minutes.
@Amier3 , here's some timestamped logs throwing the error:
Jul 11 15:19:39 <client_name> consul[14732]: 2022-07-11T15:19:39.668+0100 [ERROR] agent.client: RPC failed to server: method=ConfigEntry.ResolveServiceConfig server=<server_ip>:8300 error="rpc error making call: i/o deadline reached" Jul 11 15:21:33 <client_name> consul[14732]: 2022-07-11T15:21:33.908+0100 [ERROR] agent.client: RPC failed to server: method=Intention.Match server=<server_ip>:8300 error="rpc error making call: i/o deadline reached" Jul 11 15:25:02 <client_name> consul[14732]: 2022-07-11T15:25:02.495+0100 [ERROR] agent.client: RPC failed to server: method=ConfigEntry.ResolveServiceConfig server=<server_ip>:8300 error="rpc error making call: i/o deadline reached" Jul 11 15:25:29 <client_name> consul[14732]: 2022-07-11T15:25:29.947+0100 [ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=<server_ip>:8300 error="rpc error making call: i/o deadline reached" Jul 11 15:29:49 <client_name> consul[14732]: 2022-07-11T15:29:49.990+0100 [ERROR] agent.client: RPC failed to server: method=ConfigEntry.ResolveServiceConfig server=<server_ip>:8300 error="rpc error making call: i/o deadline reached" Jul 11 15:38:55 <client_name> consul[14732]: 2022-07-11T15:38:55.572+0100 [ERROR] agent.client: RPC failed to server: method=ConnectCA.Roots server=<server_ip>:8300 error="rpc error making call: i/o deadline reached" Jul 11 15:42:13 <client_name> consul[14732]: 2022-07-11T15:42:13.375+0100 [ERROR] agent.client: RPC failed to server: method=Intention.Match server=<server_ip>:8300 error="rpc error making call: i/o deadline reached" Jul 11 15:48:04 <client_name> consul[14732]: 2022-07-11T15:48:04.232+0100 [ERROR] agent.client: RPC failed to server: method=Intention.Match server=<server_ip>:8300 error="rpc error making call: i/o deadline reached" Jul 11 15:49:14 <client_name> consul[14732]: 2022-07-11T15:49:14.865+0100 [ERROR] agent.client: RPC failed to server: method=ConnectCA.Roots server=<server_ip>:8300 error="rpc error making call: i/o deadline reached" Jul 11 15:52:42 <client_name> consul[14732]: 2022-07-11T15:52:42.034+0100 [ERROR] agent.client: RPC failed to server: method=Intention.Match server=<server_ip>:8300 error="rpc error making call: i/o deadline reached" Jul 11 15:55:42 <client_name> consul[14732]: 2022-07-11T15:55:42.234+0100 [ERROR] agent.client: RPC failed to server: method=ConfigEntry.ResolveServiceConfig server=<server_ip>:8300 error="rpc error making call: i/o deadline reached"
I did not find a pattern on the timestamps. Servers are not logging any error btw, i only see this on the clients.
TRACE Logs
Jul 11 11:10:36 <client_name> consul[29915]: 2022-07-11T11:10:36.732+0100 [DEBUG] agent: Check status updated: check=_nomad-check-3fc90f7e9c14f2403475594a87ce237d20c74752 status=passing Jul 11 11:10:41 <client_name> consul[29915]: 2022-07-11T11:10:41.951+0100 [DEBUG] agent.http: Request finished: method=GET url=/v1/status/leader from=[::1]:42162 latency=732.789µs Jul 11 11:10:41 <client_name> consul[29915]: 2022-07-11T11:10:41.956+0100 [DEBUG] agent.http: Request finished: method=GET url=/v1/status/peers from=[::1]:42164 latency=523.774µs Jul 11 11:10:42 <client_name> consul[29915]: 2022-07-11T11:10:42.045+0100 [DEBUG] agent.http: Request finished: method=GET url=/v1/agent/self from=127.0.0.1:33684 latency=917.242µs Jul 11 11:10:42 <client_name> consul[29915]: 2022-07-11T11:10:42.572+0100 [ERROR] agent.client: RPC failed to server: method=ConnectCA.Roots server=<server_ip>:8300 error="rpc error making call: i/o deadline reached" Jul 11 11:10:42 <client_name> consul[29915]: 2022-07-11T11:10:42.573+0100 [WARN] agent.cache: handling error in Cache.Notify: cache-type=connect-ca-root error="rpc error making call: i/o deadline reached" index=9 Jul 11 11:10:43 <client_name> consul[29915]: 2022-07-11T11:10:43.551+0100 [DEBUG] agent.client.memberlist.lan: memberlist: Initiating push/pull sync with: <client_name> <client_ip>:8301 Jul 11 11:10:46 <client_name> consul[29915]: 2022-07-11T11:10:46.735+0100 [DEBUG] agent: Check status updated: check=_nomad-check-3fc90f7e9c14f2403475594a87ce237d20c74752 status=passing
INFO Logs
Jul 11 07:30:02 <client_name> consul[14732]: 2022-07-11T07:30:02.546+0100 [ERROR] agent.client: RPC failed to server: method=Intention.Match server=<server_ip>:8300 error="rpc error making call: i/o deadline reached" Jul 11 07:30:02 <client_name> consul[14732]: 2022-07-11T07:30:02.546+0100 [WARN] agent.cache: handling error in Cache.Notify: cache-type=intention-match error="rpc error making call: i/o deadline reached" index=9012818 Jul 11 07:36:34 <client_name> consul[14732]: 2022-07-11T07:36:34.748+0100 [ERROR] agent.client: RPC failed to server: method=ConfigEntry.ResolveServiceConfig server=<server_ip>:8300 error="rpc error making call: i/o deadline reached" Jul 11 07:36:34 <client_name> consul[14732]: 2022-07-11T07:36:34.748+0100 [WARN] agent.cache: handling error in Cache.Notify: cache-type=resolved-service-config error="rpc error making call: i/o deadline reached" index=9012818 Jul 11 07:37:44 <client_name> consul[14732]: 2022-07-11T07:37:44.518+0100 [ERROR] agent.client: RPC failed to server: method=ConnectCA.Roots server=<server_ip>:8300 error="rpc error making call: i/o deadline reached" Jul 11 07:37:44 <client_name> consul[14732]: 2022-07-11T07:37:44.518+0100 [WARN] agent.cache: handling error in Cache.Notify: cache-type=connect-ca-root error="rpc error making call: i/o deadline reached" index=9 Jul 11 07:37:44 <client_name> consul[14732]: 2022-07-11T07:37:44.518+0100 [WARN] agent.cache: handling error in Cache.Notify: cache-type=connect-ca-root error="rpc error making call: i/o deadline reached" index=9
Client main config:
{
"node_name": "<node_name>",
"addresses":{
"http": "0.0.0.0",
"https": "0.0.0.0",
"dns": "0.0.0.0",
"grpc": "0.0.0.0"
},
"data_dir": "/opt/consul/data",
"log_level": "INFO",
"datacenter": "dc1",
"bind_addr": "<node_ip>",
"advertise_addr": "<node_ip>",
"ports": {
"server": 8300,
"serf_lan": 8301,
"serf_wan": 8302,
"http": 8500,
"https": 8501,
"grpc": 8502,
"dns": 8600,
"sidecar_min_port": 21000,
"sidecar_max_port": 21255,
"expose_min_port": 21500,
"expose_max_port": 21755
},
"server": false,
"retry_join": ["<server_dns>"],
"rejoin_after_leave": true,
"connect": {
"enabled": true
},
"encrypt": "<some_key>",
"tls": {
"defaults": {
"ca_file": "/opt/consul/certs/consul-ca-cert.pem",
"cert_file": "/opt/consul/certs/consul-server-cert.pem",
"key_file": "/opt/consul/certs/consul-server-key.pem",
"verify_incoming": true,
"verify_outgoing": true
},
"https": {
"verify_incoming": true
},
"internal_rpc": {
"verify_server_hostname": true,
"verify_incoming": true
},
"grpc": {
"verify_incoming": false //https://github.com/hashicorp/consul/issues/13088#
}
},
"auto_encrypt": {
"tls": true
},
"limits": {
"http_max_conns_per_client": 400,
"https_handshake_timeout": "15s",
"rpc_max_conns_per_client": 400,
"rpc_handshake_timeout": "15s"
}
}
Client ACL Config
{
"acl": {
"enabled": true,
"tokens": {
"agent": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
},
"default_policy": "deny",
"down_policy": "extend-cache",
"enable_token_persistence": true
}
}
Thks
We ran into this in our test environments today and had to pretty quickly roll back to v1.11. We mostly noticed them around the service health and discovery chain RPCs for our Connect integrated services. I agree they mostly seem to be around the caching layer, with messages such as:
agent.client: RPC failed to server: method=DiscoveryChain.Get server=xx:8300 error="rpc error making call: i/o deadline reached"
agent.cache: handling error in Cache.Notify: cache-type=compiled-discovery-chain error="rpc error making call: i/o deadline reached" index=37080582
agent.cache: handling error in Cache.Notify: cache-type=health-services error="rpc error making call: i/o deadline reached" index=37072324
agent.client: RPC failed to server: method=Health.ServiceNodes server=xxx:8300 error="rpc error making call: i/o deadline reached"
edit: I think I also back up the idea of the 10 minute timeout based on the volume for each distinct error.
I'm still getting this on 1.12.3, goes away after a restart.
I also have those errors in version 1.13.1 (Mesh/Connect in AWS EKS with external servers)
│ 2022-08-12T20:59:07.862Z [ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=10.150.14.197:8300 error="rpc error making call: i/o deadline reached" │
│ 2022-08-12T20:59:07.862Z [WARN] agent.cache: handling error in Cache.Notify: cache-type=compiled-discovery-chain error="rpc error making call: i/o deadline reached" index=589824 │
│ 2022-08-12T20:59:08.203Z [ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=10.150.14.197:8300 error="rpc error making call: i/o deadline reached" │
│ 2022-08-12T20:59:08.203Z [WARN] agent.cache: handling error in Cache.Notify: cache-type=compiled-discovery-chain error="rpc error making call: i/o deadline reached" index=589824 │
│ 2022-08-12T20:59:09.818Z [ERROR] agent.client: RPC failed to server: method=Internal.PeeredUpstreams server=10.150.14.197:8300 error="rpc error making call: i/o deadline reached" │
│ 2022-08-12T20:59:09.818Z [WARN] agent.cache: handling error in Cache.Notify: cache-type=peered-upstreams error="rpc error making call: i/o deadline reached" index=1 │
│ 2022-08-12T20:59:10.932Z [ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=10.150.14.197:8300 error="rpc error making call: i/o deadline reached" │
│ 2022-08-12T20:59:10.933Z [WARN] agent.cache: handling error in Cache.Notify: cache-type=compiled-discovery-chain error="rpc error making call: i/o deadline reached" index=589824
We're also seeing this issue with 1.13.1 with the Catalog.NodeServices
method. We are also seeing it in waves of every 10 minutes.
I believe this was caused by #11500. When the stream's FirstReadTimeout
is less than the server's blockingQuery
timeout then the stream is closed with an i/o deadline exceeded
error. This could be caused because there's jitter on both sides and if the client-side jitter was randomly less than the servers then the client will time out before the server. Ideally the client's read timeout should be set based on the maximum possible jitter. @wjordan does that seem plausible?
// Timeout after maximum jitter has elapsed.
q.MaxQueryTime += lib.RandomStagger(q.MaxQueryTime / JitterFraction)
Has the right comment above it but it shouldn't be using RandomStagger
I presume and should instead just be:
q.MaxQueryTime += q.MaxQueryTime / JitterFraction
Could very well be that, i got this error between agents in two different geographic locations with an unstable link
I verified the above fix on our dev environment. We saw no RPC failed to server
for the last 12 hours after deploying that fix.
Thank you @jameshartig for the investigation and quick fix 🕵️ It will be backported to the next patch versions of 1.12.5 and 1.13.2
I own the code for a Go client that uses the Hashicorp consul Go api. We are currently encountering this error a lot. I see that a fix will be out soon. Is there a change that I could make on the client side to avoid this error? If so, which timeout value would I need to change and what should I change it to? We are using the Consul 1.12.2 server and the 1.12.0 consul client api. Thanks.
Could very well be that, i got this error between agents in two different geographic locations with an unstable link
Could this occur on agents that have a stable link, but across a far distance? I have a dev cluster on 1.13.2, which supposedly has the fix, but i still get i/o deadline reached when interacting with the consul dashboard/api. Most our agents/leaders are in chicago, but we have another few that live off in London/Germany
I'm using a raft_mulitplier of 1. I wonder if i need to increase that
Could very well be that, i got this error between agents in two different geographic locations with an unstable link
Could this occur on agents that have a stable link, but across a far distance? I have a dev cluster on 1.13.2, which supposedly has the fix, but i still get i/o deadline reached when interacting with the consul dashboard/api. Most our agents/leaders are in chicago, but we have another few that live off in London/Germany
I'm using a raft_mulitplier of 1. I wonder if i need to increase that
Are both clients and servers on 1.13.2? RPC timeouts can technically still happen with large enough clusters if they exceed 7 seconds while building a response.
A similar issue was raised in https://github.com/hashicorp/consul/issues/14732. In 1.13.3 we introduced rpc_client_timeout
docs which defaults to 60s (instead of 7s rpc_hold_timeout
) and can be tuned with no impact to raft stability.