
Consul v1.12.2 - rpc error making call: i/o deadline reached

Open ngcmac opened this issue 2 years ago • 9 comments

Overview of the Issue

After upgrading to Consul v1.12.2, we started seeing a lot of these errors on the clients:

[ERROR] agent.client: RPC failed to server: method=ConnectCA.Roots server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: rpc error making call: i/o deadline reached"
[ERROR] agent.client: RPC failed to server: method=Intention.Match server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: i/o deadline reached"
[ERROR] agent.client: RPC failed to server: method=ConfigEntry.ResolveServiceConfig server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: i/o deadline reached"
[ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: i/o deadline reached"

Although Consul is throwing these errors, everything seems to work OK. We use Consul with ACL + TLS + Connect enabled + Nomad. After downgrading to Consul v1.11.3 (our previous version), the errors stop.

Consul info for both Client and Server

Client info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 2
	services = 2
build:
	prerelease =
	revision = 19041f20
	version = 1.12.2
	version_metadata =
consul:
	acl = enabled
	known_servers = 3
	server = false
runtime:
	arch = arm64
	cpu_count = 4
	goroutines = 117
	max_procs = 4
	os = linux
	version = go1.18.1
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 21
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 7
	member_time = 87554
	members = 16
	query_queue = 0
	query_time = 1
Server info
agent:
	check_monitors = 0
	check_ttls = 1
	checks = 4
	services = 4
build:
	prerelease =
	revision = 19041f20
	version = 1.12.2
	version_metadata =
consul:
	acl = enabled
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = xxx.xxx.xxx.xxx:8300
	server = true
raft:
	applied_index = 8198113
	commit_index = 8198113
	fsm_pending = 0
	last_contact = 0
	last_log_index = 8198113
	last_log_term = 189
	last_snapshot_index = 8188080
	last_snapshot_term = 189
	latest_configuration = [{Suffrage:Voter ID:XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX Address:xxx.xxx.xxx.xxx:8300} {Suffrage:Voter ID:XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX Address:xxx.xxx.xxx.xxx:8300} {Suffrage:Voter ID:XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX Address:xxx.xxx.xxx.xxx:8300}]
	latest_configuration_index = 0
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 189
runtime:
	arch = arm64
	cpu_count = 2
	goroutines = 378
	max_procs = 2
	os = linux
	version = go1.18.1
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 21
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 7
	member_time = 87556
	members = 16
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 123
	members = 3
	query_queue = 0
	query_time = 1

Operating system and Environment details

AWS EC2, Debian 11 / Ubuntu Server 20.04 (amd64/arm64)

Log Fragments

[ERROR] agent.client: RPC failed to server: method=ConfigEntry.Get server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: i/o deadline reached"
[ERROR] agent.client: RPC failed to server: method=ConnectCA.Roots server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: i/o deadline reached"
[ERROR] agent.client: RPC failed to server: method=ConfigEntry.ResolveServiceConfig server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: i/o deadline reached"
[ERROR] agent.client: RPC failed to server: method=Intention.Match server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: i/o deadline reached"
[ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: i/o deadline reached"
[ERROR] agent.client: RPC failed to server: method=ConfigEntry.ResolveServiceConfig server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: i/o deadline reached"
[ERROR] agent.client: RPC failed to server: method=ConfigEntry.ResolveServiceConfig server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: i/o deadline reached"
[ERROR] agent.client: RPC failed to server: method=Intention.Match server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: i/o deadline reached"
[ERROR] agent.client: RPC failed to server: method=ConfigEntry.Get server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: i/o deadline reached"
[ERROR] agent.client: RPC failed to server: method=ConfigEntry.ResolveServiceConfig server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: i/o deadline reached"
[ERROR] agent.client: RPC failed to server: method=ConnectCA.Roots server=xxx.xxx.xxx.xxx:8300 error="rpc error making call: i/o deadline reached"

ngcmac avatar Jul 08 '22 16:07 ngcmac

Hey @ngcmac

Thanks for reporting this issue! v1.12.2 fixed another reported issue that was emitting this same i/o deadline reached message, so I'm wondering if this is another manifestation of the same problem.

Are all your servers/clients upgraded to v1.12.2?

Amier3 avatar Jul 08 '22 17:07 Amier3

Hi @Amier3 ,

Yes, we did experience the problems reported on v1.12.1, so we waited until v1.12.2 was released.

We currently experience this issue in 2 of our environments:

  • CI: a single Consul server and a few clients; in-house, amd64, Ubuntu Server 20.04
  • PRE: 3 Consul servers and a few clients; AWS EC2, arm64, Debian 11

All the servers and clients are on v1.12.2.

I saw GH-12678 on v1.12.0; could it be related?

Regards.

ngcmac avatar Jul 11 '22 08:07 ngcmac

I'm seeing exactly the same thing on v1.12.2. Our edge proxy is Traefik, using its native Consul Connect integration. The Traefik and Consul logs regularly contain i/o deadline reached errors.

From the Traefik side:

{"@level":"error","@message":"Watch errored","@module":"consulcatalog.watch","@timestamp":"2022-07-11T12:51:14.297545Z","error":"Unexpected response code: 500 (rpc error making call: i/o deadline reached)","retry":5000000000,"type":"connect_roots"}
{"@level":"error","@message":"Watch errored","@module":"consulcatalog.watch","@timestamp":"2022-07-11T13:01:22.656641Z","error":"Unexpected response code: 500 (rpc error making call: rpc error making call: i/o deadline reached)","retry":5000000000,"type":"connect_roots"}

I can't see any actual impact though, everything seems to be working.

seanamos avatar Jul 11 '22 13:07 seanamos

@seanamos @ngcmac

It might be helpful to see the full, timestamped logs covering the period before and after this error message, to understand what might be happening. It would also be helpful to see the agent config for one of the clients.

Amier3 avatar Jul 11 '22 14:07 Amier3

I suspect there are some internal RPC calls that expect to block indefinitely but are getting timed out and re-established. @seanamos's log fragments could suggest that our cache layer is timing out every 10 minutes: https://github.com/hashicorp/consul/blob/12acdd74814bef9d8d8bef1c14dd29dc7da64085/agent/cache-types/options.go#L21
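
For illustration, here is a minimal standalone sketch (not Consul's actual code; the names and durations are made up and scaled down to seconds) of why a background blocking watch whose per-call read deadline is shorter than its blocking-query wait would log an error, immediately re-establish the call, and repeat on every cycle without anything actually breaking:

package main

import (
	"context"
	"errors"
	"log"
	"os"
	"time"
)

// blockingQuery stands in for an internal long-polling RPC that the server
// holds open for up to maxWait. Names and structure are illustrative only.
func blockingQuery(ctx context.Context, maxWait time.Duration) error {
	select {
	case <-time.After(maxWait): // server returns once maxWait elapses
		return nil
	case <-ctx.Done(): // client-side read deadline fired first
		return errors.New("rpc error making call: i/o deadline reached")
	}
}

// watch re-establishes the query in a loop, the way a cache refresh would.
func watch(readDeadline, maxWait time.Duration, cycles int) {
	logger := log.New(os.Stderr, "", log.LstdFlags)
	for i := 0; i < cycles; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), readDeadline)
		if err := blockingQuery(ctx, maxWait); err != nil {
			// The error is logged and the watch immediately retries, so
			// everything keeps working but an error shows up every cycle.
			logger.Printf("[ERROR] agent.client: RPC failed to server: error=%q", err)
		}
		cancel()
	}
}

func main() {
	// Durations scaled down from minutes to seconds for a quick run:
	// a read deadline shorter than the blocking wait fails every cycle.
	watch(1*time.Second, 2*time.Second, 3)
}

If the per-call deadline were instead padded past the longest possible wait, the loop above would stay silent.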

kisunji avatar Jul 11 '22 14:07 kisunji

I suspect there are some internal RPC calls that expect to block indefinitely but are getting timed out and re-established. @seanamos's log fragments could suggest that our cache layer is timing out every 10 minutes:

https://github.com/hashicorp/consul/blob/12acdd74814bef9d8d8bef1c14dd29dc7da64085/agent/cache-types/options.go#L21

The errors do seem to come in waves every 10 minutes.

seanamos avatar Jul 11 '22 15:07 seanamos

@Amier3, here are some timestamped logs throwing the error:

Jul 11 15:19:39 <client_name> consul[14732]: 2022-07-11T15:19:39.668+0100 [ERROR] agent.client: RPC failed to server: method=ConfigEntry.ResolveServiceConfig server=<server_ip>:8300 error="rpc error making call: i/o deadline reached"
Jul 11 15:21:33 <client_name> consul[14732]: 2022-07-11T15:21:33.908+0100 [ERROR] agent.client: RPC failed to server: method=Intention.Match server=<server_ip>:8300 error="rpc error making call: i/o deadline reached"
Jul 11 15:25:02 <client_name> consul[14732]: 2022-07-11T15:25:02.495+0100 [ERROR] agent.client: RPC failed to server: method=ConfigEntry.ResolveServiceConfig server=<server_ip>:8300 error="rpc error making call: i/o deadline reached"
Jul 11 15:25:29 <client_name> consul[14732]: 2022-07-11T15:25:29.947+0100 [ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=<server_ip>:8300 error="rpc error making call: i/o deadline reached"
Jul 11 15:29:49 <client_name> consul[14732]: 2022-07-11T15:29:49.990+0100 [ERROR] agent.client: RPC failed to server: method=ConfigEntry.ResolveServiceConfig server=<server_ip>:8300 error="rpc error making call: i/o deadline reached"
Jul 11 15:38:55 <client_name> consul[14732]: 2022-07-11T15:38:55.572+0100 [ERROR] agent.client: RPC failed to server: method=ConnectCA.Roots server=<server_ip>:8300 error="rpc error making call: i/o deadline reached"
Jul 11 15:42:13 <client_name> consul[14732]: 2022-07-11T15:42:13.375+0100 [ERROR] agent.client: RPC failed to server: method=Intention.Match server=<server_ip>:8300 error="rpc error making call: i/o deadline reached"
Jul 11 15:48:04 <client_name> consul[14732]: 2022-07-11T15:48:04.232+0100 [ERROR] agent.client: RPC failed to server: method=Intention.Match server=<server_ip>:8300 error="rpc error making call: i/o deadline reached"
Jul 11 15:49:14 <client_name> consul[14732]: 2022-07-11T15:49:14.865+0100 [ERROR] agent.client: RPC failed to server: method=ConnectCA.Roots server=<server_ip>:8300 error="rpc error making call: i/o deadline reached"
Jul 11 15:52:42 <client_name> consul[14732]: 2022-07-11T15:52:42.034+0100 [ERROR] agent.client: RPC failed to server: method=Intention.Match server=<server_ip>:8300 error="rpc error making call: i/o deadline reached"
Jul 11 15:55:42 <client_name> consul[14732]: 2022-07-11T15:55:42.234+0100 [ERROR] agent.client: RPC failed to server: method=ConfigEntry.ResolveServiceConfig server=<server_ip>:8300 error="rpc error making call: i/o deadline reached"

I did not find a pattern in the timestamps. The servers are not logging any errors, by the way; I only see this on the clients.

TRACE Logs

Jul 11 11:10:36 <client_name> consul[29915]: 2022-07-11T11:10:36.732+0100 [DEBUG] agent: Check status updated: check=_nomad-check-3fc90f7e9c14f2403475594a87ce237d20c74752 status=passing
Jul 11 11:10:41 <client_name> consul[29915]: 2022-07-11T11:10:41.951+0100 [DEBUG] agent.http: Request finished: method=GET url=/v1/status/leader from=[::1]:42162 latency=732.789µs
Jul 11 11:10:41 <client_name> consul[29915]: 2022-07-11T11:10:41.956+0100 [DEBUG] agent.http: Request finished: method=GET url=/v1/status/peers from=[::1]:42164 latency=523.774µs
Jul 11 11:10:42 <client_name> consul[29915]: 2022-07-11T11:10:42.045+0100 [DEBUG] agent.http: Request finished: method=GET url=/v1/agent/self from=127.0.0.1:33684 latency=917.242µs
Jul 11 11:10:42 <client_name> consul[29915]: 2022-07-11T11:10:42.572+0100 [ERROR] agent.client: RPC failed to server: method=ConnectCA.Roots server=<server_ip>:8300 error="rpc error making call: i/o deadline reached"
Jul 11 11:10:42 <client_name> consul[29915]: 2022-07-11T11:10:42.573+0100 [WARN] agent.cache: handling error in Cache.Notify: cache-type=connect-ca-root error="rpc error making call: i/o deadline reached" index=9
Jul 11 11:10:43 <client_name> consul[29915]: 2022-07-11T11:10:43.551+0100 [DEBUG] agent.client.memberlist.lan: memberlist: Initiating push/pull sync with: <client_name> <client_ip>:8301
Jul 11 11:10:46 <client_name> consul[29915]: 2022-07-11T11:10:46.735+0100 [DEBUG] agent: Check status updated: check=_nomad-check-3fc90f7e9c14f2403475594a87ce237d20c74752 status=passing

INFO Logs

Jul 11 07:30:02 <client_name> consul[14732]: 2022-07-11T07:30:02.546+0100 [ERROR] agent.client: RPC failed to server: method=Intention.Match server=<server_ip>:8300 error="rpc error making call: i/o deadline reached"
Jul 11 07:30:02 <client_name> consul[14732]: 2022-07-11T07:30:02.546+0100 [WARN] agent.cache: handling error in Cache.Notify: cache-type=intention-match error="rpc error making call: i/o deadline reached" index=9012818
Jul 11 07:36:34 <client_name> consul[14732]: 2022-07-11T07:36:34.748+0100 [ERROR] agent.client: RPC failed to server: method=ConfigEntry.ResolveServiceConfig server=<server_ip>:8300 error="rpc error making call: i/o deadline reached"
Jul 11 07:36:34 <client_name> consul[14732]: 2022-07-11T07:36:34.748+0100 [WARN] agent.cache: handling error in Cache.Notify: cache-type=resolved-service-config error="rpc error making call: i/o deadline reached" index=9012818
Jul 11 07:37:44 <client_name> consul[14732]: 2022-07-11T07:37:44.518+0100 [ERROR] agent.client: RPC failed to server: method=ConnectCA.Roots server=<server_ip>:8300 error="rpc error making call: i/o deadline reached"
Jul 11 07:37:44 <client_name> consul[14732]: 2022-07-11T07:37:44.518+0100 [WARN] agent.cache: handling error in Cache.Notify: cache-type=connect-ca-root error="rpc error making call: i/o deadline reached" index=9
Jul 11 07:37:44 <client_name> consul[14732]: 2022-07-11T07:37:44.518+0100 [WARN] agent.cache: handling error in Cache.Notify: cache-type=connect-ca-root error="rpc error making call: i/o deadline reached" index=9

Client main config:
{
  "node_name": "<node_name>",
  "addresses":{
    "http": "0.0.0.0",
    "https": "0.0.0.0",
    "dns": "0.0.0.0",
    "grpc": "0.0.0.0"
  },
  "data_dir": "/opt/consul/data",
  "log_level": "INFO",
  "datacenter": "dc1",
  "bind_addr": "<node_ip>",
  "advertise_addr": "<node_ip>",
  "ports": {
    "server": 8300,
    "serf_lan": 8301,
    "serf_wan": 8302,
    "http": 8500,
    "https": 8501,
    "grpc": 8502,
    "dns": 8600,
    "sidecar_min_port": 21000,
    "sidecar_max_port": 21255,
    "expose_min_port": 21500,
    "expose_max_port":  21755
  },
  "server": false,
  "retry_join": ["<server_dns>"],
  "rejoin_after_leave": true,
  "connect": {
    "enabled": true
  },
  "encrypt": "<some_key>",
  "tls": {
    "defaults": {
      "ca_file": "/opt/consul/certs/consul-ca-cert.pem",
      "cert_file": "/opt/consul/certs/consul-server-cert.pem",
      "key_file": "/opt/consul/certs/consul-server-key.pem",
      "verify_incoming": true,
      "verify_outgoing": true
    },
    "https": {
      "verify_incoming": true
    },
    "internal_rpc": {
      "verify_server_hostname": true,
      "verify_incoming": true
    },
    "grpc": {
      "verify_incoming": false  //https://github.com/hashicorp/consul/issues/13088#
    }
  },
  "auto_encrypt": {
    "tls": true
  },
  "limits": {
    "http_max_conns_per_client": 400,
    "https_handshake_timeout": "15s",
    "rpc_max_conns_per_client": 400,
    "rpc_handshake_timeout": "15s"
  }
}
Client ACL Config
{
  "acl": {
    "enabled": true,
    "tokens": {
      "agent": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
    },
    "default_policy": "deny",
    "down_policy": "extend-cache",
    "enable_token_persistence": true
  }
}

Thanks

ngcmac avatar Jul 11 '22 16:07 ngcmac

We ran into this in our test environments today and had to roll back to v1.11 pretty quickly. We mostly noticed the errors around the service health and discovery chain RPCs for our Connect-integrated services. I agree they mostly seem to come from the caching layer, with messages such as:

agent.client: RPC failed to server: method=DiscoveryChain.Get server=xx:8300 error="rpc error making call: i/o deadline reached"
agent.cache: handling error in Cache.Notify: cache-type=compiled-discovery-chain error="rpc error making call: i/o deadline reached" index=37080582

agent.cache: handling error in Cache.Notify: cache-type=health-services error="rpc error making call: i/o deadline reached" index=37072324
agent.client: RPC failed to server: method=Health.ServiceNodes server=xxx:8300 error="rpc error making call: i/o deadline reached"

edit: I think the volume of each distinct error also backs up the idea of the 10-minute timeout.

chrisboulton avatar Jul 26 '22 03:07 chrisboulton

I'm still getting this on 1.12.3; it goes away after a restart.

MagicRB avatar Jul 29 '22 19:07 MagicRB

I also see these errors on version 1.13.1 (Mesh/Connect in AWS EKS with external servers):

2022-08-12T20:59:07.862Z [ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=10.150.14.197:8300 error="rpc error making call: i/o deadline reached"
2022-08-12T20:59:07.862Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=compiled-discovery-chain error="rpc error making call: i/o deadline reached" index=589824
2022-08-12T20:59:08.203Z [ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=10.150.14.197:8300 error="rpc error making call: i/o deadline reached"
2022-08-12T20:59:08.203Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=compiled-discovery-chain error="rpc error making call: i/o deadline reached" index=589824
2022-08-12T20:59:09.818Z [ERROR] agent.client: RPC failed to server: method=Internal.PeeredUpstreams server=10.150.14.197:8300 error="rpc error making call: i/o deadline reached"
2022-08-12T20:59:09.818Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=peered-upstreams error="rpc error making call: i/o deadline reached" index=1
2022-08-12T20:59:10.932Z [ERROR] agent.client: RPC failed to server: method=DiscoveryChain.Get server=10.150.14.197:8300 error="rpc error making call: i/o deadline reached"
2022-08-12T20:59:10.933Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=compiled-discovery-chain error="rpc error making call: i/o deadline reached" index=589824

alt-dima avatar Aug 12 '22 21:08 alt-dima

We're also seeing this issue on 1.13.1 with the Catalog.NodeServices method, again in waves every 10 minutes.

jameshartig avatar Aug 16 '22 19:08 jameshartig

I believe this was caused by #11500. When the stream's FirstReadTimeout is less than the server's blockingQuery timeout, the stream is closed with an i/o deadline exceeded error. This can happen because there is jitter on both sides: if the client-side jitter is randomly less than the server's, the client will time out before the server responds. Ideally the client's read timeout should be set based on the maximum possible jitter. @wjordan does that seem plausible?

// Timeout after maximum jitter has elapsed.
q.MaxQueryTime += lib.RandomStagger(q.MaxQueryTime / JitterFraction)

It has the right comment above it, but I presume it shouldn't be using RandomStagger and should instead just be:

q.MaxQueryTime += q.MaxQueryTime / JitterFraction
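
To make the race concrete, here is a small standalone sketch; lib.RandomStagger is reimplemented locally, and the jitterFraction value of 16 is my assumption of what structs.JitterFraction is set to, so treat this as illustrative rather than Consul's actual code:

package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitterFraction of 16 is an assumed stand-in for structs.JitterFraction.
const jitterFraction = 16

// randomStagger mimics lib.RandomStagger: a random duration in [0, interval).
func randomStagger(interval time.Duration) time.Duration {
	return time.Duration(rand.Int63n(int64(interval)))
}

func main() {
	maxQueryTime := 10 * time.Minute

	// Today both sides add independent random jitter, so the client's read
	// deadline can end up shorter than the server's block time.
	races, trials := 0, 10000
	for i := 0; i < trials; i++ {
		clientDeadline := maxQueryTime + randomStagger(maxQueryTime/jitterFraction)
		serverBlockTime := maxQueryTime + randomStagger(maxQueryTime/jitterFraction)
		if clientDeadline < serverBlockTime {
			races++
		}
	}
	fmt.Printf("client deadline shorter than server block time in %d of %d trials\n", races, trials)

	// Proposed fix: pad the client deadline by the *maximum* jitter instead,
	// so it can never be shorter than any jittered server block time.
	fixedDeadline := maxQueryTime + maxQueryTime/jitterFraction
	fmt.Println("fixed deadline always covers the server:",
		fixedDeadline >= maxQueryTime+randomStagger(maxQueryTime/jitterFraction))
}

Running this shows the client losing the race in roughly half the trials, which would explain why only a subset of the blocking calls log the error.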

jameshartig avatar Aug 16 '22 21:08 jameshartig

Could very well be that; I got this error between agents in two different geographic locations with an unstable link.

MagicRB avatar Aug 17 '22 10:08 MagicRB

I verified the above fix in our dev environment. We have seen no RPC failed to server errors in the 12 hours since deploying that fix.

jameshartig avatar Aug 17 '22 13:08 jameshartig

Thank you @jameshartig for the investigation and quick fix 🕵️ It will be backported to the next patch releases, 1.12.5 and 1.13.2.

kisunji avatar Aug 17 '22 14:08 kisunji

I own the code for a Go client that uses the HashiCorp Consul Go API. We are currently encountering this error a lot. I see that a fix will be out soon. Is there a change that I could make on the client side to avoid this error? If so, which timeout value would I need to change, and what should I change it to? We are using Consul 1.12.2 servers and the 1.12.0 Consul client API. Thanks.

jofowler avatar Sep 20 '22 14:09 jofowler

Could very well be that; I got this error between agents in two different geographic locations with an unstable link.

Could this occur on agents that have a stable link, but across a long distance? I have a dev cluster on 1.13.2, which supposedly has the fix, but I still get i/o deadline reached when interacting with the Consul dashboard/API. Most of our agents/leaders are in Chicago, but we have a few others that live off in London/Germany.

I'm using a raft_multiplier of 1. I wonder if I need to increase that.

havedill avatar Nov 11 '22 14:11 havedill

Could very well be that; I got this error between agents in two different geographic locations with an unstable link.

Could this occur on agents that have a stable link, but across a long distance? I have a dev cluster on 1.13.2, which supposedly has the fix, but I still get i/o deadline reached when interacting with the Consul dashboard/API. Most of our agents/leaders are in Chicago, but we have a few others that live off in London/Germany.

I'm using a raft_multiplier of 1. I wonder if I need to increase that.

Are both the clients and the servers on 1.13.2? RPC timeouts can technically still happen in large enough clusters if the servers exceed 7 seconds while building a response.

A similar issue was raised in https://github.com/hashicorp/consul/issues/14732. In 1.13.3 we introduced rpc_client_timeout (docs), which defaults to 60s (instead of the 7s rpc_hold_timeout) and can be tuned without impacting Raft stability.
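
For anyone who needs more headroom on 1.13.3+, the limit can be raised in the agent config alongside the other limits shown earlier in this thread; the 120s value below is only an illustrative override of the 60s default:

{
  "limits": {
    "rpc_client_timeout": "120s"
  }
}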

kisunji avatar Nov 14 '22 14:11 kisunji