cloud-sql-proxy
Intermittent "connection aborted - error reading from instance" errors with auth proxy as a sidecar on Cloud Run
Bug Description
I have a Cloud Run service running with a Cloud SQL Auth Proxy sidecar to connect to a set of Cloud SQL instances (currently 5 of them). Several instances of the service can coexist at any given time. Sometimes, and with increasing frequency (it used to be once a month or so, and recently it has been several times a week), all the connections to Cloud SQL in one instance of the service error out with the following error logs:
'[project_id:europe-west1:instance_2] connection aborted - error reading from instance: read tcp 169.254.8.1:60699->{instance_ip}:3307: read: connection reset by peer
'[project_id:europe-west1:instance_2] IO Error on Read or Write: read tcp 169.254.8.1:60699->{instance_ip}:3307: read: connection reset by peer
It always happens on all connected Cloud SQL instances at the same time, for one given instance of the proxy. As far as we have been able to observe, there is no correlation between this issue occurring and high load on the Cloud Run service or on the databases it connects to.
Example code (or command)
Intermittent error that does not seem related to any particular lines of code (see below for proxy options).
Stacktrace
'[project_id:europe-west1:instance_2] connection aborted - error reading from instance: read tcp 169.254.8.1:60699->{instance_ip}:3307: read: connection reset by peer
'[project_id:europe-west1:instance_2] IO Error on Read or Write: read tcp 169.254.8.1:60699->{instance_ip}:3307: read: connection reset by peer
'[project_id:europe-west1:instance_4] connection aborted - error reading from instance: read tcp 169.254.8.1:58109->{instance_ip}:3307: read: connection reset by peer
'[project_id:europe-west1:instance_4] IO Error on Read or Write: read tcp 169.254.8.1:58109->{instance_ip}:3307: read: connection reset by peer
'[project_id:europe-west1:instance_3] connection aborted - error reading from instance: read tcp 169.254.8.1:33878->{instance_ip}:3307: read: connection reset by peer
'[project_id:europe-west1:instance_3] IO Error on Read or Write: read tcp 169.254.8.1:33878->{instance_ip}:3307: read: connection reset by peer
'[project_id:europe-west1:instance_1] connection aborted - error reading from instance: read tcp 169.254.8.1:44952->{instance_ip}:3307: read: connection reset by peer
'[project_id:europe-west1:instance_1] IO Error on Read or Write: read tcp 169.254.8.1:44952->{instance_ip}:3307: read: connection reset by peer
'[project_id:europe-west1:instance_4] connection aborted - error reading from instance: read tcp 169.254.8.1:29766->{instance_ip}:3307: read: connection reset by peer
'[project_id:europe-west1:instance_3] connection aborted - error reading from instance: read tcp 169.254.8.1:60901->{instance_ip}:3307: read: connection reset by peer
'[project_id:europe-west1:instance_1] connection aborted - error reading from instance: read tcp 169.254.8.1:53263->{instance_ip}:3307: read: connection reset by peer
'[project_id:europe-west1:instance_4] IO Error on Read or Write: read tcp 169.254.8.1:29766->{instance_ip}:3307: read: connection reset by peer
'[project_id:europe-west1:instance_3] IO Error on Read or Write: read tcp 169.254.8.1:60901->{instance_ip}:3307: read: connection reset by peer
'[project_id:europe-west1:instance_1] IO Error on Read or Write: read tcp 169.254.8.1:53263->{instance_ip}:3307: read: connection reset by peer
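The observation that all connected instances fail together could be checked mechanically from logs like the ones above. The following is a minimal sketch, not part of the original report. It assumes the message text has already been extracted from the structured (JSON) log entries, and the regular expression is inferred only from the sample lines shown here. The names summarize_resets and all_instances_affected are hypothetical helpers.

```python
import re
from collections import defaultdict

# Pattern inferred from the sample log lines in this report only; other
# proxy versions or log formats may differ (assumption).
RESET_RE = re.compile(
    r"\[(?P<instance>[^\]]+)\] "
    r"(?:connection aborted - error reading from instance|IO Error on Read or Write): "
    r"read tcp (?P<local>[\d.]+:\d+)->(?P<remote>\S+?): "
    r"read: connection reset by peer"
)


def summarize_resets(messages):
    """Map each instance name to the sorted local ports that saw a reset."""
    ports = defaultdict(set)
    for message in messages:
        match = RESET_RE.search(message)
        if match is None:
            continue  # not a reset line; ignore it
        local_port = int(match.group("local").rsplit(":", 1)[1])
        ports[match.group("instance")].add(local_port)
    return {instance: sorted(found) for instance, found in ports.items()}


def all_instances_affected(messages, expected_instances):
    """Return True if every expected instance has at least one reset line."""
    return set(expected_instances) <= set(summarize_resets(messages))
```

Feeding it the messages logged by one proxy instance within a short time window, together with the list of configured instance names, would show whether every Cloud SQL instance was hit and how many distinct connections (local ports) were reset per instance.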
Steps to reproduce?
I don't really trigger the bug; it just happens sometimes. The frequency seems to have been increasing recently.
Environment
- OS type and version: Docker container on Cloud Run
- The sidecar container had 500m vCPU allocated (half a vCPU) until now. I changed it to 1 full vCPU today and am waiting to see if the issue occurs again.
- Cloud SQL Proxy version: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.16.0
- Proxy invocation command:
args = [
  "--unix-socket=/cloudsql",
  "--structured-logs",
  "--health-check",
  "--http-address=0.0.0.0",
  "--max-sigterm-delay=10s", // wait 10sec max before closing all connections when the container receives SIGTERM. Should be longer than the condition applied in the client code, if any.
  "--debug-logs",
  "--lazy-refresh",
]
"--lazy-refresh" was added recently to see if it fixes the issue, to no avail.
Additional Details
No response