cloud-sql-proxy icon indicating copy to clipboard operation
cloud-sql-proxy copied to clipboard

Connection test occasionally causes TLS handshake failed errors

Open mvhatch opened this issue 1 year ago • 15 comments

Bug Description

gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.8.0 and gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.8.1 takes 30s to establish a connection when using --run-connection-test. This delay intermittently introduces proxy connection errors.

Stacktrace

Version v2.8.1


2024/01/08 23:07:43 Authorizing with OAuth2 token
2024/01/08 23:07:44 [<project>:<region>:<instance>] Listening on [::]:5432
2024/01/08 23:07:44 The proxy has started successfully and is ready for new connections!
2024/01/08 23:07:44 Connection test started
2024/01/08 23:08:14 Connection test passed
2024/01/08 23:08:14 [<project>:<region>:<instance>] accepted connection from 192.168.192.3:35240
2024/01/08 23:08:44 [<project>:<region>:<instance>] failed to connect to instance: Dial error: handshake failed (connection name = "<project>:<region>:<instance>"): context deadline exceeded
pg_dump: error: connection to database "<databse>" failed: connection to server at "pg_dump--proxy" (192.168.192.2), port 5432 failed: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

Version v2.7.2

2024/01/09 19:24:00 Authorizing with OAuth2 token
2024/01/09 19:24:00 [<project>:<region>:<instance>] Listening on [::]:5432
2024/01/09 19:24:00 The proxy has started successfully and is ready for new connections!
2024/01/09 19:24:00 Connection test started
2024/01/09 19:24:00 Connection test passed
2024/01/09 19:24:01 [<project>:<region:<instance>] accepted connection from 172.23.0.3:60642

Steps to reproduce?

Start a proxy container pointing to an instance, run the connection test as part of startup

  1. docker run --rm gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.8.0 <project>:<region>:<instance> --run-connection-test

Environment

  1. OS type and version:
uname -a
Linux 92671f21852e 6.4.16-linuxkit #1 SMP PREEMPT Wed Oct 25 16:32:24 UTC 2023 x86_64 GNU/Linux 
  1. Cloud SQL Proxy version (./cloud-sql-proxy --version): 2.8.1+container
  2. Proxy invocation command (for example, ./cloud-sql-proxy --port 5432 INSTANCE_CONNECTION_NAME):

Additional Details

No response

mvhatch avatar Jan 09 '24 20:01 mvhatch

How are you connecting to the Proxy in the container?

enocom avatar Jan 09 '24 21:01 enocom

FWIW I do this:

docker run --rm -p 5432:5432 gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.8.1 \
  --address 0.0.0.0 <INSTANCE_CONNECTION_NAME> \
  --run-connection-test --token $(gcloud auth print-access-token)

And then connect with psql from my host machine. This works without issue.

enocom avatar Jan 09 '24 21:01 enocom

@enocom - I have no issue connecting my host machine to the proxy; I have an intermittent issue w/ the proxy connecting to cloud sql. However, I have no issue with the proxy connecting to cloud sql when using v2.7.2. The only thing I noticed was a timeout error in v2.8.1.

mvhatch avatar Jan 09 '24 21:01 mvhatch

How often do you see this timeout error? Also I assume you're connecting to public IP?

enocom avatar Jan 09 '24 22:01 enocom

How often do you see this timeout error? Also I assume you're connecting to public IP?

The timeout error has been happening ~50% of the time when running a workload that uses v2.8.1. It is connecting to the public IP.

mvhatch avatar Jan 09 '24 22:01 mvhatch

Thanks, @mvhatch. In that case, let me hammer this some more to see if I can force out the same problem. Have you tried reproducing outside of docker?

enocom avatar Jan 09 '24 22:01 enocom

I haven't reproduced this yet, but realize I've seen this same problem when people use the readiness probe.

The key error is:

failed to connect to instance: Dial error: 
  handshake failed (connection name = "<project>:<region>:<instance>"): 
  context deadline exceeded

enocom avatar Jan 10 '24 17:01 enocom

Idk if this is helpful, but I've also seen a timeout error in addition to the context deadline exceeded error:

2024/01/08 22:37:09 Authorizing with OAuth2 token
2024/01/08 22:37:09 <project>:<region>:<instance> Listening on [::]:5432
2024/01/08 22:37:09 The proxy has started successfully and is ready for new connections!
2024/01/08 22:37:09 Connection test started
2024/01/08 22:37:39 Connection test passed
2024/01/08 22:37:39 <project>:<region>:<instance> accepted connection from 192.168.96.3:38646
2024/01/08 22:38:09 <project>:<region>:<instance> failed to connect to instance: Dial error: failed to dial (connection name = "<project>:<region>:<instance>"): dial tcp 35.238.253.130: i/o timeout
pg_dump: error: connection to database "<db>" failed: connection to server at "pg_dump--proxy" (192.168.96.2), port 5432 failed: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

mvhatch avatar Jan 10 '24 17:01 mvhatch

perhaps it is a red herring, but do you know the reason the connection tests now take 30s in version v2.8.x?

mvhatch avatar Jan 10 '24 17:01 mvhatch

Yes that is helpful. It implies the remote end of the socket isn't responding, whether for the initial connection or the TLS handshake.

enocom avatar Jan 10 '24 17:01 enocom

perhaps it is a red herring, but do you know the reason the connection tests now take 30s in version v2.8.x?

That's still unclear and entirely unexpected.

enocom avatar Jan 10 '24 17:01 enocom

FWIW I'd expect the connection test to be only used when you're uncertain if the network path is open. Otherwise, I wouldn't use it regularly. In the meantime, we'll sort out this bug.

enocom avatar Jan 10 '24 20:01 enocom

I have a similar issue when running in dataflow. I am connecting using private ip. It's not very often, something like ~5%, but it causes my dataflow job to fail after spending an hour hung up on this.

curl -o /bin/cloud_sql_proxy https://storage.googleapis.com/cloud-sql-connectors/cloud-sql-proxy/v2.9.0/cloud-sql-proxy.linux.amd64

/bin/cloud_sql_proxy --private-ip -u /cloudsql ${CLOUD_SQL_PROXY_INSTANCES} &

2024/04/21 01:34:43 [$DB-STRING] could not resolve instance version: failed to get instance: Refresh error: failed to get instance metadata (connection name = "$DB-STRING"): Get "https://sqladmin.googleapis.com/sql/v1beta4/projects/$PROJECT/instances/$DB/connectSettings?alt=json&prettyPrint=false": net/http: TLS handshake timeout

2024/04/21 01:34:43 The proxy has encountered a terminal error: unable to start: [$DB-STRING] Unable to mount socket: failed to get instance: Refresh error: failed to get instance metadata (connection name = "$DB-STRING"): Get "https://sqladmin.googleapis.com/sql/v1beta4/projects/$PROJECT/instances/$DB/connectSettings?alt=json&prettyPrint=false": net/http: TLS handshake timeout

akshetpandey avatar Apr 22 '24 17:04 akshetpandey

@akshetpandey Are you using the --run-connection-test flag? This is for what seems to be a race condition when that flag is enabled.

If you're getting TLS errors otherwise, feel free to open a new issue. At a glance, I'd be curious to know what kind of CPU usage you see when that TLS handshake timeout occurs. Sometimes that can be a symptom of the Proxy being resource starved.

enocom avatar Apr 25 '24 22:04 enocom

It isn't related to that flag, although I should use it to work around the infinite loop I am facing because of this! Will create a new issue

akshetpandey avatar May 01 '24 21:05 akshetpandey

I think this is a duplicate of https://github.com/GoogleCloudPlatform/cloud-sql-proxy/issues/2224 and https://github.com/GoogleCloudPlatform/cloud-sql-proxy/issues/2212, the first of the two being the more relevant description.

We're going to cut a release next week with a fix. If I'm mistaken and this is still happening after the release next week, feel free to re-open. You can try this now by building against the latest cloudsqlconn (https://github.com/GoogleCloudPlatform/cloud-sql-go-connector/releases/tag/v1.10.1).

enocom avatar May 24 '24 04:05 enocom