cloud-sql-proxy
cloud-sql-proxy copied to clipboard
Connection test occasionally causes TLS handshake failed errors
Bug Description
gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.8.0
and gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.8.1
takes 30s to establish a connection when using --run-connection-test
. This delay intermittently introduces proxy connection errors.
Stacktrace
Version v2.8.1
2024/01/08 23:07:43 Authorizing with OAuth2 token
2024/01/08 23:07:44 [<project>:<region>:<instance>] Listening on [::]:5432
2024/01/08 23:07:44 The proxy has started successfully and is ready for new connections!
2024/01/08 23:07:44 Connection test started
2024/01/08 23:08:14 Connection test passed
2024/01/08 23:08:14 [<project>:<region>:<instance>] accepted connection from 192.168.192.3:35240
2024/01/08 23:08:44 [<project>:<region>:<instance>] failed to connect to instance: Dial error: handshake failed (connection name = "<project>:<region>:<instance>"): context deadline exceeded
pg_dump: error: connection to database "<databse>" failed: connection to server at "pg_dump--proxy" (192.168.192.2), port 5432 failed: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
Version v2.7.2
2024/01/09 19:24:00 Authorizing with OAuth2 token
2024/01/09 19:24:00 [<project>:<region>:<instance>] Listening on [::]:5432
2024/01/09 19:24:00 The proxy has started successfully and is ready for new connections!
2024/01/09 19:24:00 Connection test started
2024/01/09 19:24:00 Connection test passed
2024/01/09 19:24:01 [<project>:<region:<instance>] accepted connection from 172.23.0.3:60642
Steps to reproduce?
Start a proxy container pointing to an instance, run the connection test as part of startup
-
docker run --rm gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.8.0 <project>:<region>:<instance> --run-connection-test
Environment
- OS type and version:
uname -a
Linux 92671f21852e 6.4.16-linuxkit #1 SMP PREEMPT Wed Oct 25 16:32:24 UTC 2023 x86_64 GNU/Linux
- Cloud SQL Proxy version (
./cloud-sql-proxy --version
):2.8.1+container
- Proxy invocation command (for example,
./cloud-sql-proxy --port 5432 INSTANCE_CONNECTION_NAME
):
Additional Details
No response
How are you connecting to the Proxy in the container?
FWIW I do this:
docker run --rm -p 5432:5432 gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.8.1 \
--address 0.0.0.0 <INSTANCE_CONNECTION_NAME> \
--run-connection-test --token $(gcloud auth print-access-token)
And then connect with psql from my host machine. This works without issue.
@enocom - I have no issue connecting my host machine to the proxy; I have an intermittent issue w/ the proxy connecting to cloud sql. However, I have no issue with the proxy connecting to cloud sql when using v2.7.2. The only thing I noticed was a timeout error in v2.8.1.
How often do you see this timeout error? Also I assume you're connecting to public IP?
How often do you see this timeout error? Also I assume you're connecting to public IP?
The timeout error has been happening ~50% of the time when running a workload that uses v2.8.1. It is connecting to the public IP.
Thanks, @mvhatch. In that case, let me hammer this some more to see if I can force out the same problem. Have you tried reproducing outside of docker?
I haven't reproduced this yet, but realize I've seen this same problem when people use the readiness probe.
The key error is:
failed to connect to instance: Dial error:
handshake failed (connection name = "<project>:<region>:<instance>"):
context deadline exceeded
Idk if this is helpful, but I've also seen a timeout error in addition to the context deadline exceeded error:
2024/01/08 22:37:09 Authorizing with OAuth2 token
2024/01/08 22:37:09 <project>:<region>:<instance> Listening on [::]:5432
2024/01/08 22:37:09 The proxy has started successfully and is ready for new connections!
2024/01/08 22:37:09 Connection test started
2024/01/08 22:37:39 Connection test passed
2024/01/08 22:37:39 <project>:<region>:<instance> accepted connection from 192.168.96.3:38646
2024/01/08 22:38:09 <project>:<region>:<instance> failed to connect to instance: Dial error: failed to dial (connection name = "<project>:<region>:<instance>"): dial tcp 35.238.253.130: i/o timeout
pg_dump: error: connection to database "<db>" failed: connection to server at "pg_dump--proxy" (192.168.96.2), port 5432 failed: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
perhaps it is a red herring, but do you know the reason the connection tests now take 30s in version v2.8.x
?
Yes that is helpful. It implies the remote end of the socket isn't responding, whether for the initial connection or the TLS handshake.
perhaps it is a red herring, but do you know the reason the connection tests now take 30s in version v2.8.x?
That's still unclear and entirely unexpected.
FWIW I'd expect the connection test to be only used when you're uncertain if the network path is open. Otherwise, I wouldn't use it regularly. In the meantime, we'll sort out this bug.
I have a similar issue when running in dataflow. I am connecting using private ip. It's not very often, something like ~5%, but it causes my dataflow job to fail after spending an hour hung up on this.
curl -o /bin/cloud_sql_proxy https://storage.googleapis.com/cloud-sql-connectors/cloud-sql-proxy/v2.9.0/cloud-sql-proxy.linux.amd64
/bin/cloud_sql_proxy --private-ip -u /cloudsql ${CLOUD_SQL_PROXY_INSTANCES} &
2024/04/21 01:34:43 [$DB-STRING] could not resolve instance version: failed to get instance: Refresh error: failed to get instance metadata (connection name = "$DB-STRING"): Get "https://sqladmin.googleapis.com/sql/v1beta4/projects/$PROJECT/instances/$DB/connectSettings?alt=json&prettyPrint=false": net/http: TLS handshake timeout
2024/04/21 01:34:43 The proxy has encountered a terminal error: unable to start: [$DB-STRING] Unable to mount socket: failed to get instance: Refresh error: failed to get instance metadata (connection name = "$DB-STRING"): Get "https://sqladmin.googleapis.com/sql/v1beta4/projects/$PROJECT/instances/$DB/connectSettings?alt=json&prettyPrint=false": net/http: TLS handshake timeout
@akshetpandey Are you using the --run-connection-test
flag? This is for what seems to be a race condition when that flag is enabled.
If you're getting TLS errors otherwise, feel free to open a new issue. At a glance, I'd be curious to know what kind of CPU usage you see when that TLS handshake timeout occurs. Sometimes that can be a symptom of the Proxy being resource starved.
It isn't related to that flag, although I should use it to work around the infinite loop I am facing because of this! Will create a new issue
I think this is a duplicate of https://github.com/GoogleCloudPlatform/cloud-sql-proxy/issues/2224 and https://github.com/GoogleCloudPlatform/cloud-sql-proxy/issues/2212, the first of the two being the more relevant description.
We're going to cut a release next week with a fix. If I'm mistaken and this is still happening after the release next week, feel free to re-open. You can try this now by building against the latest cloudsqlconn (https://github.com/GoogleCloudPlatform/cloud-sql-go-connector/releases/tag/v1.10.1).