cloud-sql-proxy
Tolerate bad instances when proxying for many instances (unix sockets)
Feature Description
We run various services (e.g. pgadmin) that use cloud sql proxy as a sidecar to enable IAM-based auth. We have ~100 cloud sql instances, and each of our services (e.g. pgadmin) supports connecting to any of these. Some instances may be connected to rarely, others more frequently.
Whenever cloud sql proxy starts (or restarts), if any of the instances on the command line has been deleted (or is misconfigured, or is denied by IAM), the proxy exits with an error. The service (e.g. pgadmin) then cannot connect to any of the instances until the issue is addressed. This is particularly problematic when the restart is automatic (many of these services run in GKE/Cloud Run).
Is it possible to better support this case and not let an issue with one instance prevent proxying for all the others? We are willing to contribute a PR if there is a simple solution your team would accept. Or perhaps there is a feature I've missed that already solves this.
Would it make sense to add a command-line flag that disregards bad instances (i.e. just log and continue)?
Alternatives Considered
We have considered a few alternatives:
- wrap cloud sql proxy in a process that determines and excludes bad instances before starting the proxy (rather clunky, and it would require ongoing updates to pick up cloud sql proxy changes, assuming we build a new Docker image that includes both; a rough sketch follows this list)
- run a cloud sql proxy per instance (or per ~10 instances); this is clunky, wasteful, and only helps somewhat
- startup probes: these help somewhat but don't cover many cases (e.g. automatic restarts)
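For illustration, here is a minimal sketch of that wrapper idea, assuming gcloud is available in the image and treating a successful "gcloud sql instances describe" as "instance is usable". The script and its heuristics are hypothetical, our own scripting rather than anything the proxy provides:

#!/bin/sh
# Hypothetical wrapper: probe each connection name (project:region:instance)
# and drop any instance whose metadata can't be fetched, then exec the
# proxy with whatever survived. If everything is filtered out, the proxy
# will still fail to start.
GOOD=""
for conn in "$@"; do
  project=${conn%%:*}   # first component: project
  name=${conn##*:}      # last component: instance name
  if gcloud sql instances describe "$name" --project "$project" >/dev/null 2>&1; then
    GOOD="$GOOD $conn"
  else
    echo "skipping bad instance: $conn" >&2
  fi
done
exec ./cloud-sql-proxy --unix-socket /cloudsql-iam --auto-iam-authn $GOOD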
How are you starting the Proxy?
If I understand what you're proposing correctly, this is already supported as long as you specify a port numbering scheme, either with --port <PORT> or ./cloud-sql-proxy '<INSTANCE_CONNECTION_NAME>?port=<PORT>'.
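For example, with sequential assignment (the first instance gets the given port and each subsequent instance the next one up; ports and instance names here are illustrative):

./cloud-sql-proxy --port 5000 project1:us-central1:instance1 project1:us-central1:instance2
# instance1 listens on 127.0.0.1:5000, instance2 on 127.0.0.1:5001

or with an explicit port per instance:

./cloud-sql-proxy 'project1:us-central1:instance1?port=5000' 'project1:us-central1:instance2?port=6000'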
./cloud-sql-proxy --unix-socket /cloudsql-iam --auto-iam-authn project1:us-central1:instance1 project1:us-central1:instance2
If one of the instances is not reachable, this quickly exits with:
2025/05/27 17:47:22 The proxy has encountered a terminal error: unable to start: [...] Unable to mount socket: failed to get instance: refresh error: failed to get instance metadata (connection name = "..."): googleapi: Error 404: The Cloud SQL instance does not exist., instanceDoesNotExist
I checked with TCP and found that it doesn't have this behavior (i.e. TCP tolerates bad instances). So perhaps we don't need a flag, just need to bring Unix sockets in line with TCP? Although that would be a bit of a departure from previous Unix socket behavior.
I'm using cloud-sql-proxy version 2.16.0+linux.amd64, and it also reproduces at HEAD.
This was actually a flag in v1 Proxy (skip_failed_instance_config) that we purposely avoided porting to v2, thinking it was unneeded. See https://github.com/GoogleCloudPlatform/cloud-sql-proxy/blob/main/migration-guide.md#flag-changes.
But it might be worth bringing this back after all. We have another big user of v1 Proxy who needs the same behavior. In any case, that's my 2 cents. @hessjcg wdyt?
@enocom any further thoughts / updates here? :)
@hessjcg and @kgala2 are the owners of the Proxy these days. So I'll leave it to them to respond.
I think given that skipping invalid instances only works with TCP listeners, and people still like using Unix domain sockets, it's probably a good idea to port the old flag (renaming it to skip-failed-instance-config to match standard flag format).
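With that flag ported, the Unix socket invocation above would presumably look something like this (hypothetical until the flag actually ships; name and exact behavior subject to review):

./cloud-sql-proxy --skip-failed-instance-config --unix-socket /cloudsql-iam --auto-iam-authn project1:us-central1:instance1 project1:us-central1:instance2

A bad instance would then be logged and skipped rather than terminating the proxy.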
Hi @kenkania,
Thanks for the PR (https://github.com/GoogleCloudPlatform/cloud-sql-proxy/pull/2452). We will review and test the changes and, once approved, include it in next month's release!
Great, let me know if there's anything else I can do to ensure it makes it into the next release.
Thanks, @kenkania! We've reviewed and merged the PR, and you'll see the changes in July's release. This update will definitely provide our customers with a lot more flexibility.