Tolerate bad instances when proxying for many instances (unix sockets)

Open kenkania opened this issue 5 months ago • 7 comments

Feature Description

We run various services (e.g. pgadmin) that use cloud sql proxy as a sidecar to enable IAM-based auth. We have ~100 cloud sql instances, and each of our services (e.g. pgadmin) supports connecting to any of these. Some instances may be connected to rarely, others more frequently.

Whenever cloud sql proxy starts (or restarts), if any of the instances on the command line have been deleted (or are misconfigured, or are denied by IAM), the proxy exits with an error. The service (e.g. pgadmin) is then unable to connect to any of the instances until the issue is addressed. This is particularly problematic if the restart was automatic (many of these services run in GKE/Cloud Run).

Is it possible to better support this case and not let an issue with an instance prevent proxying all the others? We are willing to contribute a PR, if there is a simple solution that your team would accept. Or perhaps there is some feature I've missed that could solve this already.

Would it make sense to add a command-line flag to allow disregarding bad instances (i.e. just log and continue)?

Alternatives Considered

We have considered a few alternatives:

  • wrap cloud sql proxy in a process that would determine and exclude bad connections before starting the proxy (this is rather clunky and would require updates to pick up cloud sql changes, assuming we create a new docker image that includes both)
  • run a cloud sql proxy per instance (or per ~10 instances); this is clunky, wasteful, and only helps somewhat
  • startup probes: this helps somewhat but doesn't solve many cases (e.g. automatic restarts)

Additional Details

No response

kenkania avatar May 27 '25 18:05 kenkania

How are you starting the Proxy?

If I understand what you're proposing correctly, this is already supported as long as you specify a port numbering scheme, either with --port <PORT> or ./cloud-sql-proxy '<INSTANCE_CONNECTION_NAME>?port=<PORT>'.

enocom avatar May 27 '25 21:05 enocom

> How are you starting the Proxy?
>
> If I understand what you're proposing correctly, this is already supported as long as you specify a port numbering scheme, either with --port <PORT> or ./cloud-sql-proxy '<INSTANCE_CONNECTION_NAME>?port=<PORT>'.

./cloud-sql-proxy --unix-socket /cloudsql-iam --auto-iam-authn project1:us-central1:instance1 project1:us-central1:instance2

If one of the instances is not reachable, this quickly exits with:

2025/05/27 17:47:22 The proxy has encountered a terminal error: unable to start: [...] Unable to mount socket: failed to get instance: refresh error: failed to get instance metadata (connection name = "..."): googleapi: Error 404: The Cloud SQL instance does not exist., instanceDoesNotExist

I checked with TCP and discovered that it doesn't have this behavior (i.e. TCP tolerates bad instances). I guess we don't need a flag, we just need to bring Unix sockets in line with TCP? Although that would be a bit of a departure from previous Unix socket behavior.

I'm using cloud-sql-proxy version 2.16.0+linux.amd64, and it also reproduces at HEAD.

kenkania avatar May 27 '25 22:05 kenkania

This was actually a flag in v1 Proxy (skip_failed_instance_config) that we purposely avoided porting to v2, thinking that it was unneeded. See https://github.com/GoogleCloudPlatform/cloud-sql-proxy/blob/main/migration-guide.md#flag-changes.

But it might be worth bringing this back after all. We have another big user of v1 Proxy who needs the same behavior. In any case, that's my 2 cents. @hessjcg wdyt?

enocom avatar May 27 '25 22:05 enocom

@enocom any further thoughts / updates here? :)

kenkania avatar Jun 02 '25 14:06 kenkania

@hessjcg and @kgala2 are the owners of the Proxy these days. So I'll leave it to them to respond.

I think given that skipping invalid instances only works with TCP listeners, and people still like using Unix domain sockets, it's probably a good idea to port the old flag (renaming it to skip-failed-instance-config to match standard flag format).

enocom avatar Jun 02 '25 15:06 enocom

Hi @kenkania

Thanks for the PR - https://github.com/GoogleCloudPlatform/cloud-sql-proxy/pull/2452, we will review and test the changes and once approved, include it in the next month's release!

kgala2 avatar Jun 11 '25 18:06 kgala2

> Hi @kenkania
>
> Thanks for the PR - #2452, we will review and test the changes and once approved, include it in the next month's release!

Great, let me know if there's anything else I can do to ensure it makes it into the next release.

kenkania avatar Jun 16 '25 22:06 kenkania

Thanks, @kenkania! We've reviewed and merged the PR, and you'll see the changes in July's release. This update will definitely provide our customers with a lot more flexibility.

kgala2 avatar Jun 23 '25 19:06 kgala2