
Change cert refresh behavior for managed-pool environments

baarden opened this issue · 1 comment

Feature Description

Add an option to change the cert refresh behavior to be more friendly for managed-pool environments.

When a pool of connections is maintained by SQLAlchemy, there is no need for the connections to do their own maintenance: SQLAlchemy already closes unhealthy connections and opens new ones as needed. And because pooled connections are not closed actively, the current refresh behavior can result in too many connections being held open, and in surges of refreshes every 55 minutes if multiple pools are opened at the same time. (For instance: if 10 Dataflow pipelines are launched at once, that could mean a minimum of 10 pipelines x 2 VMs x 2 vCPUs = 40 connection pools.)

Similarly, the current behavior of immediately retrying a failed refresh can lead to a thundering herd of refreshes once the Cloud SQL API requests-per-minute quota has been exceeded. In a managed pool it would be safer to wait 60 seconds for the next quota period.
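The quota-aware retry suggested above could be sketched roughly like this. This is an illustration of the proposed policy, not connector code: `fetch_certificate`, `QuotaExceededError`, and the 60-second window are all placeholders.

```python
import time

QUOTA_WINDOW_SECONDS = 60  # hypothetical: wait out the current quota period


class QuotaExceededError(Exception):
    """Stand-in for a Cloud SQL API rate-limit error."""


def refresh_with_quota_backoff(fetch_certificate, max_attempts=3, sleep=time.sleep):
    """Retry a failed cert refresh, but wait for the next quota period
    instead of retrying immediately (avoids a thundering herd)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_certificate()
        except QuotaExceededError:
            if attempt == max_attempts:
                raise
            # Proposed policy: sleep until the next requests-per-minute
            # quota window instead of retrying at once.
            sleep(QUOTA_WINDOW_SECONDS)
```

The `sleep` parameter is injected only so the wait can be observed or replaced in tests.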

Additional Context

Here is a PR as a possible implementation: https://github.com/GoogleCloudPlatform/cloud-sql-python-connector/pull/414

baarden avatar Jul 31 '22 20:07 baarden

Hi @baarden -- thanks for opening a FR (and PR). It's always appreciated. However, I'm not sure the proposed changes will fix the problem you are running into.

Add an option to change the cert refresh behavior to be more friendly for managed-pool environments.

I think the problem you are encountering is different from the one you've presented here. The current behavior is that multiple connections to the same instance (made using the same connector) reuse the refresh. There's nothing intrinsic to pooling that makes the current refresh behavior harmful (in fact, we generally optimize for the pooling scenario, since it's a production best practice).

The behavior your PR introduces is essentially "stop refreshing automatically", which is probably not preferable for most users/applications. If a user requests a connection and a refresh needs to happen first, the connector has to make additional API calls, which can delay getting a connection by up to 1s. If the refresh happens early (and asynchronously), there is more room to retry a failed API call.
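The tradeoff described here (refresh early in the background so the connect path never blocks, and a failed API call can be retried long before the cached cert expires) can be sketched like this. The class and names are illustrative only, not the connector's actual implementation; the 55-minute interval is chosen to land well before a roughly one-hour cert expiry.

```python
import threading

REFRESH_INTERVAL = 55 * 60  # refresh well before the ~1 hour cert expiry


class EagerRefresher:
    """Illustrative sketch: keep a fresh cert cached by a background
    timer so reading the cert never has to wait on an API call."""

    def __init__(self, fetch_certificate, interval=REFRESH_INTERVAL):
        self._fetch = fetch_certificate
        self._interval = interval
        self._cert = fetch_certificate()  # only the first fetch blocks
        self._timer = None
        self._schedule_next()

    def _schedule_next(self):
        self._timer = threading.Timer(self._interval, self._refresh)
        self._timer.daemon = True
        self._timer.start()

    def _refresh(self):
        try:
            self._cert = self._fetch()
        finally:
            # A failure here can be retried on the next cycle, long
            # before the currently cached cert actually expires.
            self._schedule_next()

    def current_cert(self):
        # Never blocks: the background refresh keeps this current.
        return self._cert

    def close(self):
        if self._timer is not None:
            self._timer.cancel()
```

With lazy (on-demand) refresh, `current_cert()` would instead have to perform the API calls inline whenever the cert had expired, which is the up-to-1s delay mentioned above.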

And because pooled connections do not get closed actively, the current refresh behavior can result in too many connections being held open

As I mentioned above, the refresh is cached between multiple connections to the same instance. 10 connections to "my-instance" is the same number of refreshes as 1 connection. Additionally, the Connector is only responsible for the creation of the connection - after that it hands it to SQLAlchemy to manage. If you have too many open connections, you probably need to tweak your SQLAlchemy settings.
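The caching described here (all connections to the same instance share one refresh result, so 10 connections cost the same number of refreshes as 1) can be sketched with a simple per-instance cache. This is illustrative only; the real connector caches asynchronous refresh tasks rather than a plain dict of results.

```python
import threading


class RefreshCache:
    """Sketch of per-instance refresh sharing: only the first connection
    to an instance triggers an API call; later connections reuse it."""

    def __init__(self, fetch_certificate):
        self._fetch = fetch_certificate
        self._lock = threading.Lock()
        self._certs = {}  # instance connection name -> cached cert

    def get(self, instance):
        with self._lock:
            if instance not in self._certs:
                # First connection to this instance: do the refresh.
                self._certs[instance] = self._fetch(instance)
            # Every later connection reuses the cached result.
            return self._certs[instance]
```

Under this scheme, opening many pooled connections to `"my-project:region:my-instance"` still performs a single certificate fetch.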

As you mentioned, pools open and close connections all the time, including recreating unhealthy or closed connections. Without an early refresh, the refresh operation simply blocks these new connections, which might introduce more problems.

if multiple pools were opened at the same time. (For instance: if 10 Dataflow pipelines are opened at once, it could mean a minimum of 10 pipelines x 2 VMs x 2 vCPUs = 40 connection pools.)

I'm not a Dataflow expert, but my understanding is that you should be using DoFn.setup to initialize your connection pool one time so it's shared between threads. This should lower the number of pools from at least 40 to 4?
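The DoFn.setup pattern suggested here looks roughly like the following. This is a stand-in class sketching the lifecycle, not real `apache_beam` code, and `make_pool` is a hypothetical factory (e.g. wrapping `sqlalchemy.create_engine` with a connector `creator`):

```python
class WriteToDatabase:
    """Stand-in for a Beam DoFn (illustrative, not apache_beam code):
    the pool is created once in setup() and reused by every process()
    call on that worker, instead of once per element or per bundle."""

    def __init__(self, make_pool):
        self._make_pool = make_pool  # hypothetical pool factory
        self._pool = None

    def setup(self):
        # Beam calls setup() once per DoFn instance per worker,
        # so the pool is shared across all elements it processes.
        self._pool = self._make_pool()

    def process(self, element):
        # Reuse the shared pool; never create one per element.
        self._pool.execute(element)

    def teardown(self):
        # Beam calls teardown() when the instance is discarded.
        if self._pool is not None:
            self._pool.dispose()
```

Creating the pool in `__init__` instead would not work, because the DoFn is pickled and shipped to workers before execution; `setup()` runs on the worker itself.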

kurtisvg avatar Aug 01 '22 14:08 kurtisvg

Closing due to inactivity

jackwotherspoon avatar Nov 28 '22 15:11 jackwotherspoon