Provide performance benchmarks with and without Proxy

Open Stono opened this issue 2 years ago • 14 comments

Bug Description

Hello! I've been investigating reports from some of our users about latency spikes. I've narrowed it down to this: when an application receives a large burst of requests, its connection pool can grow very quickly. What I'm observing is that parallel connections to Cloud SQL (via the Cloud SQL Proxy) show a roughly linear increase in latency with the number of connections being made, and the increase affects all connection attempts, not just the first.

I have a test setup that will connect, perform a query, then disconnect.

For example, here's the timings around a single connection attempt:

{
  "query": 71,
  "connect": 70,
  "durationMs": 72
}

If I make 5 connection attempts in parallel:

[
  {
    "query": 109,
    "connect": 108,
    "durationMs": 110
  },
  {
    "query": 112,
    "connect": 110,
    "durationMs": 112
  },
  {
    "query": 118,
    "connect": 117,
    "durationMs": 119
  },
  {
    "query": 119,
    "connect": 116,
    "durationMs": 120
  },
  {
    "query": 124,
    "connect": 123,
    "durationMs": 124
  }
]

And here is 10:

[
  {
    "query": 147,
    "connect": 146,
    "durationMs": 147
  },
  {
    "query": 156,
    "connect": 155,
    "durationMs": 156
  },
  {
    "query": 160,
    "connect": 157,
    "durationMs": 160
  },
  {
    "query": 161,
    "connect": 159,
    "durationMs": 161
  },
  {
    "query": 163,
    "connect": 161,
    "durationMs": 163
  },
  {
    "query": 174,
    "connect": 173,
    "durationMs": 174
  },
  {
    "query": 180,
    "connect": 177,
    "durationMs": 180
  },
  {
    "query": 183,
    "connect": 179,
    "durationMs": 183
  },
  {
    "query": 184,
    "connect": 178,
    "durationMs": 184
  },
  {
    "query": 185,
    "connect": 182,
    "durationMs": 185
  }
]

The pattern continues the more connections I make.
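To put a rough number on the "linear increase", here is a minimal sketch that fits a straight line through the mean connect time of the three runs above. The figures are copied from the JSON in this comment; the names `runs`, `mean`, and `fitLine` are made up for illustration, and three data points are far too few for a real conclusion.

```typescript
// Connect times (ms) copied from the 1x, 5x and 10x runs reported above.
const runs: { n: number; connectMs: number[] }[] = [
  { n: 1, connectMs: [70] },
  { n: 5, connectMs: [108, 110, 117, 116, 123] },
  { n: 10, connectMs: [146, 155, 157, 159, 161, 173, 177, 179, 178, 182] },
]

const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length

// One (concurrency, mean connect ms) point per run.
const points = runs.map((r) => ({ n: r.n, ms: mean(r.connectMs) }))

// Ordinary least-squares line: ms ≈ intercept + slope * n.
function fitLine(pts: { n: number; ms: number }[]): { slope: number; intercept: number } {
  const nBar = mean(pts.map((p) => p.n))
  const msBar = mean(pts.map((p) => p.ms))
  let num = 0
  let den = 0
  for (const p of pts) {
    num += (p.n - nBar) * (p.ms - msBar)
    den += (p.n - nBar) * (p.n - nBar)
  }
  const slope = num / den
  return { slope, intercept: msBar - slope * nBar }
}

const fit = fitLine(points)
console.log(points, fit)
```

On these numbers the slope comes out at roughly 10 to 11 ms of extra connect time per additional parallel connection, which matches the eyeballed pattern.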

Example code (or command)

This is the crude bit of TypeScript I wrote to test this:


import { Client } from 'pg'

// Timings in ms for each attempt. connect and query are both measured from the start of the attempt.
const timings: { connect: number; query: number; durationMs: number }[] = []
const connectionTest = async (): Promise<void> => {
  const start = new Date()
  const pgClient = new Client({
    host: 'postgres',
    port: 5432,
    database: 'istio_test',
    // requireEnv is a helper from the surrounding app (not shown)
    user: requireEnv('AT_POSTGRES_USERNAME')
  })
  await pgClient.connect()
  const connect = new Date().getTime() - start.getTime()
  const results = await pgClient.query('SELECT 1 + 1 AS solution')
  if (results.rows[0].solution.toString().trim() !== '2') {
    throw new Error('Did not get the expected response from cloudsql')
  }
  const query = new Date().getTime() - start.getTime()
  await pgClient.end()
  const end = new Date()
  const durationMs = end.getTime() - start.getTime()
  timings.push({ query, connect, durationMs })
}

// Start all attempts at once, then wait for every one to finish.
const promises: Promise<void>[] = []

// req is the HTTP request of the surrounding handler (not shown)
const iterations = parseInt(req.query.iterations ?? '1', 10)
for (let i = 0; i < iterations; i += 1) {
  promises.push(connectionTest())
}
await Promise.all(promises)

This uses the pg library. However, all of our users are on Java, so this doesn't look like a client library issue.

Steps to reproduce?

Code sample provided above

Environment

  1. OS type and version: Rocky Linux 8
  2. Cloud SQL Proxy version (./cloud-sql-proxy --version): 2.5.0
  3. Proxy invocation command (for example, ./cloud-sql-proxy --port 5432 INSTANCE_CONNECTION_NAME): --private-ip --prometheus --http-address 0.0.0.0 --http-port 9739 --auto-iam-authn, termination period: 30

Additional Details

  • I have reproduced this locally using a local cloudsql proxy, and gcloud default credentials to connect to the instance.
  • I have reproduced this issue without using IAM (username and password instead).

Stono avatar Jul 13 '23 08:07 Stono

Been doing a bit more testing:

Via proxy (SSL), IAM enabled:

1x connection: {"connect":181,"query":211,"durationMs":216}
50x connections: {"connect":793.9,"query":825.68,"durationMs":828}

Via proxy (SSL), username/password:

1x: {"connect":152,"query":183,"durationMs":188}
50x: {"connect":406.48,"query":436.56,"durationMs":437.86}

No proxy, username/password, no SSL:

1x: {"connect":145,"query":169,"durationMs":169}
50x: {"connect":299.44,"query":329.9,"durationMs":329.9}

This is starting to look more like a cloudsql problem rather than a proxy problem?
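A small sketch of the arithmetic behind that question, using only the connect figures from this comment: how much each setup slows down going from 1x to 50x, and how much it adds on top of the direct connection. The names `Setup`, `direct`, and `rows` are illustrative. Note that the direct run also had SSL turned off, so the "extra" columns mix proxy overhead with TLS overhead.

```typescript
interface Setup {
  name: string
  connect1: number // connect time (ms) with 1 connection
  connect50: number // connect time (ms) with 50 parallel connections
}

// Figures copied from the three runs reported above.
const direct: Setup = { name: 'no proxy, password, no SSL', connect1: 145, connect50: 299.44 }
const setups: Setup[] = [
  { name: 'proxy (SSL), IAM', connect1: 181, connect50: 793.9 },
  { name: 'proxy (SSL), password', connect1: 152, connect50: 406.48 },
  direct,
]

const rows = setups.map((s) => ({
  name: s.name,
  // How many times slower a connect is at 50x than at 1x.
  slowdown: s.connect50 / s.connect1,
  // Extra connect time compared with the direct run at the same concurrency.
  extraAt1: s.connect1 - direct.connect1,
  extraAt50: s.connect50 - direct.connect50,
}))

for (const r of rows) {
  console.log(
    `${r.name}: ${r.slowdown.toFixed(2)}x slower at 50x, ` +
      `+${r.extraAt1.toFixed(0)} ms at 1x, +${r.extraAt50.toFixed(0)} ms at 50x vs direct`
  )
}
```

Even the direct connection roughly doubles its connect time at 50x, which is what points away from the proxy as the only cause.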

Stono avatar Jul 13 '23 10:07 Stono

Thought i'd try against pg14 local docker instances:

1x: {"connect":30,"query":32,"durationMs":37}
50x: {"connect":88.18,"query":91.74,"durationMs":99.04}
100x: {"connect":115.23,"query":121.95,"durationMs":132.03}

So now I'm questioning my test script... going to try with a different pg client and eventually a different language.

Certainly one interesting observation here regardless is that IAM is roughly 2x as slow as username/password on new connections at 50x concurrency (793.9 ms vs 406.48 ms to connect).

Stono avatar Jul 13 '23 11:07 Stono

Thanks for the response @Stono.

We don't provide performance benchmarks, but increasingly see a need for it. Let's repurpose this issue for our team to provide some baseline numbers for comparison.

Generally, the Proxy will introduce some overhead as it's a few extra hops to your database (localhost -> proxy -> server side proxy -> localhost db). Separately, we recommend scaling the Proxy with CPU for more throughput and more memory for more connections.

Finally, if you're writing an app in Node.js, we do have a connector now: https://github.com/GoogleCloudPlatform/cloud-sql-nodejs-connector. That will eliminate some of the latency and is worth trying out.

enocom avatar Jul 13 '23 16:07 enocom

Hey @enocom, thanks for the response. Pretty sure the latency we're observing now is not the proxy anyway. Even testing against a local Postgres instance with no proxy I observe similar behaviour: if you fire a batch of connection requests at Postgres in parallel, they all take a long time to connect. Adding TLS and Workload Identity on Cloud SQL exacerbates that.

That said, latency numbers for the proxy are always welcome to help folks make informed decisions, particularly for latency-sensitive applications! You'll probably see what I'm seeing if you start benchmarking the proxy yourself. Try 1, then 10, then 50 concurrent connections.

For context: we use the proxy as a language-agnostic way to wrap up connecting to Cloud SQL on our internal platform (so the apps don't need to worry about it, and they're written in Python, Java, Node, bash, whatever haha). It works really nicely. I actually much prefer this approach to getting people to instrument their code with libraries that need to be kept up to date (we have circa 600 apps).
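For anyone wanting to repeat the 1 / 10 / 50 comparison, here is a minimal harness sketch. It is not the script used in this thread: `burst`, `benchmark`, and `fakeConnect` are hypothetical names, and `fakeConnect` is a timer-based stand-in that you would swap for a real connect, query, disconnect cycle.

```typescript
type Connect = () => Promise<void>

// Start `concurrency` attempts at the same time and return each attempt's elapsed ms.
async function burst(connect: Connect, concurrency: number): Promise<number[]> {
  const attempts: Promise<number>[] = []
  for (let i = 0; i < concurrency; i += 1) {
    const start = Date.now()
    attempts.push(connect().then(() => Date.now() - start))
  }
  return Promise.all(attempts)
}

interface LevelResult {
  concurrency: number
  meanMs: number
  maxMs: number
}

// Run the bursts one level after another so they never overlap.
async function benchmark(connect: Connect, levels: number[]): Promise<LevelResult[]> {
  const out: LevelResult[] = []
  for (const concurrency of levels) {
    const ms = await burst(connect, concurrency)
    const meanMs = ms.reduce((a, b) => a + b, 0) / ms.length
    out.push({ concurrency, meanMs, maxMs: Math.max(...ms) })
  }
  return out
}

// Hypothetical stand-in for a real connect + query + disconnect.
const fakeConnect: Connect = () =>
  new Promise<void>((resolve) => {
    setTimeout(() => resolve(), 5)
  })

const benchmarkDone = benchmark(fakeConnect, [1, 10, 50])
benchmarkDone.then((rows) => console.log(rows))
```

Running the levels sequentially keeps one burst from inflating the next; with a real database you would probably also want a pause between levels so server-side backends have time to exit.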

Stono avatar Jul 13 '23 16:07 Stono

This is definitely an area where we'd like to provide more guidance to help folks make their own measurements and compare against what we expect customers to see.

I'll update here when we have more on the topic.

enocom avatar Jul 13 '23 16:07 enocom

Hi! Wanted to also mention that performance benchmarks would be appreciated. Our company currently deploys on GKE with each pod having our application container, and cloud sql proxy as a sidecar. Some questions I've had when looking into potential solutions:

  • is it better to use the python client library? or have an app+sidecar? not sure if python overhead is better than k8s traffic overhead.
  • is there a performance benefit in upgrading to v2? we currently use v1.

maybe my questions can help the team think through what kinds of benchmarks are useful. thanks in advance!

honDhan avatar Jul 14 '23 19:07 honDhan

In-process connectors will provide a better user experience and will likely be faster given they don't have to do the localhost hop and go straight to the database's proxy server.

We haven't done any formal benchmarking of v1 vs v2, but v2 will start up faster. v2 offers a bunch of additional benefits like support for tracing (as shown above), prometheus support, and others. So we do recommend upgrading.

enocom avatar Jul 14 '23 21:07 enocom

@enocom thanks for the info! Is there a "Why upgrade to v2" page available somewhere? Many thanks in advance!

bobintornado avatar Jul 25 '23 02:07 bobintornado

We have a migration guide, but don't explicitly compare v1 and v2 other than here.

In addition to the feature list in the README, v2 will start up faster and continue to get new features. v1 meanwhile will continue to get security updates, but new features will land in v2.

enocom avatar Jul 25 '23 18:07 enocom

Just sharing this here as I found it interesting: https://twitter.com/BdKozlovski/status/1684098236426878976?t=SWYsfn24ltvFSyEOKHjjEQ&s=19

Cloudflare use Postgres at scale and point out how expensive connections are and how they use https://www.pgbouncer.org to mitigate that.

Led me down a rabbit hole wondering if cloudsql proxy could implement such a pattern. Probably scope creep but would be cool.
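To illustrate why pooling helps with bursts, here is a toy sketch of the general idea: cap the number of open connections and hand existing ones to waiting callers instead of opening a new one per request. This is not how PgBouncer or the Cloud SQL Proxy is implemented; `SimplePool` and the fake connection factory are hypothetical and exist only for this example.

```typescript
// Toy pool: opens at most `max` connections and reuses them across callers.
class SimplePool<T> {
  private idle: T[] = []
  private waiters: ((conn: T) => void)[] = []
  private created = 0
  private create: () => Promise<T>
  private max: number

  constructor(create: () => Promise<T>, max: number) {
    this.create = create
    this.max = max
  }

  async acquire(): Promise<T> {
    const conn = this.idle.pop()
    if (conn !== undefined) return conn
    if (this.created < this.max) {
      this.created += 1
      try {
        return await this.create() // the expensive step, paid at most `max` times
      } catch (err) {
        this.created -= 1 // free the slot if the connection could not be opened
        throw err
      }
    }
    // At the cap: wait until another caller releases a connection.
    return new Promise<T>((resolve) => this.waiters.push(resolve))
  }

  release(conn: T): void {
    const next = this.waiters.shift()
    if (next) next(conn)
    else this.idle.push(conn)
  }

  // Borrow a connection for the duration of `fn`, then give it back.
  async use<R>(fn: (conn: T) => Promise<R>): Promise<R> {
    const conn = await this.acquire()
    try {
      return await fn(conn)
    } finally {
      this.release(conn)
    }
  }
}

// Fake connection factory that counts how many connections get opened.
let opens = 0
const pool = new SimplePool(async () => {
  opens += 1
  return { id: opens }
}, 2)

// A burst of 10 requests shares the 2 pooled connections.
const poolDemo = Promise.all(
  Array.from({ length: 10 }, () => pool.use(async (conn) => conn.id))
)
poolDemo.then((ids) => console.log(ids, 'connections opened:', opens))
```

The trade-off is that callers beyond the cap queue for a connection instead of each paying the connect cost, which is usually the better deal when connects are as expensive as the numbers in this thread suggest.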

Stono avatar Jul 26 '23 09:07 Stono

For what it's worth, a person can run the Proxy behind pgbouncer. And we have an example here of how to do that in a basic way.

enocom avatar Aug 01 '23 17:08 enocom