GCP SQL PostgreSQL backend

Open greedy52 opened this issue 9 months ago • 3 comments

changelog: added GCP Cloud SQL for PostgreSQL backend support

Doc draft WIP: https://docs-6pnqngs1d-goteleport.vercel.app/docs/reference/backends/#google-cloud-iam-authentication

Testing checklist:

  • [x] change feed as a different database user (token creator)
  • [x] events
  • [x] credentials using service account impersonation through Workload Identity Federation with AWS
  • [x] Private IP for Cloud SQL
  • [x] Compatibility tests in pgbk for GCP SQL
  • [ ] load test (TBD v16 release testing)

Testing setup: see above doc preview

greedy52 avatar May 09 '24 20:05 greedy52

Is there a reason why we're not using cloud.google.com/go/cloudsqlconn and cloud.google.com/go/alloydbconn?

Part of the reason for not using the connector lib is to support authenticating as a different database user for the change feed.

When using a Cloud SQL connector with automatic IAM database authentication, the IAM account that you use to start the connector must be the same account that authenticates the database.

We could hack the connector somehow to authenticate as another database user, but we would probably end up with the same amount of code as I currently have (if not more).

Also, I personally find it harder to use project:region:instance than just the IP. The benefit of the connector is that it makes API calls to resolve the IP automatically. But you are supposed to specify programmatically how it chooses the IP, for example with cloudsqlconn.WithPrivateIP(), which can be a hassle if we want to expose that kind of option through our config.

What do you think? I can give it a shot using the connector if you feel it's worth pursuing.


Alternatively, we can support both project:region:instance and the IP as the host. If project:region:instance is specified, we use the connector lib and do not support authenticating as a different database user. If an IP is specified, it goes through the current implementation. I'd rather not do this, though. I would prefer to support only the IP in the first release and implement the resolver part later with custom logic if a customer asks for it.
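
To make the "support both" alternative concrete, here is a minimal Go sketch of how a configured host could be told apart before choosing a code path. The names classifyHost, hostKind, and the constants are hypothetical and not taken from the PR. It assumes the three-part project:region:instance form used in this thread and checks for an IP first, since IPv6 literals also contain colons.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// hostKind is a hypothetical label for how a configured host would be handled.
type hostKind string

const (
	hostIP             hostKind = "ip"
	hostConnectionName hostKind = "connection-name"
	hostUnknown        hostKind = "unknown"
)

// classifyHost reports whether host looks like an IP address or a
// project:region:instance connection name. IPs are checked first
// because IPv6 literals contain colons too.
func classifyHost(host string) hostKind {
	if net.ParseIP(host) != nil {
		return hostIP
	}
	parts := strings.Split(host, ":")
	if len(parts) != 3 {
		return hostUnknown
	}
	for _, p := range parts {
		if p == "" {
			return hostUnknown
		}
	}
	return hostConnectionName
}

func main() {
	hosts := []string{
		"10.0.0.5",
		"teleport-dev-320620:us-central1:steve-postgres-loadtest",
		"db.example.com",
	}
	for _, h := range hosts {
		fmt.Printf("%s -> %s\n", h, classifyHost(h))
	}
}
```

In this sketch, a connection-name result would route to the connector lib and an IP result to the direct-connection path; anything else would be rejected as a config error.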

greedy52 avatar May 16 '24 16:05 greedy52

Converted to draft to try out the connector lib

greedy52 avatar May 16 '24 19:05 greedy52

@espadolini I've updated the PR to use the connector lib. For sample usage, see the preview of the doc draft: https://docs-qhi80m1z1-goteleport.vercel.app/docs/reference/backends/#google-cloud-iam-authentication

Opening for review again.

greedy52 avatar May 21 '24 19:05 greedy52

Load Test:

Setup

  • Two e2-standard-8 (8vCPU, 32GB mem) VMs on GCP
  • Cloud SQL: DB version: PostgreSQL 15.5, vCPUs: 8, Memory: 32 GB, SSD storage: 250 GB

teleport.yaml:

version: v3
teleport:
  connection_limits:
    max_connections: 65000
    max_users: 1000
  log:
    severity: WARN
    format:
      output: text
  auth_server: auth.gcploadtest.dev.aws.stevexin.me
  storage:
    type: postgresql
    auth_mode: gcp-cloudsql
    gcp_connection_name: teleport-dev-320620:us-central1:steve-postgres-loadtest
    gcp_ip_type: "private"
    conn_string: postgresql://[email protected]@/teleport_backend

    audit_events_uri:
      - "postgresql://[email protected]@/teleport_audit#auth_mode=gcp-cloudsql&gcp_connection_name=teleport-dev-320620:us-central1:steve-postgres-loadtest&gcp_ip_type=private"
    audit_sessions_uri: "file:///var/lib/teleport/logs/"
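
The audit_events_uri above carries its backend options as key=value pairs in the URI fragment. As an illustration of that layout only, here is a small Go sketch that splits such a URI at the '#' and decodes the pairs. The helper name fragmentParams and the user name example-user are placeholders; this is not Teleport's own parsing code.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// fragmentParams splits uri at the first '#' and decodes the fragment as
// '&'-separated key=value pairs. Illustrative helper, not Teleport's parser.
func fragmentParams(uri string) (string, url.Values, error) {
	base, fragment, _ := strings.Cut(uri, "#")
	params, err := url.ParseQuery(fragment)
	if err != nil {
		return "", nil, err
	}
	return base, params, nil
}

func main() {
	// example-user is a placeholder; the real user name is not shown here.
	uri := "postgresql://example-user@/teleport_audit" +
		"#auth_mode=gcp-cloudsql" +
		"&gcp_connection_name=teleport-dev-320620:us-central1:steve-postgres-loadtest" +
		"&gcp_ip_type=private"

	base, params, err := fragmentParams(uri)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("connection string:", base)
	for _, key := range []string{"auth_mode", "gcp_connection_name", "gcp_ip_type"} {
		fmt.Printf("%s = %s\n", key, params.Get(key))
	}
}
```

The three fragment keys mirror the auth_mode, gcp_connection_name, and gcp_ip_type fields of the storage section.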

The run

Ran tctl loadtest for 30 minutes on both servers with no errors:

tctl loadtest node-heartbeats --count=15000 --ttl=2m --interval=1m --labels=2 --concurrency=32
[i] Setting up node hb load generation. count=15000, churn=0, labels=2, interval=1m0s, ttl=2m0s, concurrency=32
[i] Estimated serialized node size: 606 (bytes)
[i] Queued heartbeat batch for emission. generation=1, errors=0
[i] Queued heartbeat batch for emission. generation=2, errors=0
...
Screenshot 2024-06-03 at 11 42 48 AM

VM memory usage from top:

Screenshot 2024-06-03 at 11 41 03 AM

VM cpu and network usage:

Screenshot 2024-06-03 at 12 08 49 PM

Cloud SQL metrics

Screenshot 2024-06-03 at 12 11 14 PM Screenshot 2024-06-03 at 12 11 21 PM

Manual test during load test

  • Login/logout through web ui
  • SSH session through web ui
  • Add/Delete user through web ui
  • Run tctl nodes and tctl inventory commands

No slowness or error/warning logs observed during the test.

greedy52 avatar Jun 03 '24 16:06 greedy52

I will resolve the lint/merge conflict and cherry-pick the logger change then merge this. Thanks everyone!

greedy52 avatar Jun 03 '24 16:06 greedy52

FYI, the go.mod change updated google.golang.org/protobuf to v1.34.1, so I had to run make grpc. The only diffs after the run are version bumps:

-//     protoc-gen-go v1.34.0
+//     protoc-gen-go v1.34.1

https://github.com/gravitational/teleport/pull/41392/commits/eebc9629c16fe337ed7ad350e3ca9af592f9908c

Update: it turns out master has also been updated to v1.34.1, so I've merged with master again...

greedy52 avatar Jun 03 '24 18:06 greedy52

@greedy52 See the table below for backport results.

Branch        Result
branch/v15    Failed
branch/v16    Failed