cloud-sql-go-connector
cloud-sql-go-connector copied to clipboard
Dial fails when function is suspended during eager refresh
Bug Description
I'm experiencing periodic dial errors from cloudsqlconn running in Vercel function.
The function in question experiences near-zero traffic during low periods, except for a cron ping every 5 minutes. When traffic is limited to just these 5-minute pings, the pings fail on db connection errors at multiples of hourly intervals (i.e. sometimes just an hour between, sometimes 2 or 3 hours).
I've added all the logging and tracing, and this is what I get from cloudsqlconn & pgx logs:
START RequestId: c2a894e2-29bb-4d2e-a657-432e881cb461
2025/05/11 10:10:00 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation started
END RequestId: c2a894e2-29bb-4d2e-a657-432e881cb461
START RequestId: 45f7d3c1-0c86-42bb-a4c9-c0630725b5a9
2025/05/11 10:15:00 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation failed, err = refresh failed: context deadline exceeded
2025/05/11 10:15:00 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation scheduled immediately
2025/05/11 10:15:00 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation started
2025/05/11 10:15:00 DEBUG: cloudsqlconn: [redacted] Removing connection info from cache: refresh failed: context deadline exceeded
2025/05/11 10:15:00 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation failed, err = refresh failed: context canceled
2025/05/11 10:15:00 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation scheduled immediately
2025/05/11 10:15:00 DEBUG: cloudsqlconn: [redacted] Instance is closed, stopping refresh operations
2025/05/11 10:15:00 ERROR [169.254.26.129/iTpgPaTqbJ-000021]: Connect map["database":"redacted" "err":"failed to connect to `redacted`: [redacted]:5432 (): dial error: timeout: context deadline exceeded" "host":"" "port":'ᔸ' "time":"146.621µs"]
END RequestId: 45f7d3c1-0c86-42bb-a4c9-c0630725b5a9
The client's retry request then successfully connects:
START RequestId: 1d5f156e-e6c7-4f52-b6cd-77b5fc322ed0
2025/05/11 10:15:01 DEBUG: cloudsqlconn: [redacted] Connection info added to cache
2025/05/11 10:15:01 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation started
2025/05/11 10:15:02 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation complete
2025/05/11 10:15:02 DEBUG: cloudsqlconn: [redacted] Current certificate expiration = 2025-05-11T11:10:00Z
2025/05/11 10:15:02 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation scheduled at 2025-05-11T11:06:00Z (now + 51m0s)
2025/05/11 10:15:02 DEBUG: cloudsqlconn: [redacted] Now = 2025-05-11T10:15:02Z, Current cert expiration = 2025-05-11T11:10:00Z
2025/05/11 10:15:02 DEBUG: cloudsqlconn: [redacted] Cert is valid = true
2025/05/11 10:15:02 DEBUG: cloudsqlconn: [redacted] Dialing redacted:3307
END RequestId: 1d5f156e-e6c7-4f52-b6cd-77b5fc322ed0
It looks to me like the function is suspended between the first two requests, causing the refresh to fail, and then the subsequent immediate refresh during the second request also fails.
I'm going to switch this to use the lazy refresher to try mitigating, but this feels like something the eager refresher should be able to handle.
Dialer created with
tokenSource := oauth2adapt.TokenSourceFromTokenProvider(creds)
d, err := cloudsqlconn.NewDialer(
context.Background(),
cloudsqlconn.WithDebugLogger(cloudConnLogger{}),
// This option might not be necessary as it looks like it's superceded by WithIAMAuthNTokenSources, but
// no harm in setting it.
cloudsqlconn.WithTokenSource(tokenSource),
cloudsqlconn.WithIAMAuthN(),
cloudsqlconn.WithIAMAuthNTokenSources(tokenSource, tokenSource),
)
Environment
- OS type and version: Vercel deployment so presumably Linux
- Go version: 1.24
- Cloud SQL Go Connector version: from go.mod:
cloud.google.com/go/cloudsqlconn v1.16.1 h1:0DaCn076Qr7H5TsHm65k/fzjAv4dNUQB2fq9ULzmXMU=
cloud.google.com/go/cloudsqlconn v1.16.1/go.mod h1:T6G4pmABtHqRyTF8M+oUe1KELh6Rs4PVSngAZDBz3zA=
Additional Details
No response
Another series of logs, in case it helps, with various refreshes happening, and only one failing:
START RequestId: 0cf0dac9-40c9-481e-bc31-5a89043760b5
2025/05/11 19:35:00.259151 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation failed, err = refresh failed: context deadline exceeded
2025/05/11 19:35:00.259171 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation scheduled immediately
2025/05/11 19:35:00.259183 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation started
END RequestId: 0cf0dac9-40c9-481e-bc31-5a89043760b5
START RequestId: e5edbf5b-a34c-48df-9b88-60488900ecc7
2025/05/11 19:40:00.261947 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation failed, err = refresh failed: context deadline exceeded
2025/05/11 19:40:00.261953 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation scheduled immediately
2025/05/11 19:40:00.261961 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation started
END RequestId: e5edbf5b-a34c-48df-9b88-60488900ecc7
START RequestId: 647f388f-68d5-4fbe-b6b7-20b0bf6752cf
2025/05/11 19:45:00.365038 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation failed, err = refresh failed: context deadline exceeded
2025/05/11 19:45:00.365053 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation scheduled immediately
2025/05/11 19:45:00.365067 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation started
2025/05/11 19:45:00.366008 DEBUG: cloudsqlconn: [redacted] Removing connection info from cache: refresh failed: context deadline exceeded
2025/05/11 19:45:00.366240 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation failed, err = refresh failed: context canceled
2025/05/11 19:45:00.366247 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation scheduled immediately
2025/05/11 19:45:00.366254 DEBUG: cloudsqlconn: [redacted] Instance is closed, stopping refresh operations
2025/05/11 19:45:00.366929 ERROR [169.254.25.109/0LxJ6V82Vc-000024]: Connect map["database":"redacted" "err":"failed to connect to `redacted`: [redacted]:5432 (): dial error: timeout: context deadline exceeded" "host":"" "port":'ᔸ' "time":"145.033µs"]
END RequestId: 647f388f-68d5-4fbe-b6b7-20b0bf6752cf
START RequestId: 76cc977e-0355-44d3-83bd-57a27fbf3deb
2025/05/11 19:45:01.538560 DEBUG: cloudsqlconn: [redacted] Connection info added to cache
2025/05/11 19:45:01.538594 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation started
2025/05/11 19:45:02.305285 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation complete
2025/05/11 19:45:02.305301 DEBUG: cloudsqlconn: [redacted] Current certificate expiration = 2025-05-11T19:55:00Z
2025/05/11 19:45:02.305305 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation scheduled at 2025-05-11T19:51:00Z (now + 6m0s)
2025/05/11 19:45:02.305318 DEBUG: cloudsqlconn: [redacted] Now = 2025-05-11T19:45:02Z, Current cert expiration = 2025-05-11T19:55:00Z
2025/05/11 19:45:02.305320 DEBUG: cloudsqlconn: [redacted] Cert is valid = true
2025/05/11 19:45:02.305337 DEBUG: cloudsqlconn: [redacted] Dialing redacted:3307
END RequestId: 76cc977e-0355-44d3-83bd-57a27fbf3deb
START RequestId: a739caea-68dd-4592-969e-7d48570c694e
END RequestId: a739caea-68dd-4592-969e-7d48570c694e
START RequestId: 2a653aea-1d12-4968-ad4a-34b23b901143
2025/05/11 19:55:00.236401 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation started
END RequestId: 2a653aea-1d12-4968-ad4a-34b23b901143
START RequestId: 7fb76e6c-8515-4f05-b1ee-32ecd9b75535
2025/05/11 20:00:00.261609 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation complete
2025/05/11 20:00:00.261627 DEBUG: cloudsqlconn: [redacted] Current certificate expiration = 2025-05-11T20:55:00Z
2025/05/11 20:00:00.261635 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation scheduled at 2025-05-11T20:51:00Z (now + 51m0s)
END RequestId: 7fb76e6c-8515-4f05-b1ee-32ecd9b75535
Hi @devnev
It's possible the certificate refresh isn't getting enough CPU time in Vercel when the certificates are already expired. As a workaround, I'd suggest trying lazy refresh, instead of the regular refresh strategy, in a serverless environment like Vercel. You can do this by using the 'WithLazyRefresh()' option.