cloud-sql-jdbc-socket-factory Add a "lazy strategy" option that prevents refreshes from happening in the background

Bug Description

Every now and then we see a failure to refresh the ephemeral certificate used to connect to Cloud SQL. Most of the time the service retries and succeeds, so the failure only shows up as an error in the log. That is still a nuisance, since it adds a lot of noise to our monitoring. Occasionally, though, the retry also fails, and we lose traffic because we have no certificate to authenticate to the DB.

The error that we get is this:

Got more than one input failure. Logging failures after the first
java.lang.RuntimeException: [...] Failed to update metadata for Cloud SQL instance.
	at com.google.cloud.sql.core.CloudSqlInstance.addExceptionContext(CloudSqlInstance.java:598)
	at com.google.cloud.sql.core.CloudSqlInstance.fetchMetadata(CloudSqlInstance.java:505)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:74)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Error writing to server
	at java.base/sun.net.www.protocol.http.HttpURLConnection.writeRequests(HttpURLConnection.java:718)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.writeRequests(HttpURLConnection.java:730)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1613)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1520)
	at java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:527)
	at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:334)
	at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:36)
	at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:152)
	at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:84)
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1012)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:514)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:455)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)
	at com.google.cloud.sql.core.CloudSqlInstance.fetchMetadata(CloudSqlInstance.java:460)
	... 9 more

Environment

OS type and version: Cloud Run Service
Java SDK version:

Base docker image is eclipse-temurin:11-jre-focal

openjdk 11.0.16.1 2022-08-12 OpenJDK Runtime Environment Temurin-11.0.16.1+1 (build 11.0.16.1+1) OpenJDK 64-Bit Server VM Temurin-11.0.16.1+1 (build 11.0.16.1+1, mixed mode, sharing)

Socket Factory version: com.google.cloud.sql:jdbc-socket-factory-core:jar:1.6.3

Sep 21 '22 17:09 johanblumenberg

We have been in contact with Google support, but after 3 months the only sensible response we have gotten so far is that it might be related to the fact that Cloud Run uses CPU throttling and does not support background tasks. Since the refresh is done periodically as a background task, this might be an issue.

This is the support ticket: https://console.cloud.google.com/support/cases/detail/v2/29992086?project=veritru-dev-332314

Sep 21 '22 17:09 johanblumenberg

We have created a patch to remove the background process, and this seems to solve the problem. We have used the same patch both in our Cloud Run services and in our Cloud Functions, and so far it seems to work. We have not seen any failures in our logs since we applied the patch.

We have been running with this patch for 8 days now, with zero errors. Before applying the patch we could see failures every other day or so. We have also seen that we can avoid the error in our Cloud Run services by enabling CPU always on, which also indicates that the problem is indeed related to the background process. Unfortunately this workaround is not available for Cloud Functions.

This patch solves the problem for us: https://github.com/veritru/google-cloud-patch/commit/bf94b6b2e96dc2ec3bd6e53d06e433254d715bbe

This solution is probably not ideal, because it completely disables the background process and refreshes the certificate on the thread that needs it while processing a request. In cases where you don't use CPU throttling you probably would like the certificate to be updated in the background, instead of adding a delay to the request processing.

Sep 21 '22 17:09 johanblumenberg

So to be clear, it doesn't look like you've removed the background process, but have just made it force a new refresh before an error has occurred rather than after. It's still probably happening in the background, but now it'll fail silently.

Maybe an ideal solution would be to offer some option to specify a retry strategy, and add a "lazy" option that retries as needed rather than automatically.

Sep 21 '22 18:09 kurtisvg

As I wrote, this is just a proposal. There are probably better ways to solve the problem.

> So to be clear, it doesn't look like you've removed the background process, but have just made it force a new refresh before an error has occurred rather than after.

There is no background process. When there is no traffic towards our services, no refresh is done. I would see in the logs if a refresh happened, and there is none.

The constructor still schedules a single refresh operation, because the currentInstanceData and nextInstanceData member variables should not be null. Otherwise you would have to handle the special case where these variables are null on the first access. Once the first refresh finishes, no new job is scheduled automatically.

I removed the code that schedules another refresh automatically. Instead it is forced when you access the SSL data. So the pending refresh job only exists for a short time while it is being executed. It is never scheduled to run in the future.
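The behavior described above (no refresh scheduled for the future, a refresh forced on the thread that accesses the data) can be illustrated with a minimal sketch. This is not the code from the patch or from the socket factory: the class `LazyRefresher`, its constructor parameters, and the four-minute buffer in the usage note are all hypothetical names and values chosen for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Function;
import java.util.function.Supplier;

/**
 * Hypothetical sketch of a lazy refresh: nothing runs in the background.
 * The value is fetched on the calling thread, and only when it is missing
 * or about to expire.
 */
final class LazyRefresher<T> {
  private final Supplier<T> fetcher;            // assumed: fetches a fresh value (e.g. certificate data)
  private final Function<T, Instant> expiryOf;  // assumed: reads the expiration time of a value
  private final Duration refreshBuffer;         // refresh this long before the value expires
  private final Supplier<Instant> now;          // injectable time source, so the sketch is testable
  private T current;

  LazyRefresher(
      Supplier<T> fetcher,
      Function<T, Instant> expiryOf,
      Duration refreshBuffer,
      Supplier<Instant> now) {
    this.fetcher = fetcher;
    this.expiryOf = expiryOf;
    this.refreshBuffer = refreshBuffer;
    this.now = now;
  }

  /** Returns a usable value, refreshing it first on the caller's thread if needed. */
  synchronized T get() {
    Instant deadline = now.get().plus(refreshBuffer);
    if (current == null || !deadline.isBefore(expiryOf.apply(current))) {
      // If the fetch throws, the exception reaches the caller and the old value is kept.
      current = fetcher.get();
    }
    return current;
  }
}
```

A caller would construct it with something like `new LazyRefresher<>(fetchCert, cert -> cert.expiration(), Duration.ofMinutes(4), Instant::now)`, where `fetchCert` and `expiration()` are placeholders. The trade-off is the one noted earlier in the thread: the request that triggers the refresh pays for its latency.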

> It's still probably happening in the background, but now it'll fail silently.

It's not failing silently. We would still see the failed refresh in the logs, even if it doesn't cause any incoming traffic to fail. We have not seen any refresh failures after this patch was applied.

> Maybe an ideal solution would be to offer some option to specify a retry strategy, and add a "lazy" option that retries as needed rather than automatically.

Yes, I think that would be a good idea

Sep 21 '22 18:09 johanblumenberg

Posting some rationale here for why we consider this a P2 for now:

We believe it's a fairly rare occurrence: it occurs when the process is throttled after the refresh has already started but before it is allowed to complete. If the process is throttled before the refresh has started, it's unlikely to have enough CPU to begin until the process is unthrottled.
It's a fairly invasive change: currently we use a 2 thread executor that's shared between all of the Cloud SQL instances. We need to decide how to handle that executor if we don't want to execute requests in the background. We also need to make the behavior configurable and persist in a logical way with the current behavior, which is preferred for most users. There's a minimum of a few weeks of work to come up with a design and verify it doesn't introduce any new issues.
A workaround is fairly simple: because a new refresh is triggered immediately after a scheduled refresh fails, and that refresh blocks future connections, requesting a new connection should lead to a successful refresh. Trying a second time to grab a new connection should allow a refresh operation to complete successfully, e.g.:
a. Refresh operation starts
b. Process is throttled
c. Process is unthrottled
d. Refresh operation throws exception because the interaction with the Admin API has expired
e. A new refresh operation is scheduled, blocking future connection requests
f. App/Pool grabs a new connection, which blocks until the refresh is complete
g. Refresh operation completes -> connect attempt is successful
h. Request is complete, process is throttled again until the next request
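On the application side, the workaround amounts to asking for a connection a second time when the first attempt fails. A minimal sketch, assuming a hypothetical helper named `ConnectRetry` (not part of the socket factory) that wraps whatever call the app or pool uses to obtain a connection:

```java
import java.util.concurrent.Callable;

/** Hypothetical helper: retries a connection attempt a bounded number of times. */
final class ConnectRetry {
  private ConnectRetry() {}

  /**
   * Calls {@code connect} up to {@code maxAttempts} times and returns the first
   * successful result. If every attempt fails, the last exception is rethrown.
   */
  static <T> T withRetries(Callable<T> connect, int maxAttempts) throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        // Per the steps above, a later attempt is expected to block until
        // the newly scheduled refresh completes.
        return connect.call();
      } catch (Exception e) {
        last = e;
      }
    }
    throw last;
  }
}
```

With a standard `javax.sql.DataSource` this could be used as `Connection conn = ConnectRetry.withRetries(dataSource::getConnection, 2);`. Many connection pools already retry internally, in which case a helper like this may be unnecessary.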

Oct 18 '22 23:10 kurtisvg

For anyone discovering this issue, note that on Cloud Run, background activity (such as the refresh done by the Java Connector here) is not expected to work reliably. Instead, you must enable "CPU always on" for the background refresh to work reliably.

This issue then is to remove the need to run a background refresh thread, through some user-settable configuration.

Oct 13 '23 18:10 enocom
