cloud-sql-proxy

Reload service account keyfile periodically

Open kevincvlam opened this issue 6 years ago • 20 comments

Hi,

We run the Cloud SQL proxy in our Kubernetes cluster as a deployment, and we sometimes rotate the secret that provides the credentials file for IAM authentication.

As a result the credentials loaded at start-up of the proxy become invalid and the proxy begins printing invalid credentials errors, but does not error out. What's the recommended way to handle this situation? Is there a way to have the proxy reload the credentials?

My understanding is that mounted secrets are updated automatically, so it's up to the application to respond accordingly:

> Mounted Secrets are updated automatically
>
> When a secret being already consumed in a volume is updated, projected keys are eventually updated as well. The update time depends on the kubelet syncing period.

kevincvlam avatar Sep 06 '18 15:09 kevincvlam

Hey @kevincvlam, thanks for bringing this issue to our attention.

We discussed this issue this morning, and decided that currently the only way to reload the credentials would be to restart the container with the proxy inside. This is obviously not an ideal solution, so we are investigating ways we could handle this. Currently we are looking into the following:

  1. Reload credentials hourly during SSL cert refresh
  2. Attempt credentials reload upon receiving an invalid credentials error
  3. Potentially exit with error code if failing to retrieve valid credentials after X minutes

We'll use this issue to post progress updates.

kurtisvg avatar Sep 06 '18 18:09 kurtisvg

Hey @kurtisvg, thanks for the quick reply, and looking forward to your solution!

Do you have any idea regarding when you expect the issue to be resolved?

kevincvlam avatar Sep 10 '18 18:09 kevincvlam

Unfortunately, I don't have any promises to make at the moment, just that it's in the queue and the team will get to it when we can. If you have any expertise in this area, we are open to contributions.

kurtisvg avatar Sep 10 '18 19:09 kurtisvg

Hey folks,

This is affecting us as well. The setup we have is that we store the service account key in a Kubernetes secret, which is mounted to the Cloud SQL Proxy sidecar. And we rotate the service account key every day, and replace the content of the secret.

And as far as I understand, when we change the content of the secret, that change is automatically propagated to the mounted file seen by the running proxy container. (probably this is the same setup @kevincvlam described?)

This is not handled by the proxy, so if the mounted key file changes, that's not picked up by the running proxy, right?
Do you have any update on the timeline when this improvement can be expected?

Thanks!

markvincze avatar Mar 22 '19 16:03 markvincze

In our golang applications we handle the reloading of the key by reinitialising an instance of the class that uses the service account key with code like this:

dnsService := NewGoogleCloudDNSService(*googleCloudDNSProject, *googleCloudDNSZone)

foundation.WatchForFileChanges(os.Getenv("GOOGLE_APPLICATION_CREDENTIALS"), func(event fsnotify.Event) {
	log.Info().Msg("Key file changed, reinitializing dns service...")
	dnsService = NewGoogleCloudDNSService(*googleCloudDNSProject, *googleCloudDNSZone)
})

See https://github.com/estafette/estafette-google-cloud-dns/blob/09eaf7f4123b6c4a012837f2415893219456d137/main.go#L81-L84 and https://github.com/estafette/estafette-foundation/blob/master/foundation.go#L104-L161 for implementation details.

Works like a charm and relies on the github.com/fsnotify/fsnotify library, which doesn't bring in too many dependencies.

JorritSalverda avatar Mar 25 '19 12:03 JorritSalverda

I made some changes that address the failures that I've been seeing. It's not comprehensive, and it's pretty hacktastic, but it's survived a day of having Vault rotate the service account keys from underneath it and the new keys mounted into the k8s container. There are three specific points where it can recover: when the credential file is missing or corrupt at startup; at first connection; and failure to rotate the ephemeral cert. I'm sure there are many other places it could fail, but those are the ones I've been running into.

I'm not going to submit a PR in this state, but I figured if anyone else had a need for this, they could take what I have. If it's within shouting distance of being acceptable, though, I can try to polish it up a bit.

dhduvall avatar Nov 08 '19 19:11 dhduvall

Missing ability to reload service account keyfile is still an open issue. The only workaround is described in https://github.com/GoogleCloudPlatform/cloudsql-proxy/issues/770 which is basically:

  1. update the keyfile
  2. stop with kill -s SIGTERM "$PPID";
  3. start again with /cloud_sql_proxy ...

gw0 avatar Jun 24 '21 11:06 gw0

Related to https://github.com/GoogleCloudPlatform/cloud-sql-proxy/issues/1045.

enocom avatar Nov 17 '22 21:11 enocom

Bumping the priority given the interest here.

enocom avatar Nov 17 '22 21:11 enocom

Hi all,

This is also affecting us.

I am running cloud_sql_proxy in a sidecar container in a number of our pods.

As soon as I update our secret, cloud_sql_proxy starts failing because it is still using the old secret that it has in memory.

We cannot resort to SIGHUPping the process as the image is prebuilt and controlled by my organisation (and I cannot modify it), but also this is a workaround rather than a solution.

At the moment I have resorted to deleting all active pods after a key renewal (luckily, we only have to do it once a month), but this is obviously a worse workaround than SIGHUP.

Could a more appropriate solution be provided please?

Many thanks!

gdafl avatar Nov 23 '22 12:11 gdafl

Hi,

if you are running your workload within GKE you should evaluate "workload identity" as this is the recommended way. With workload identity you don't have to mess around with JSON keys at all. Nevertheless this issue is still relevant for workloads running outside the Google ecosystem!

UnsignedLong avatar Nov 23 '22 12:11 UnsignedLong

Workload identity does sidestep these problems and is the best solution if you're running in GKE.

Otherwise, we're probably looking at some kind of watcher implementation based on fsnotify. Perhaps this is something people should have to opt-in to as well with a CLI flag.

enocom avatar Nov 23 '22 16:11 enocom

> Hi,
>
> if you are running your workload within GKE you should evaluate "workload identity" as this is the recommended way. With workload identity you don't have to mess around with JSON keys at all. Nevertheless this issue is still relevant for workloads running outside the Google ecosystem!

Just a quick update, I switched to Workload Identities for our GKE cloud-sql-proxy sidecars and it's working perfectly.

A solution to this issue would still be useful for non-GKE based deployments though.

Many thanks again for the suggestion!

gdafl avatar Nov 29 '22 16:11 gdafl

It would be helpful to know how many people want this outside of GKE.

If you're running in GKE, then we strongly recommend using workload identity. Otherwise, this might be useful, but again if the ask here is mostly from GKE workloads, then it's probably not a big priority.

enocom avatar Feb 01 '23 03:02 enocom

> It would be helpful to know how many people want this outside of GKE.
>
> If you're running in GKE, then we strongly recommend using workload identity. Otherwise, this might be useful, but again if the ask here is mostly from GKE workloads, then it's probably not a big priority.

Personally, I switched to workload identities as soon as it was suggested which made this issue moot.

I do still think it's a good feature to add though, to align what cloudsql-proxy does with what GKE does when a secret is updated.

Thanks!

gdafl avatar Feb 01 '23 08:02 gdafl

Given the prevalence of workload identity, we're going to hold off on this feature. If there's interest in the future, please re-open with why it's useful.

enocom avatar Aug 15 '23 21:08 enocom

I have on-premises workloads accessing Cloud SQL. As workload identity is unavailable in my (and other) environments, I still see a huge benefit in this feature.

UnsignedLong avatar Aug 16 '23 10:08 UnsignedLong

Re-opening in that case. What are you using to refresh your credentials file?

enocom avatar Aug 16 '23 16:08 enocom