cloud-sql-proxy
Provide a way to shutdown the process from another container
Feature Description
It is a common practice to run database migrations in an init container and wait for the migration job to finish. However, because the proxy never shuts down on its own, the init script stalls. It would be good to provide a way to shut down the proxy gracefully, for example an endpoint that other containers can call when they are finished.
Alternatives Considered
Not sure really. If Kubernetes had a native way to shut down the whole pod and deliver the signal to the container, that could work too. But I think an endpoint could be easier and more straightforward.
Additional Context
An example tutorial on how to run migrations with init containers, kubernetes jobs and k8s-wait-for: https://andrewlock.net/deploying-asp-net-core-applications-to-kubernetes-part-8-running-database-migrations-using-jobs-and-init-containers/
It seems there was a proposal for Kubernetes-native functionality for this, but it will be reworked and there is no timeline for it: https://github.com/kubernetes/enhancements/issues/753
I think an endpoint could be a good workaround for this in the meantime from the user's perspective.
I'm not entirely convinced an endpoint makes the most sense here. This problem was discussed in #128, and #206 added the -term_timeout flag to close that issue.
You could start the proxy with this flag, sleep long enough for your application to start up, and then send a SIGTERM to the proxy. It'll exit gracefully when the active connections hit 0 (or ungracefully when the term_timeout is reached).
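A minimal sketch of that pattern with the v1 proxy, run as a wrapper script in the proxy container (the instance name, sleep duration, and term_timeout value are placeholders):

# Start the proxy with a generous graceful-shutdown window.
/cloud_sql_proxy -instances=my-project:us-central1:my-instance=tcp:5432 -term_timeout=300s &
PROXY_PID=$!

# Give the application or job container time to start and do its work.
sleep 60

# Send SIGTERM: the proxy exits as soon as active connections reach 0,
# or when term_timeout elapses, whichever comes first.
kill -TERM "$PROXY_PID"
wait "$PROXY_PID"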
Additionally, this method was suggested as a workaround in the original issue. It essentially uses a sentinel file to tell a script when to kill the proxy.
Both seem like equal (or better?) options compared to an endpoint until k8s adds better support.
@villesau Does the term_timeout flag work for your use case?
term_timeout should work for our case (similar to @villesau), but I feel like the /shutdown endpoint is safer: I can guarantee it if the job takes 5 seconds or 5 hours. term_timeout forces me to set an arbitrary time to sleep, send a SIGTERM, and set an arbitrary timeout.
There's some prior art here with Linkerd: https://github.com/linkerd/linkerd2-proxy/pull/811
The correct answer is definitely Kubernetes handling this natively. In every other case it's a workaround.
I can guarantee it if the job takes 5 seconds or 5 hours. term_timeout forces me to set an arbitrary time to sleep
@arbourd Can you clarify this? To be clear, term_timeout sets a MAXIMUM amount of time the proxy waits before shutting down. It'll shut down early if all the connections are closed.
Understood. Some of our jobs may run for many hours, while others (like migrations) complete reasonably quickly. The migration task is far, far more common. It feels strange to set a 5h term_timeout. The workaround (trapping EXIT) is what we're using at the moment.
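For reference, a minimal sketch of that kind of trap-based workaround, assuming the job container and the proxy sidecar share a volume mounted at /tmp/pod (the path and file name are only illustrative, not anything the proxy itself knows about):

# In the job/migration container: mark completion on exit, success or failure.
trap 'touch /tmp/pod/terminated' EXIT
python manage.py migrate

# In the proxy sidecar's wrapper script: stop the proxy once the sentinel file appears.
/cloud_sql_proxy -instances=my-project:us-central1:my-instance=tcp:5432 &
PROXY_PID=$!
while [ ! -f /tmp/pod/terminated ]; do sleep 1; done
kill -TERM "$PROXY_PID"
wait "$PROXY_PID"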
I just ran into this while trying to use the proxy for Postgres migrations. What do you people think about something like an -idle-timeout x flag that would trigger a normal shutdown after x seconds if no new connections came in after the last connection finished? That could be useful for other types of jobs too; maybe it could be a good compromise while the community figures out how to manage container dependencies. (A rough external approximation of this is sketched below.)
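For what it's worth, a rough approximation of that proposed behavior is possible today from outside the proxy: a watcher script that sends SIGTERM after the listening port has had no established connections for some number of seconds. This is only a sketch; it assumes a shell and the ss utility are available in the image, and it starts counting idle time immediately rather than after the first connection:

IDLE_LIMIT=30
/cloud_sql_proxy -instances=my-project:us-central1:my-instance=tcp:5432 &
PROXY_PID=$!

idle=0
while [ "$idle" -lt "$IDLE_LIMIT" ]; do
  # Reset the idle counter whenever an established connection to the proxy port exists.
  if ss -Htn state established '( sport = :5432 )' | grep -q .; then
    idle=0
  else
    idle=$((idle + 1))
  fi
  sleep 1
done

kill -TERM "$PROXY_PID"
wait "$PROXY_PID"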
If we were to support this, I think following linkerd would be the way to go.
There's some prior art here with Linkerd: https://github.com/linkerd/linkerd2-proxy/pull/811
https://github.com/GoogleCloudPlatform/cloudsql-proxy/issues/828#issuecomment-947063317
Yes, I also would love to see this implemented. My use case is documented in #1115, but briefly, it is to use the proxy with Apache Airflow when deployed on Kubernetes using their official Helm chart, where the sidecar needs to be killed semi-manually after the main app has finished its work with the database.
Another option would be to follow Istio with its /quitquitquit endpoint.
Having an explicit shutdown endpoint would definitely help (I think /quitquitquit, following Istio's example, would be a great option). Additionally:
It'll shut down early if all the connections are closed.
This is not the behavior I am currently seeing with the proxy sidecar integration in the pega-helm-charts installer subchart. After the installer container completes, the cloud-sql-proxy logs show just as many "New connection for..." entries as "Client closed local connection on 127.0.0.1:5432" entries, yet the sql-proxy container is still running and refreshing the ephemeral certificate even though the main container has shut down cleanly (closing all connections).
To be clear, setting term_timeout makes the proxy wait up to that long after it receives a SIGTERM.
A workaround proposed elsewhere (not finding the comment at the moment) is to start the proxy with the alpine or buster container like so:
timeout 60 /cloud_sql_proxy --instances=<...> --term_timeout=3600s
The idea here is that the timeout command sends a SIGTERM so many seconds after starting (60s here), and then the proxy waits up to the term_timeout duration for all the connections to close before shutting down (3600s here, or longer). If all connections close before the duration has elapsed, the proxy will shut down early.
Ah, thanks @enocom, it seems I was missing the part about actually sending the proxy a SIGTERM. Using a timeout wrapper on the command, or having a shared volume that our installer container can write to so the proxy is notified to shut down when a given file appears, both seem like viable ways of initiating this SIGTERM. We could still set a term_timeout, but this may not be needed since all connections are being closed.
For completeness, here's a blog post that describes how to use a PID file to accomplish the same.
In plain bash the approach translates to:
$ /cloud_sql_proxy -instances=my-project:us-central1:my-instance=tcp:3306 \
    -credential_file=/secrets/cloudsql/credentials.json & CHILD_PID=$!
$ (while true; do if [[ -f "/tmp/pod/terminated" ]]; then
    kill $CHILD_PID; echo "Killed $CHILD_PID because the main container terminated."; break
  fi; sleep 1; done) &
$ wait $CHILD_PID
$ if [[ -f "/tmp/pod/terminated" ]]; then echo "Job completed. Exiting..."; exit 0; fi
I think an endpoint like /shutdown, or reading a file to be written, is likely the only solution that covers most of the different ways the SQL proxy could be used. The endpoint method seems easier since it wouldn't require the use of a shared volume.
Relying on timeouts could result in flaky behavior because it assumes that past a certain time in the future, if the connections drop to 0, then the proxy can exit. I believe that rests on the assumptions that the other container would only be using 1 connection, and that it is reasonable to try to predict how long the database operations will take to complete or how long the other containers will take to start. IMO those are not reasonable assumptions; therefore, I think it needs to be stated that using timeouts to enter the exiting state is not an acceptable solution for this issue. However, using a timeout for connections to close after the proxy has been explicitly told to shut down seems perfectly fine.
Adding a suggestion from @taylorchu from here:
We use in-pod supervisor (https://github.com/taylorchu/wait-for) to solve the sidecar container issue. It is simple and can be used even if you don't use k8s.
Random thought: we could add support for SIGQUIT or similar to have the Proxy shut down. That might sidestep the problems of securing an HTTP endpoint.
The current plan is to add a localhost-only /quitquitquit endpoint. If there are any open connections, the proxy will exit with code 143 (the conventional SIGTERM code), and with 0 otherwise if the shutdown is clean.
I find the quitquitquit endpoint is not very scalable, and it could easily create race conditions.
For example,
- you have a pod with 1 keystone container and 2 sidecars. Both sidecars will need to have this quitquitquit-like endpoint (they might even have different HTTP status codes and paths), and the keystone app needs to adapt and tightly couple to these sidecars.
- you have a pod with 2 keystone containers, which depend on 1 sidecar. Both keystone containers could send to the quitquitquit endpoint. If the sidecar exits from quitquitquit, then the other keystone container will need to watch for a broken connection and then decide to exit.
Now imagine the count is not 2 but >= 3, and you have many apps like this. All this additional/complex code added is because we run things on k8s; the application is bloated because of the "environment" you run it on.
I think this issue is more of a k8s problem and therefore out of scope.
It's definitely a K8s problem. The idea here is to ease the pain for people running jobs or init containers. And we'll make it opt-in with a CLI flag, so by default the behavior will be off.
What's a keystone container? I don't know that phrase and don't find much when searching for it.
https://github.com/kubernetes/enhancements/issues/2872
We'll be releasing this in the next version of v2 currently scheduled for February 21.
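For anyone wanting to try it once released, a rough sketch of how the opt-in might look, assuming the flag is named --quitquitquit and the endpoint is served on localhost:9091 (the flag name and port here are assumptions based on this thread, not confirmed documentation):

# Proxy sidecar: opt in to the shutdown endpoint (assumed flag name).
./cloud-sql-proxy my-project:us-central1:my-instance --quitquitquit

# Job/migration container: do the work, then ask the proxy to exit.
python manage.py migrate && curl -X POST localhost:9091/quitquitquit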
@enocom A question about this: will you also provide a way to have it ignore SIGTERM, or is that already possible?
Consider the following case:
- Pod scales down
- App AND proxy get a SIGTERM (because they are in the same pod)
- App's cleanup routine requires it to make new connections to the DB after getting a sigterm
- The proxy will refuse new connections after getting a sigterm
- This race condition shuts down the proxy before the app can access the DB
It doesn't matter if there is an endpoint the app can hit if the proxy is getting a SIGTERM from k8s. There needs to be a way to keep the proxy running.
I could use the entrypoint config hacks I've seen on GitHub pages to catch the SIGTERM in the entrypoint and ignore it, but it would be nice to support this in the proxy process itself! Maybe another endpoint the app could call to cancel shutdown? Or will the quitquitquit endpoint support setting a hard timeout so the app could force it to stay alive for X seconds after getting a SIGTERM?
Take a look at the --max-sigterm-delay flag, which should handle the use case you're describing.
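For reference, a minimal sketch of starting the v2 proxy with that flag (the instance name, port, and duration are placeholders):

# Wait up to 30s after SIGTERM for existing connections to drain before exiting.
./cloud-sql-proxy my-project:us-central1:my-instance \
    --port 5432 \
    --max-sigterm-delay 30s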
So I guess I was misunderstanding how --max-sigterm-delay worked. I thought using it meant the proxy would stay alive only while there are pending connections. I'm actually using that switch currently. You're saying that when the proxy gets a SIGTERM and --max-sigterm-delay is in effect, it will stay alive for the delay I set even if there are no open connections? So I'm wrong in thinking "The proxy will refuse new connections after getting a sigterm" if --max-sigterm-delay is set?
@enocom I'm actually using V1 and term_timeout; maybe the behavior changed in V2?
If you have open connections, the proxy will stay alive until the duration elapses. Otherwise it will shut down early. In either case, let's pull this discussion into a new feature request and we can explore whether there's something that needs to be changed here or elsewhere.
@enocom OK I created a feature request explaining my use case. Thanks!
https://github.com/GoogleCloudPlatform/cloud-sql-proxy/issues/1640
In case someone is struggling with the same problem: I've come up with something that works for me.
command: [ "sh", "-c", "python manage.py migrate && pkill -2 pgbouncer && curl -X POST localhost:9091/quitquitquit" ]
The -term_timeout flag doesn't work, and I'm going to have to insist on the fact that it does not work. It works locally on my Mac, yes, but it does not work in the gcr.io/cloudsql-docker/gce-proxy container, where it should absolutely work.
Let's talk through the details here: https://github.com/GoogleCloudPlatform/cloud-sql-proxy/issues/1742.