cloud-sql-proxy

Provide a way to shutdown the process from another container

Open villesau opened this issue 3 years ago • 15 comments

Feature Description

It is common practice to run database migrations in an init container and wait for the migration job to finish. However, because the proxy never shuts down on its own, the init script stalls. It would be good to provide a way to shut down the proxy gracefully, for example an endpoint that other containers can call when they are finished.

Alternatives Considered

Not sure, really. If Kubernetes had a native way to shut down the whole pod and deliver the signal to the containers, that could work too. But I think an endpoint would be easier and more straightforward.

Additional Context

An example tutorial on how to run migrations with init containers, kubernetes jobs and k8s-wait-for: https://andrewlock.net/deploying-asp-net-core-applications-to-kubernetes-part-8-running-database-migrations-using-jobs-and-init-containers/

villesau avatar Jun 28 '21 10:06 villesau

Seems that there was a proposal for kubernetes native functionality for this, but it will be reworked and there is no timeline for it: https://github.com/kubernetes/enhancements/issues/753

I think an endpoint could be a good workaround for this in the meantime from the user perspective.

villesau avatar Jun 28 '21 15:06 villesau

I'm not entirely convinced an endpoint makes the most sense here. This problem was discussed in #128. #206 added the -term_timeout flag to close that issue.

You could start the proxy with this flag, sleep for some amount of time to let your application start up, and then send a SIGTERM to the proxy. It'll exit gracefully when the active connections hit 0 (or ungracefully when the term_timeout is reached).
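
A minimal sketch of that pattern in the sidecar's entrypoint (the instance name, port, and 60s sleep below are placeholders):

/cloud_sql_proxy -instances=my-project:us-central1:my-instance=tcp:5432 -term_timeout=3600s &
PROXY_PID=$!
sleep 60                 # arbitrary window for the main container to start and connect
kill -TERM "$PROXY_PID"  # proxy now drains and exits once open connections reach 0
wait "$PROXY_PID"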

Additionally, this method was suggested as a workaround in the original issue. It essentially uses a sentinel file to tell a script when to kill the proxy.

Both seem like they are equal (or better?) options to an endpoint until k8s adds better support.

kurtisvg avatar Jul 07 '21 21:07 kurtisvg

@villesau Does the term_timeout flag work for your use case?

enocom avatar Jul 13 '21 19:07 enocom

term_timeout should work for our case (similar to @villesau's), but I feel like the /shutdown endpoint is safer: I can guarantee it whether the job takes 5 seconds or 5 hours. term_timeout forces me to set an arbitrary time to sleep, send a SIGTERM, and set an arbitrary timeout.

There's some prior art here with Linkerd: https://github.com/linkerd/linkerd2-proxy/pull/811

The correct answer is definitely Kubernetes handling this natively. In every other case it's a workaround.

arbourd avatar Oct 19 '21 20:10 arbourd

I can guarantee it whether the job takes 5 seconds or 5 hours. term_timeout forces me to set an arbitrary time to sleep

@arbourd Can you clarify this? To be clear, term_timeout sets a MAXIMUM amount of time the proxy waits before shutting down. It'll shut down early if all the connections are closed.

kurtisvg avatar Oct 20 '21 16:10 kurtisvg

Understood. Some of our jobs may run for many hours, while others (like migrations) complete reasonably quickly. The migration task is far more common, and it feels strange to set a 5-hour term_timeout. The workaround (trapping EXIT) is what we're using at the moment.
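
For reference, a rough sketch of what that trap-based workaround can look like (the migration command below is a placeholder):

/cloud_sql_proxy -instances=my-project:us-central1:my-instance=tcp:5432 &
PROXY_PID=$!
trap 'kill -TERM "$PROXY_PID"; wait "$PROXY_PID"' EXIT  # stop the proxy whenever this shell exits
./run-migrations.sh                                     # placeholder for the actual job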

arbourd avatar Oct 20 '21 17:10 arbourd

I just ran into this while trying to use the proxy for Postgres migrations. What do you all think about something like an -idle-timeout x flag that would trigger a normal shutdown after x seconds if no new connections come in once the last connection finishes? That could be useful for other types of jobs too, and maybe it would be a good compromise while the community figures out how to manage container dependencies.

ettancos avatar Feb 11 '22 11:02 ettancos

If we were to support this, I think following linkerd would be the way to go.

There's some prior art here with Linkerd: https://github.com/linkerd/linkerd2-proxy/pull/811

https://github.com/GoogleCloudPlatform/cloudsql-proxy/issues/828#issuecomment-947063317

enocom avatar Feb 11 '22 17:02 enocom

Yes, I would also love to see this implemented. My use case is documented in #1115, but briefly, it is to use the proxy with Apache Airflow when deployed on Kubernetes using their official Helm chart, where the sidecar needs to be killed semi-manually after the main app has finished its work with the database.

yehoshuadimarsky avatar Feb 25 '22 04:02 yehoshuadimarsky

Another option would be to follow Istio with its /quitquitquit endpoint.

enocom avatar Mar 14 '22 19:03 enocom

Having an explicit shutdown endpoint would definitely help (/quitquitquit, following the Istio example, would I think be a great option). Additionally:

It'll shut down early if all the connections are closed.

This is not the behavior I am currently seeing with the proxy sidecar integration in the pega-helm-charts installer subchart. After the installer container completes, the cloud-sql-proxy logs show just as many "New connection for..." entries as "Client closed local connection on 127.0.0.1:5432" entries, yet the sql-proxy container keeps running and refreshing the ephemeral certificate even though the main container has shut down cleanly (closing all its connections).

zitikay avatar Mar 15 '22 16:03 zitikay

To be clear, setting term_timeout makes the proxy wait up to that long after it receives a SIGTERM.

A workaround proposed elsewhere (not finding the comment at the moment) is to start the proxy with the alpine or buster container like so:

timeout 60 /cloud_sql_proxy --instances=<...> --term_timeout=3600s

The idea here is that the timeout command sends a SIGTERM a set number of seconds after starting (60s here), and the proxy then waits up to the term_timeout duration for all connections to close before shutting down (3600s here, or longer). If all connections close before that duration has elapsed, the proxy shuts down early.
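
As a usage note, assuming GNU coreutils' timeout (as in the buster image): SIGTERM is the default signal, but it can also be spelled out explicitly:

timeout --signal=TERM 60 /cloud_sql_proxy --instances=<...> --term_timeout=3600s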

enocom avatar Mar 15 '22 16:03 enocom

Ah, thanks @enocom. It seems I was missing the part about actually sending the proxy a SIGTERM. Using a timeout wrapper on the command, or having a shared volume that our installer container can write to so the proxy is notified to shut down when a given file appears, both seem like viable ways of initiating this SIGTERM. We could still set a term_timeout, but that may not be needed since all connections are being closed.

zitikay avatar Mar 15 '22 17:03 zitikay

For completeness, here's a blog post that describes how to use a PID file to accomplish the same.

In plain bash the approach translates to:

$ # Start the proxy in the background and record its PID.
$ /cloud_sql_proxy -instances=my-project:us-central1:my-instance=tcp:3306 \
  -credential_file=/secrets/cloudsql/credentials.json & CHILD_PID=$!
$ # Watch for the sentinel file written when the main container finishes, then stop the proxy.
$ (while true; do if [[ -f "/tmp/pod/terminated" ]]; then
    kill $CHILD_PID; echo "Killed $CHILD_PID because the main container terminated."; break
  fi; sleep 1; done) &
$ wait $CHILD_PID
$ if [[ -f "/tmp/pod/terminated" ]]; then echo "Job completed. Exiting..."; exit 0; fi
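
The main container's half of this handshake is simply writing the sentinel file to a volume shared with the proxy container once its work is done (the job command is a placeholder):

./run-migrations.sh; touch /tmp/pod/terminated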

enocom avatar Mar 16 '22 20:03 enocom

I think an endpoint like /shutdown, or watching for a file to be written, is likely the only kind of solution that covers most of the different ways the SQL proxy could be used. The endpoint method seems easier since it wouldn't require a shared volume.

Relying on timeouts could result in flaky behavior because it assumes that, past some point in the future, the proxy can exit once connections drop to 0. I believe that comes from assuming the other container only ever uses one connection, and that it is reasonable to predict how long database operations will take or how long the other containers will take to start. IMO those are not reasonable assumptions; therefore, I think it needs to be stated that using a timeout to enter the exiting state is not an acceptable solution for this issue. However, using a timeout for connections to close after the proxy has been explicitly told to shut down seems perfectly fine.

kriswuollett avatar Apr 25 '22 15:04 kriswuollett

Adding a suggestion from @taylorchu from here:

We use in-pod supervisor (https://github.com/taylorchu/wait-for) to solve the sidecar container issue. It is simple and can be used even if you don't use k8s.

enocom avatar Nov 02 '22 18:11 enocom

Random thought: we could add support for SIGQUIT or similar to have the proxy shut down. That might sidestep the problems of securing an HTTP endpoint.

enocom avatar Nov 08 '22 18:11 enocom

The current plan is to add a localhost-only /quitquitquit endpoint. If there are any open connections, the proxy will exit with code 143 (the conventional SIGTERM code), and with 0 if the shutdown is clean.
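
Once that lands, a job container in the same pod could finish up with something like the following; the port (9091) and exact path here are assumptions based on the plan above, not a final interface:

./run-migrations.sh
curl -X POST localhost:9091/quitquitquit  # ask the proxy sidecar to exit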

enocom avatar Nov 09 '22 18:11 enocom

I find the quitquitquit endpoint approach not very scalable, and it could easily create race conditions.

For example,

  1. You have a pod with 1 keystone container and 2 sidecars. Both sidecars would need to expose a quitquitquit-like endpoint (they might even use different HTTP status codes and HTTP paths), and the keystone app needs to adapt and tightly couple to these sidecars.
  2. You have a pod with 2 keystone containers that depend on 1 sidecar. Both keystone containers could send to the quitquitquit endpoint. If the sidecar exits because of quitquitquit, the other keystone will need to watch for a broken connection and then decide to exit.

Now imagine the count is not 2 but >= 3, and you have many apps like this. All this additional, complex code gets added because we run things on k8s; the application is bloated because of the "environment" you run it on.

I think this issue is more of a k8s problem and therefore out of scope.

taylorchu avatar Nov 10 '22 00:11 taylorchu

It's definitely a K8s problem. The idea here is to ease the pain for people running jobs or init containers. And we'll make it opt-in with a CLI flag, so by default the behavior will be off.

What's a keystone container? I don't know that phrase and don't find much when searching for it.

enocom avatar Nov 10 '22 01:11 enocom

https://github.com/kubernetes/enhancements/issues/2872

taylorchu avatar Nov 10 '22 02:11 taylorchu

We'll be releasing this in the next version of v2, currently scheduled for February 21.

enocom avatar Jan 31 '23 21:01 enocom

@enocom Question about this: will you also provide a way to have the proxy ignore SIGTERM, or is that already possible?

Consider the following case:

  1. Pod scales down
  2. App AND proxy get a SIGTERM (because they are in the same pod)
  3. App's cleanup routine requires it to make new connections to the DB after getting a SIGTERM
  4. The proxy will refuse new connections after getting a SIGTERM
  5. This race condition shuts down the proxy before the app can access the DB

It doesn't matter if there is an endpoint the app can hit if the proxy is getting a SIGTERM from k8s. There needs to be a way to keep the proxy running.

I could use the entrypoint hacks I've seen around GitHub to catch SIGTERM in the entrypoint and ignore it, but it would be nice to support this in the proxy process itself! Maybe another endpoint the app could call to cancel a shutdown? Or will the quitquitquit endpoint support setting a hard timeout so the app could force the proxy to stay alive for X seconds after getting a SIGTERM?

red8888 avatar Feb 08 '23 19:02 red8888

Take a look at the --max-sigterm-delay flag which should handle the use case you're describing.
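
For reference, a sketch of passing that flag on the v2 CLI (the instance connection name and duration are placeholders):

./cloud-sql-proxy --max-sigterm-delay=30s my-project:us-central1:my-instance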

enocom avatar Feb 08 '23 19:02 enocom

--max-sigterm-delay

So I guess I was misunderstanding how that worked. I thought using that flag meant the proxy would stay alive only while there are pending connections. I'm actually using that switch currently. You're saying that when the proxy gets a SIGTERM and --max-sigterm-delay is in effect, it will stay alive for the delay I set even if there are no open connections?

So am I wrong in thinking "The proxy will refuse new connections after getting a SIGTERM" if --max-sigterm-delay is set?

@enocom I'm actually using v1 and term_timeout; maybe the behavior changed in v2?

red8888 avatar Feb 08 '23 19:02 red8888

If you have open connections, the proxy will stay alive until the duration elapses. Otherwise it will shut down early. In either case, let's pull this discussion into a new feature request and we can explore whether there's something that needs to be changed here or elsewhere.

enocom avatar Feb 08 '23 19:02 enocom

@enocom OK I created a feature request explaining my use case. Thanks!

https://github.com/GoogleCloudPlatform/cloud-sql-proxy/issues/1640

red8888 avatar Feb 08 '23 20:02 red8888

In case someone is struggling with the same problem, I've come up with something that works for me.

command: [ "sh", "-c", "python manage.py migrate && pkill -2 pgbouncer && curl -X POST localhost:9091/quitquitquit" ]
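
Note: this assumes the proxy sidecar was started with the quitquitquit endpoint enabled (it is opt-in) and that it exposes that endpoint on port 9091; adjust the port if yours differs.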

dastanko avatar Mar 30 '23 18:03 dastanko

The -term_timeout flag doesn't work, and I'm going to have to insist on the fact that it does not work. It works locally on my Mac, yes, but it does not work in the gcr.io/cloudsql-docker/gce-proxy container, where it should absolutely work.

bdt-cd-bot avatar Apr 12 '23 20:04 bdt-cd-bot

Let's talk through the details here: https://github.com/GoogleCloudPlatform/cloud-sql-proxy/issues/1742.

enocom avatar Apr 13 '23 15:04 enocom