Feature Request: Vtgate graceful shutdown
Feature Description
Currently, a new release rollout is somewhat painful in that it creates connection errors when we restart the Vtgate service on our several hundred Vtgate hosts. To minimize this pain/disruption, we currently depool a handful of Vtgate hosts at a time from their load balancer so they don't get any new connections, and wait a few minutes before restarting the Vtgate service. Even with these measures, which make the process quite slow, we still experience some errors.
It would be very helpful for the Vtgate service to have a sort of "graceful shutdown" in which the service would no longer accept new connections but instead drain all existing connections until there are no active ones. This would, in essence, make client connection errors disappear and would allow us to deploy new releases faster and without disruption.
Use Case(s)
Version upgrades, which require a restart of all vtgates, would be much smoother if we could stop the service gracefully and without end-user impact.
Could you detail the type of errors you see?
Sorry, I had not seen this comment. The errors we see are "mysql server has gone away".
If you're using a gRPC-based driver, like the Go driver and, I believe, the Vitess JDBC driver, it should silently reconnect to another vtgate.
We've also run into this, and believe that we have a fix ready.
The current vtgate shutdown procedure looks like this:
- vtgate will execute OnTerm callbacks:
  - The mysql listener is shut down, so no new connections to vtgate can be opened. When a load balancer is placed in front of vtgate, new connections will be redirected to other vtgate instances that are not currently shutting down.
- vtgate waits for up to --onterm_timeout seconds for connections to vtgate to become non-busy. Unfortunately, during a shutdown, nothing prevents connections from going back from non-busy to busy (new queries that come in via already established connections can start executing).
- After --onterm_timeout has passed, vtgate will execute OnClose callbacks:
  - The gateway is shut down (grpc connections to tablets are closed).
  - All incoming MySQL connections are closed forcefully.
What we see is that a lot of queries are still executing when the --onterm_timeout has passed, and those queries often fail with vttablet: Connection Closed errors because the grpc tablet connections are closed while the queries are executing.
Our proposed fix is to change vtgate so that, during shutdown, it declines new queries on connections that are not currently in a transaction with a 1053 - Server is shutting down error, and then closes those connections. This way, once the shutdown process has started, actively used connections are closed fairly fast, so by the time the --onterm_timeout has passed, most connections outside of transactions have already been closed.
For connections that are inside of a transaction, there is a grace period of --onterm_timeout seconds to finish their transactions - otherwise the connection will be closed forcefully (as is the case currently).
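The rule described above can be sketched in Go as follows. This is an illustration of the decision only, not the code from the actual change: the conn type, handleQuery, and the errServerShutdown constant are hypothetical names, and the forced close after the grace period is left out.

```go
package main

import "fmt"

// errServerShutdown is the MySQL error code mentioned above (1053).
const errServerShutdown = 1053

// conn is a hypothetical stand-in for a client connection to vtgate.
type conn struct {
	inTransaction bool
	closed        bool
}

// handleQuery is a hypothetical helper that applies the proposed rule:
// during shutdown, a query arriving on a connection outside a transaction
// is declined with error 1053 and the connection is closed. Queries on
// connections inside a transaction still run, so the transaction can
// finish within the grace period.
func handleQuery(c *conn, shuttingDown bool, run func() error) error {
	if shuttingDown && !c.inTransaction {
		c.closed = true
		return fmt.Errorf("%d - Server is shutting down", errServerShutdown)
	}
	return run()
}
```

Because the query is declined before it starts executing, a client that sees this error can reconnect elsewhere and reissue the query.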
In our testing, the above change brings the non-shutdown errors seen by the application down to 0, and allows us to restart vtgate processes without any application impact.
In our Rails applications, we're using a customized version of the Trilogy adapter for connecting to MySQL and vtgate. The adapter transparently reconnects MySQL connections (outside of transactions) that encounter a 1053 - Server is shutting down error. Other applications can implement similar reconnect / retry logic when encountering this error. We employ this logic for both "vanilla" MySQL as well as vtgate.
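For applications that are not using the adapter, similar reconnect / retry logic might look like the following Go sketch. It is not the Trilogy adapter's code: serverError, queryWithReconnect, and the reconnect and query callbacks are hypothetical, and a real driver will expose the error code in its own way.

```go
package main

import "errors"

// errServerShutdown is the MySQL error code for a server that is shutting down.
const errServerShutdown = 1053

// serverError is a hypothetical error type carrying a MySQL error code.
type serverError struct {
	Code    int
	Message string
}

func (e *serverError) Error() string { return e.Message }

// queryWithReconnect runs query once. If it fails with error 1053 and the
// caller is not inside a transaction, it reconnects and retries the query
// a single time. Any other error, or a 1053 inside a transaction, is
// returned to the caller unchanged.
func queryWithReconnect(inTransaction bool, reconnect, query func() error) error {
	err := query()
	var se *serverError
	if !errors.As(err, &se) || se.Code != errServerShutdown || inTransaction {
		return err
	}
	if rerr := reconnect(); rerr != nil {
		return rerr
	}
	return query()
}
```

The retry is limited to one attempt and skipped inside transactions, since a transaction's earlier statements are lost with the connection and cannot be replayed transparently.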
@arthurschreiber / @davidpiegza really happy to see your proposed fix! We were able to prevent new connections from being routed to vtgates that were shutting down by sleeping for several seconds in a kubernetes preStop hook as described here: https://learnk8s.io/graceful-shutdown#deleting-a-pod, but we didn't have a solution for existing connections that continued to issue queries. I think returning a clear error to the client makes a lot of sense.
The PR is now ready for review: https://github.com/vitessio/vitess/pull/14219
Do we have anything pending on this feature request?