
Any way to gracefully exit a Tornado application?

Open 0x11-dev opened this issue 8 years ago • 19 comments

There are many gists showing how to gracefully exit a Tornado application, like this:

import time
from tornado.ioloop import IOLoop

http_server.stop()  # stop accepting new connections
def try_stop_ioloop():
    io_loop = IOLoop.instance()
    # keep polling until nothing is scheduled, then stop the loop
    if io_loop._callbacks or io_loop._timeouts:
        io_loop.add_timeout(time.time() + 1, try_stop_ioloop)
    else:
        io_loop.stop()

try_stop_ioloop()

Is it safe to test io_loop._timeouts or io_loop._callbacks? Or can _timeouts be ignored?

0x11-dev avatar Aug 08 '16 06:08 0x11-dev

Here is one example using that method: https://gist.github.com/nicky-zs/6304878. Sometimes main_ioloop._timeouts always has a new _Timeout instance in it, so we have to wait a long time for the _timeouts queue to empty.

0x11-dev avatar Aug 08 '16 06:08 0x11-dev

I came here today to ask the exact same question. After much reading on SO I came up with this example:

import logging
import signal
import time
import tornado.httpserver
import tornado.ioloop
import tornado.web
import tornado.options
from tornado import gen

logger = logging.getLogger()


signal_received = False


class BlockingHandler(tornado.web.RequestHandler):

    def get(self):
        logger.debug("Starting sleep")
        time.sleep(5)
        logger.debug("Sleep done")
        self.finish('block')


class AsyncHandler(tornado.web.RequestHandler):

    @gen.coroutine
    def get(self):
        logger.debug("Starting range 5")
        for i in range(5):
            logger.debug(i)
            yield gen.sleep(1)
        logger.debug("Range done")
        self.finish('async')


def start_server():
    urls = [
        (r'/', BlockingHandler),
        (r'/async', AsyncHandler),
    ]
    application = tornado.web.Application(urls)
    http_server = tornado.httpserver.HTTPServer(application)
    http_server.listen(8888)
    ioloop = tornado.ioloop.IOLoop.instance()

    def register_signal(sig, frame):
        global signal_received
        logger.info("%s received, stopping server" % sig)
        http_server.stop()  # no more requests are accepted
        signal_received = True

    def stop_on_signal():
        global signal_received
        if signal_received and not ioloop._callbacks:
            ioloop.stop()
            logger.info("IOLoop stopped")

    tornado.ioloop.PeriodicCallback(stop_on_signal, 1000).start()
    signal.signal(signal.SIGTERM, register_signal)
    logger.info("Starting server")
    ioloop.start()


if __name__ == '__main__':
    tornado.options.parse_command_line()
    start_server()

It has two problems:

  1. BlockingHandler works as expected (finishes the result and gives it to the client), but raises an Exception:
[E 160808 14:46:12 ioloop:633] Exception in callback None
    Traceback (most recent call last):
      File "/Users/margus/src/git/tornado-sigterm/venv/lib/python3.5/site-packages/tornado/ioloop.py", line 887, in start
        handler_func(fd_obj, events)
      File "/Users/margus/src/git/tornado-sigterm/venv/lib/python3.5/site-packages/tornado/stack_context.py", line 275, in null_wrapper
        return fn(*args, **kwargs)
      File "/Users/margus/src/git/tornado-sigterm/venv/lib/python3.5/site-packages/tornado/netutil.py", line 260, in accept_handler
        connection, address = sock.accept()
      File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/socket.py", line 195, in accept
        fd, addr = self._accept()
    OSError: [Errno 9] Bad file descriptor
[I 160808 14:46:12 application:58] IOLoop stopped

...which I think means that the tornado.access logger is trying to write after the IOLoop has stopped and all the sockets to write to are gone. Not sure whether this is a feature or a bug in the logging.

The second problem is with AsyncHandler: the request is killed and the client receives a "Remote Disconnected" error.

It would be very nice to get an official right way of approaching this. I'll now go and test https://gist.github.com/nicky-zs/6304878 in the real world.

kerma avatar Aug 08 '16 11:08 kerma

You definitely shouldn't be looking at IOLoop._callbacks or IOLoop._timeouts. In addition to being private variables, there's a good chance that they will never be empty, for reasons that are unimportant (if you use tornado.curl_httpclient there is always an active timeout). Instead of asking the IOLoop for all activity, you need to decide what activity you care about and exit the server when all of that activity is done.

Or you can do what I do and just stop the IOLoop 5 seconds after the stop is requested. This will be enough for any regular request to finish, and if it's taking longer than 5 seconds you probably want to let it fail anyway. It's not worth making a more precise measurement to stop in less than 5 seconds.
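
A minimal sketch of that timeout approach, using the Tornado 4.x-era APIs that appear earlier in this thread (http_server is assumed to be in scope):

import signal
import tornado.ioloop

def on_sigterm(sig, frame):
    io_loop = tornado.ioloop.IOLoop.current()

    def stop_in_five():
        http_server.stop()  # stop accepting new connections immediately
        # stop the loop unconditionally 5 seconds later; anything still
        # running at that point is allowed to fail
        io_loop.call_later(5, io_loop.stop)

    # signal handlers must not touch the IOLoop directly
    io_loop.add_callback_from_signal(stop_in_five)

signal.signal(signal.SIGTERM, on_sigterm)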

bdarnell avatar Aug 09 '16 02:08 bdarnell

This seems like a question that is asked frequently enough that an example solution should be included in the user's guide section of the docs. I would consider adding something to either the Running and deploying section or the Structure section. Ben, which do you think makes the most sense?

mivade avatar Aug 12 '16 20:08 mivade

Setting a timeout to stop the IOLoop is not safe. Is there an approach available to test whether all requests have been processed?

0x11-dev avatar Aug 13 '16 01:08 0x11-dev

Yeah, I think adding something to "Running and deploying" makes sense. There are two tricky things in writing this up: A) The right way to do this depends heavily on your deployment environment (how your load balancers do health checks and/or retries), and B) Convincing people who want to wait for all operations to finish that it's better to just use a timeout.

On that latter point: There must be some maximum time that you're willing to wait for an operation to finish, even if it's several minutes. At that point you have to be willing to cut them off or they'll keep consuming resources indefinitely. Once you've decided how long you're willing to wait for the last client operation to finish, you've already committed the resources to keep the old server processes around for that long, so why not leave them running for that duration in every case? Trying to track the number of operations remaining and stopping the server precisely when the count reaches zero is not worth the trouble IMHO.

bdarnell avatar Aug 19 '16 02:08 bdarnell

You are right that there are plenty of caveats to how to really do a graceful shutdown depending on the deployment strategy. But for the guide, maybe it would suffice to show a simple example case (such as what you mentioned where you stop the IOLoop 5 seconds later to give requests time to finish) while explaining that this is not the only way, but that it at least works for simple setups. I'll take a stab at this in the next few days and maybe we can try to iterate through a good solution via a pull request.

mivade avatar Aug 20 '16 15:08 mivade

It would be nice if we could specify what Tornado should do on a SIGTERM. For my kind of setup it would be useful to tell it to stop accepting requests after time X and to quit as soon as all active requests are gone; additionally, it would be good to be able to check this status (like .inShutdown() == true).

This would make it quite easy to gracefully stop it inside Kubernetes, by having a health handler which starts to report itself as "faulty" as soon as .inShutdown() returns true. With this I can ensure that the service is taken out of load balancing without killing any active connections (as long as they don't take too long) and without losing requests because Kubernetes does not take it out of the LB instantaneously (by design, so that a single faulty response doesn't kick the service out). A sketch of this idea follows below.
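
A minimal sketch of that idea (nothing like inShutdown() exists in Tornado itself; the shutting_down flag and /healthz route here are hypothetical):

import tornado.web

shutting_down = False  # flipped to True by the SIGTERM handler

class HealthHandler(tornado.web.RequestHandler):
    def get(self):
        if shutting_down:
            # report "faulty" so the load balancer drains this instance
            self.set_status(503)
            self.finish("shutting down")
        else:
            self.finish("ok")

# routed as (r'/healthz', HealthHandler) in the Application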

savar avatar Apr 28 '17 07:04 savar

We use Ubuntu's upstart to run our services. While terminating the service, upstart first sends SIGTERM, waits for the service to stop within the configured amount of time (by default the timeout is 5s), and if the service fails to stop, sends SIGKILL.

I expect this behavior to be consistent across most *nix systems. It would be nice if Tornado could support these two signals by default, along with a custom set of signals.

piyush-kansal avatar May 16 '17 00:05 piyush-kansal

Tornado works correctly under the upstart configuration you described - it dies by default when it gets SIGTERM. What would you like it to do differently? This always seems to be application- and deployment-dependent (especially with respect to how your load balancers do health checks and retries), but whatever behavior you want can be obtained by defining your own signal handlers.

bdarnell avatar May 21 '17 00:05 bdarnell

@bdarnell Thanks for your quick response. I think the reason it didn't work well for us is because we are using an older version of Tornado (2.1.1). Once I implemented my own signal handler for SIGTERM, things started working.

piyush-kansal avatar May 23 '17 18:05 piyush-kansal

How do I shut down Tornado 5.1 gracefully?

90h avatar Jul 31 '18 10:07 90h

Here's how I do it (currently with tornado-4.5.3, but I expect it will work the same with tornado-5.1):

import signal
from tornado import gen, ioloop

async def shutdown():
    periodic_task.stop()           # stop recurring background work
    http_server.stop()             # no more requests are accepted
    for client in ws_clients.values():
        client['handler'].close()  # close open websocket connections
    await gen.sleep(1)             # give in-flight requests a moment to finish
    ioloop.IOLoop.current().stop()

def exit_handler(sig, frame):
    # signal handlers must not touch the IOLoop directly
    ioloop.IOLoop.instance().add_callback_from_signal(shutdown)

...
if __name__ == '__main__':
    signal.signal(signal.SIGTERM, exit_handler)
    signal.signal(signal.SIGINT,  exit_handler)
    ...
(instead of just gen.sleep(1) I actually have a global active-request count and a tornado.locks.Event() that is set after the last request has finished, but that's more complicated and more code; a sketch of that idea follows below)
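
A minimal sketch of that counting approach, assuming all handlers inherit from a (hypothetical) CountingHandler base class and that http_server is a module-level global as in the snippet above:

import datetime
import tornado.ioloop
import tornado.locks
import tornado.util
import tornado.web

active_requests = 0                       # requests currently in flight
no_more_requests = tornado.locks.Event()  # set when the last one finishes
shutting_down = False

class CountingHandler(tornado.web.RequestHandler):
    def prepare(self):
        global active_requests
        active_requests += 1

    def on_finish(self):
        global active_requests
        active_requests -= 1
        if shutting_down and active_requests == 0:
            no_more_requests.set()

async def shutdown():
    global shutting_down
    shutting_down = True
    http_server.stop()  # no new connections
    if active_requests == 0:
        no_more_requests.set()
    try:
        # wait for in-flight requests, but never longer than 5 seconds
        await no_more_requests.wait(datetime.timedelta(seconds=5))
    except tornado.util.TimeoutError:
        pass
    tornado.ioloop.IOLoop.current().stop()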

ploxiln avatar Jul 31 '18 15:07 ploxiln

Yeah, something like @ploxiln's approach is what I'd do. I don't think there have been any noteworthy changes between Tornado 4 and 5 here. It all depends on what "gracefully" means for your application. (One missing piece in the snippet above is that you probably want to signal to your load balancer somehow to stop the incoming traffic).

bdarnell avatar Jul 31 '18 23:07 bdarnell

@ploxiln Hi mate, would you mind describing what ws_clients is in this case?

baratrion avatar Dec 26 '18 14:12 baratrion

Sure: ws_clients in my case was a dict of dicts, with each entry containing a 'handler' reference to an instance of WebSocketHandler (hence ws_clients.values() in the snippet above). In short-hand:

ws_clients = {
    conn_id: {
        'handler': WebSocketHandler(...),
        'tags': ...
    },
    ...
}

I had a WebSocketHandler which, for each websocket client that connected, would add "self" to this collection. On close, it would find and remove "self". On shutdown, all still-open websocket connections could be cleanly closed this way (in addition to plain HTTP requests getting a second to finish up, while no new plain HTTP or websocket requests are possible).
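
A minimal sketch of that registration pattern (keying the dict by id(self) is my assumption, not necessarily ploxiln's exact bookkeeping):

import tornado.websocket

ws_clients = {}  # conn_id -> {'handler': ..., 'tags': ...}

class TrackedWebSocketHandler(tornado.websocket.WebSocketHandler):
    def open(self):
        # register this connection so shutdown() can close it later
        ws_clients[id(self)] = {'handler': self, 'tags': None}

    def on_close(self):
        # deregister when the client disconnects on its own
        ws_clients.pop(id(self), None)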

ploxiln avatar Dec 30 '18 18:12 ploxiln

Linking to this gist in case anyone comes across this issue: https://gist.github.com/wonderbeyond/d38cd85243befe863cdde54b84505784

This works for me.

ewhauser avatar Mar 08 '20 23:03 ewhauser

This is what I'm doing on Firenado:

https://github.com/candango/firenado/blob/develop/firenado/launcher.py#L159-L324

piraz avatar Mar 09 '20 01:03 piraz

We managed to gracefully shut down Tornado by implementing the following steps:

  1. intercept SIGINT/SIGTERM signals
  2. manually stop the HTTPServer right away (no more requests are accepted)
  3. manually stop the IOLoop once all pending requests have terminated (see the sketch after the link below)

See https://github.com/svaponi/tornado-graceful-shutdown/blob/main/server.py
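
For reference, a minimal sketch of those three steps on a current asyncio-based Tornado (6.x); the handler, port, and five-second grace period are placeholders:

import asyncio
import signal
import tornado.web

class MainHandler(tornado.web.RequestHandler):
    async def get(self):
        await asyncio.sleep(1)  # simulate a slow request
        self.write("hello")

async def main():
    app = tornado.web.Application([(r'/', MainHandler)])
    server = app.listen(8888)
    loop = asyncio.get_running_loop()
    shutdown_requested = asyncio.Event()

    # step 1: intercept SIGINT/SIGTERM and hop back onto the event loop
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, lambda *_: loop.call_soon_threadsafe(shutdown_requested.set))

    await shutdown_requested.wait()
    server.stop()  # step 2: no more connections are accepted
    # step 3: wait for pending requests; note that idle keep-alive
    # connections also count, so a timeout is prudent
    try:
        await asyncio.wait_for(server.close_all_connections(), timeout=5)
    except asyncio.TimeoutError:
        pass

if __name__ == '__main__':
    asyncio.run(main())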

svaponi avatar Oct 22 '23 00:10 svaponi