Confusing error handling for KeyboardInterrupt
I reported this issue on the Python tracker but got no answer.
If you trigger KeyboardInterrupt in a coroutine and catch it, the program terminates cleanly:
```python
import asyncio

async def bar():
    raise KeyboardInterrupt

loop = asyncio.get_event_loop()
try:
    loop.run_until_complete(bar())
except KeyboardInterrupt:
    print("It's ok")
finally:
    loop.stop()
    loop.close()
```
This outputs:

```
It's ok
```
However, if you wrap the coroutine in a Task, you get mixed behavior:
```python
try:
    task = asyncio.ensure_future(bar())
    loop.run_until_complete(task)
except KeyboardInterrupt:
    print("It's ok")
```
This outputs:

```
It's ok
Task exception was never retrieved
future: <Task finished coro=<bar() done, defined at ki_bug.py:4> exception=KeyboardInterrupt()>
Traceback (most recent call last):
  File "ki_bug.py", line 10, in <module>
    loop.run_until_complete(main_future)
  File "/usr/lib/python3.5/asyncio/base_events.py", line 325, in run_until_complete
    self.run_forever()
  File "/usr/lib/python3.5/asyncio/base_events.py", line 295, in run_forever
    self._run_once()
  File "/usr/lib/python3.5/asyncio/base_events.py", line 1258, in _run_once
    handle._run()
  File "/usr/lib/python3.5/asyncio/events.py", line 125, in _run
    self._callback(*self._args)
  File "/usr/lib/python3.5/asyncio/tasks.py", line 239, in _step
    result = coro.send(None)
  File "ki_bug.py", line 5, in bar
    raise KeyboardInterrupt
KeyboardInterrupt
```
We get several contradictory behaviors: the KeyboardInterrupt is raised and captured by the future (since you can call `task.exception()` to suppress the stack trace), but it is also caught by the `except` clause and the program is allowed to continue; yet the stack trace is still displayed, and the program's return code ends up being 0.
It's very confusing.
I believe this is because exceptions don't break out of the event loop, but KeyboardInterrupt receives some kind of special treatment.
There is a bit of an explanation in `BaseEventLoop.run_until_complete`:

```python
try:
    self.run_forever()
except:
    if new_task and future.done() and not future.cancelled():
        # The coroutine raised a BaseException. Consume the exception
        # to not log a warning, the caller doesn't have access to the
        # local task.
        future.exception()
    raise
```
If the exception is an instance of BaseException but not of Exception (such as KeyboardInterrupt), it bubbles up and reaches this try/except. If the task has been created inside `run_until_complete` (`new_task` is True), then the exception is consumed, since there is no way to access the task before it is garbage collected.
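The asymmetry is easy to reproduce: an ordinary Exception is merely stored on the task, while a BaseException both gets stored and breaks out of the loop. A small sketch (using pre-created tasks, so `new_task` is False and nothing gets consumed):

```python
import asyncio

async def fails(exc_type):
    raise exc_type()

loop = asyncio.new_event_loop()
results = []

# Case 1: an ordinary Exception is stored on the task and only re-raised
# when run_until_complete() retrieves the task's result; the loop
# machinery itself is never disturbed.
task = loop.create_task(fails(ValueError))
try:
    loop.run_until_complete(task)
except ValueError:
    results.append("ValueError via task result")

# Case 2: a BaseException such as KeyboardInterrupt is *both* stored on
# the task and re-raised by the task's step function, so it breaks out
# of run_forever() before the result is ever requested.
task2 = loop.create_task(fails(KeyboardInterrupt))
try:
    loop.run_until_complete(task2)
except KeyboardInterrupt:
    results.append("KeyboardInterrupt broke out of the loop")

# Retrieve the stored exception to avoid the
# "Task exception was never retrieved" warning at GC time.
results.append(type(task2.exception()).__name__)
loop.close()
print(results)
```

So in case 2 the same KeyboardInterrupt is observable twice: once out of `run_until_complete`, once on the finished task.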
The problem you reported could easily be fixed by removing `new_task` from the test. However, the raised BaseException might differ from the exception set on the task, so I guess the check is more of a feature than a bug. Another way to address the issue would be to compare the exceptions:
```python
try:
    self.run_forever()
except BaseException as exc:
    # Is the `new_task` clause still necessary?
    if (future._exception is exc
            or new_task and future.done() and not future.cancelled()):
        # The coroutine raised a BaseException. Consume the exception
        # to not log a warning, the caller doesn't have access to the
        # local task.
        future.exception()
    raise
```
However, this part of the code is very central, so I guess any change here is a pretty big deal.
Thank you.
We manage to live with it; it's just that debugging will confuse the hell out of newcomers. Asyncio is pretty hard to understand, and even harder to debug, so I believe we should do everything we can to make it easier. I have been working on an asyncio project for some months now, and I must say there is no way an average programmer would have the patience to dig through all the debugging sessions we went through.
It has been suggested that asyncio should properly catch and deliver all exceptions, even BaseException. I am beginning to think that we should really do that! There's even an old pull request to do it: https://github.com/python/asyncio/pull/305
It probably needs updating, and I worry that it would break other things (that's why the PR hasn't been merged yet). But it still sounds like it would generally be an improvement. @asvetlov What do you think? Would aiohttp need changes if we did this?
@gvanrossum But doesn't it make sense for a KeyboardInterrupt to bubble up to loop.run_forever and stop the loop regardless of where the exception has been raised?
I'm thinking about how PR #305 might break servers that rely on Ctrl+C to stop their execution. Consider the TCP echo server using streams from the asyncio documentation. If a KeyboardInterrupt is raised while waiting for a client to connect (in `loop._run_once`), execution stops as expected. But if the KeyboardInterrupt is raised while handling a client request (in the `handle_echo` coroutine), the loop will not stop (and a `Task exception was never retrieved` message will be logged). This inconsistent behavior seems like a pretty serious issue.
That's the kind of thing we need to solve before accepting the PR. Do you have a suggestion for a fix?
--Guido (mobile)
@gvanrossum
I like @haypo's idea of setting a handler for SIGINT. A keyboard interrupt is an asynchronous event, and it makes a lot of sense to let the event loop deal with it. It also enforces the idea that the code between two await statements is safe and will never be interrupted.
The default SIGINT handler could stop the loop, and make sure that a KeyboardInterrupt exception is raised in loop.run_forever so the application can react and let the running coroutines terminate properly.
You mentioned the fact that a SIGINT won't be able to interrupt a tight CPU loop, but I don't really see that as a problem.
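A rough sketch of the proposed behaviour, using today's `loop.add_signal_handler()` (Unix-only): the handler lets the loop deal with SIGINT by cancelling the main task instead of raising KeyboardInterrupt at an arbitrary point. `worker` and `on_sigint` are made-up names, and the SIGINT is self-delivered with `os.kill` so the example is reproducible:

```python
import asyncio
import os
import signal

interrupted = []

async def worker():
    try:
        while True:
            # Code between two awaits is never interrupted mid-statement.
            await asyncio.sleep(0.01)
    except asyncio.CancelledError:
        interrupted.append("worker finished cleanly")
        raise

loop = asyncio.new_event_loop()
task = loop.create_task(worker())

def on_sigint():
    # The loop handles SIGINT at a clean breaking point between callbacks.
    interrupted.append("SIGINT handled by the loop")
    task.cancel()

loop.add_signal_handler(signal.SIGINT, on_sigint)

# Deliver SIGINT to ourselves shortly after the loop starts.
loop.call_later(0.05, os.kill, os.getpid(), signal.SIGINT)

try:
    loop.run_until_complete(task)
except asyncio.CancelledError:
    pass
finally:
    loop.remove_signal_handler(signal.SIGINT)
    loop.close()

print(interrupted)
```

With this pattern the coroutines get a chance to terminate properly via ordinary cancellation, which is exactly the clean-shutdown story the comment above argues for.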
@gvanrossum I don't see any harm from #305. aiohttp (and all the other libraries I maintain) definitely won't change its own behavior after the PR is applied.
> You mentioned that a SIGINT won't be able to interrupt a tight CPU loop, but I don't really see that as a problem.
@vxgmichel This is a big no-no. In the first version of uvloop I did exactly this -- handled SIGINT and let the loop deal with it asynchronously. It was completely unusable. It turns out people write tight loops quite frequently, and the inability to stop a Python program with Ctrl-C is something they aren't prepared to handle at all. Now I'm using `PyErr_SetInterrupt` to actually interrupt any Python code on SIGINT, to replicate asyncio's behaviour.
I think the problem with the TCP server example can be fixed, with some care. The example hangs with a similar error if the handler raises a non-blocking exception (without the BaseException PR), and I think the problem is that `asyncio.start_server()` is a bit naive: it creates a Task and then forgets about it. It should at the very least collect errors from the Tasks it has created, and that error-catching code could special-case KeyboardInterrupt.
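One possible shape for that error-catching code: a hypothetical `supervised` wrapper (not anything asyncio provides) that collects per-client errors and special-cases KeyboardInterrupt, demonstrated here without any sockets:

```python
import asyncio

def supervised(handler, on_interrupt):
    # Hypothetical wrapper around a per-client handler so that errors are
    # not silently dropped; asyncio.start_server() does not do this itself.
    async def wrapper(*args):
        try:
            await handler(*args)
        except KeyboardInterrupt:
            on_interrupt()          # e.g. loop.stop(): bring the server down
        except Exception as exc:
            print("client handler failed:", exc)
    return wrapper

# Demo: a handler that raises KeyboardInterrupt, as if ^C had landed
# inside it while serving a request.
events = []

async def bad_handler():
    raise KeyboardInterrupt

loop = asyncio.new_event_loop()
loop.run_until_complete(
    supervised(bad_handler, lambda: events.append("stopped"))())
loop.close()
print(events)
```

In a real server, `supervised(handle_echo, loop.stop)` would be passed to `asyncio.start_server()` in place of the bare handler.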
Also, assuming you're not being DoS'ed with continued connect requests, hitting ^C again will exit the loop.
I crafted another example to demonstrate that some programs might get non-interruptible after PR #305. In this example, the background task is most likely to catch and ignore the first Ctrl-C (the second one will interrupt the program though).
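The example itself is not reproduced here, but the failure mode can be sketched without an event loop: if a KeyboardInterrupt is delivered *into* a suspended coroutine (which is roughly what PR #305's delivery semantics would do), a broad `except` swallows it. `Suspend` and `resilient` are made-up names for illustration:

```python
class Suspend:
    """Awaitable that suspends the coroutine once."""
    def __await__(self):
        yield

async def resilient():
    # A defensive background task: under PR #305 semantics, the first
    # Ctrl-C would be delivered here and silently swallowed.
    try:
        await Suspend()
    except BaseException as exc:
        return type(exc).__name__

coro = resilient()
coro.send(None)                     # run until the await suspends
try:
    # Deliver the "Ctrl-C" into the coroutine at its suspension point.
    coro.throw(KeyboardInterrupt)
except StopIteration as stop:
    caught = stop.value

print(caught)  # KeyboardInterrupt
```

The first interrupt never reaches the loop's caller, so the program keeps running until a second Ctrl-C arrives outside the `try` block.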
@1st1
> [...] inability to stop your Python program with Ctrl-C is something they aren't prepared to handle at all.
That is also what gets me worried about PR #305. I'd sum it up this way:
Ctrl-C stops the loop, ...

- [master] ... but asyncio internals might get corrupted if you try to recover.
- [PR #305] ... unless a background task is being executed when `KeyboardInterrupt` is raised.
- [PR #305 + `SIGINT` handler] ... unless there is synchronous code in your coroutines or callbacks.
Then I would like to evolve #305 into something that always stops the loop on ^C, unless the application takes specific measures (e.g. installing a SIGINT handler, or perhaps a try/except that catches KeyboardInterrupt).
There is an alternative possibility:

- catch all exceptions;
- add a system to stop the loop and re-raise some exceptions;
- add a configuration system to choose which exceptions to catch and re-raise, with hooks to create custom handlers for those;
- make the default configuration catch KeyboardInterrupt and re-raise it;
- then document that and let the user choose the behavior.

Having a generic middleware system to handle exceptions and decide which ones stop the loop, which ones you log, and which ones you silence would be a nice thing anyway.
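A minimal sketch of what such a configurable policy might look like (all names here are hypothetical; nothing like `ExceptionPolicy` exists in asyncio):

```python
class ExceptionPolicy:
    """Decide, per exception type, whether to stop the loop and re-raise,
    log, or silence. Default: re-raise KeyboardInterrupt, log Exceptions."""

    def __init__(self, reraise=(KeyboardInterrupt,), log=(Exception,)):
        self.reraise = reraise
        self.log = log

    def handle(self, loop, exc):
        if isinstance(exc, self.reraise):
            loop.stop()              # stop the loop; the caller re-raises
            return "reraise"
        if isinstance(exc, self.log):
            return "log"
        return "silence"

class DummyLoop:
    """Stand-in for an event loop in this sketch."""
    def __init__(self):
        self.stopped = False
    def stop(self):
        self.stopped = True

policy = ExceptionPolicy()
loop = DummyLoop()
decisions = [policy.handle(loop, KeyboardInterrupt()),
             policy.handle(loop, ValueError())]
print(decisions, loop.stopped)  # ['reraise', 'log'] True
```

A framework could ship its own policy instance, which is exactly the kind of hook the list above asks for.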
> catch all exceptions; add a system to stop the loop and re-raise some exceptions;
This sounds right.
> add a configuration system to choose which exceptions to catch and re-raise, with hooks to create custom handlers for those;
I'm not sure I like this. What's the use case? What if you have two asyncio libs requiring different configs?
Two async libs can already change the event loop, the policy, and some of the signal handlers. There is nothing to stop conflicting libs from interfering with each other except documentation and good will.
The use case for this is the creation of a framework, and the definition of the error-handling policy for that framework: logging, recovery, debugging, etc.
E.g., on the project we are working on now, we want to make debugging in dev mode very easy, especially for beginners. So we go to great lengths to detect common or weird errors and ensure the user can easily spot them, understand the problem and choose a fix.
Unfortunately, with the current architecture, this is complicated; we had to:

- create a fail-fast mode so that if you activate it, any error just breaks the server and prints a very verbose stack trace for debugging: https://github.com/Tygs/tygs/blob/master/src/tygs/app.py#L22
- install a somewhat convoluted error-handling system: https://github.com/Tygs/tygs/blob/master/src/tygs/app.py#L113 and https://github.com/Tygs/tygs/blob/master/src/tygs/app.py#L156
- make sure we deal with exceptions that are not errors but workflow handling, because on the web they are very common: https://github.com/Tygs/tygs/blob/master/src/tygs/components.py#L167
- deal with the fact that anybody can hijack the loop and stop or start it whenever it pleases them: https://github.com/Tygs/tygs/blob/master/src/tygs/app.py#L84
- take the user by the hand and ensure we get coroutines and awaitables where we need them: https://github.com/Tygs/tygs/blob/master/src/tygs/app.py#L84
And yet, while I think 30% of our unit tests are for those (https://github.com/Tygs/tygs/tree/master/tests), we still have some problems, such as some errors being silent in unit tests with pytest.
And after all that, with 100% code coverage, we are maybe at 40% of the features we expect for v0.1, as this took us many more iterations than we expected to get right. I'm not complaining: I'm glad we have asyncio in Python, and I'm super excited about being able to code this, but let's use the opportunity to make the next people's lives easier.
Error handling, life cycle and debugging are really key components to get right in asyncio, especially since handling of the loop is manual, explicit and left wide open. NodeJS devs don't have to worry about half of that.
@sametmax Thanks for the explanation. I'll take a closer look at #305 (probably will have to be rewritten) in a few days.
> And yet we still have some problems, such as some errors being silent in unit tests with pytest.
Yeah, I saw that with pytest. Was wondering what's going on but didn't have time to investigate in detail.
I ran into similar issues as @sametmax.
The principles I used, in short:
- Two types of tasks: dependent and independent. The independent ones (started by an ensure_future wrapper) are appended to a list; dependent tasks, started either with gather() or coming from main, are used for synchronous jobs (e.g. REQ-REP scenarios); independent ones are used for asynchronous communications (XREQ-XREP, PUB-SUB)
- An exception handler which checks every second whether there are any done() tasks in the tracked list, displays the exception (does not raise it), and then removes that task from the list
- Use a state machine with { running, soft-stopping, communication-stopping, hard-stopping, stopped }:
  - running is the default when starting the loop
  - a signal handler for SIGTERM/SIGINT transitions into soft-stopping and sets a flag to notify all tasks that they need to terminate gracefully pretty soon (including the exception handler and socket close)
  - soft-stopping transitions (sleep) into communication-stopping after a TIMEOUT (here we destroy the ZMQ Context)
  - communication-stopping transitions (sleep) into hard-stopping after a different TIMEOUT (here we cancel all independent tasks which are not yet done())
  - hard-stopping transitions (call_later) into stopped after a different TIMEOUT (by now the dependent tasks should have finished properly and we call loop.stop())
- The main coroutine starts the exception handler, gather()s one coroutine passed as a parameter to be run, then gather()s the exception handler in case we get an exception in the exception handler itself, and finally, if still in running mode, starts the whole unwinding procedure above (without the loop.stop())
- The main coroutine is run by loop.run_until_complete() inside a try/except/finally block where we finally display any exceptions that the exception handler might have missed between the start of soft-stopping and the end of the winding down; the except clause passes on any Cancelled exceptions
- At the end I do a loop.close() and sys.exit(0), just in case there's something still living
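The reaper task described in the second bullet might look roughly like this (names such as `track` and `exception_reaper` are made up, and the loop is bounded to a few rounds for the demo):

```python
import asyncio

tracked = []  # independent tasks registered by an ensure_future-style wrapper

def track(coro):
    # Hypothetical wrapper: keep a reference so failures can be reaped later.
    task = asyncio.get_running_loop().create_task(coro)
    tracked.append(task)
    return task

async def exception_reaper(interval=0.05, rounds=3):
    # Periodically reap finished tasks and report their exceptions instead
    # of letting "Task exception was never retrieved" fire at GC time.
    # Normally this would be `while True`; bounded here so the demo ends.
    reported = []
    for _ in range(rounds):
        await asyncio.sleep(interval)
        for task in [t for t in tracked if t.done()]:
            tracked.remove(task)
            exc = task.exception()   # retrieving it suppresses the warning
            if exc is not None:
                reported.append(type(exc).__name__)
    return reported

async def boom():
    raise ValueError("oops")

async def main():
    track(boom())
    return await exception_reaper()

loop = asyncio.new_event_loop()
reported = loop.run_until_complete(main())
loop.close()
print(reported)  # ['ValueError']
```

The exception is displayed by the reaper rather than raised, matching the behaviour described above.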
The more I think about it, the more I realise the event loop is kind of half a state machine. And each of us is reinventing the rest, including defined states, event propagation, error handling, etc., so we can have a clean use of asyncio.
This thread is running off the rails. Asyncio is a much lower level than nodejs, like it or not. Somebody should probably write an opinionated framework for writing servers on top of it, but such a framework does not belong in the stdlib -- just like a web framework does not belong in the stdlib even though sockets are. Different groups may write, promote and adopt different frameworks, just like for web frameworks we have Pyramid, Django, Flask, and many others. (And if you want to discuss this more, please start a new issue so it doesn't add noise to this issue. Or go to python-ideas.)
Re-focusing on KeyboardInterrupt, I think the deep problem is due to the way signals work and are expected to work. When you have a callback that somehow doesn't return, e.g. because it contains `while True: pass`, the signal is the only way to break through (and note that Python itself already does some hackery here to ensure the exception is raised at the boundary between two opcodes). Here `loop.add_signal_handler()` would prevent desirable functionality, since it would make such a while-True loop essentially unstoppable.
But when the event loop or some other thing (e.g. a future or queue) is updating its own state, we probably don't want the signal to interrupt, and there the behavior of `loop.add_signal_handler()` is desirable. (Which, BTW, at the scale of callbacks does the same thing that Python itself does at the scale of opcodes -- delaying the "user" handling of the signal until a clean breaking point.)
Maybe we should mask SIGINT except when callbacks are running or when we're blocked in the selector? And maybe `Task._step()` should not call `set_exception()` when it catches BaseException? (Perhaps instead it should cancel the task?) Then SIGINT will still interrupt callbacks, and it will also interrupt I/O waits, and it can then just be raised out of `run_forever()` or `run_until_complete()` without changing any internal state.
That's still thinking aloud, but with a focus on the known use cases for ^C.
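On Unix, the masking half of that idea can be sketched with `signal.pthread_sigmask` (this only illustrates the mechanism; it is not asyncio code):

```python
import signal
from contextlib import contextmanager

# Sketch: keep SIGINT blocked while the loop mutates its own state,
# unblocking it only while user callbacks run or while blocked in the
# selector. Unix-only, main thread only.

@contextmanager
def sigint_allowed():
    old = signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGINT})
    try:
        yield
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old)

# The loop core would block SIGINT up front...
signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGINT})

# ...and temporarily allow it around each callback (where handle._run()
# would go in the real loop). Passing an empty set just queries the mask.
with sigint_allowed():
    during = signal.pthread_sigmask(signal.SIG_BLOCK, set())

after = signal.pthread_sigmask(signal.SIG_BLOCK, set())
print(signal.SIGINT in during, signal.SIGINT in after)  # False True

# Restore the default mask so the rest of the process is unaffected.
signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGINT})
```

A ^C arriving while the mask is in place stays pending and is delivered as soon as SIGINT is unblocked, i.e. at the next clean breaking point.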
Would it be possible to get an update on this issue? Have any of the described behaviours changed since then?