ipykernel icon indicating copy to clipboard operation
ipykernel copied to clipboard

Replace Tornado with AnyIO

Open davidbrochart opened this issue 2 years ago • 30 comments

References

Supersedes #876. Closes #656.

IPykernel should not depend on a web server. It uses Tornado as an async framework, but modern Python has better solutions, including standard asyncio. Tornado's use in ipykernel involves using a lot of queues and creates complexity that makes maintenance or improvements to the code base harder. Structured concurrency as introduced by Trio is gaining a lot of traction. AnyIO brings structured concurrency on top of asyncio and acts as an abstraction layer which allows to choose a concurrency backend, such as asyncio or trio. Major projects have switched to AnyIO, including FastAPI.

This PR is still a work in progress. I have disabled some tests, like the ones for in-process kernels. Also, I think we should re-think event-loop integration (maybe using Trio's approach). But I would like to get feedback from the community and discuss how we can move forward with these changes.

cc @SylvainCorlay @JohanMabille @minrk @ccordoba12

Code changes

The dependency to Tornado is dropped, in favor of AnyIO. This means that instead of using ZMQStream with on_recv callbacks, we use async/await everywhere. Using AnyIO in the main thread event loop gives us Trio support for free in the user code. It is currently disabled until we get AnyIO support in pyzmq, or decide to process shell messages in a separate thread, which might be needed for sub-shells anyway.

User-facing changes

The goal is to have no user-facing change. However, I have identified the following ones:

  • handling of standard streams (stdout, stderr) is improved as there is no delay anymore.
  • interrupting async user code now works (see #881).
  • traceback when interrupting blocking user code leaks one frame into the kernel source code (might be fixable).

Backwards-incompatible changes

  • The stream traits (e.g. shell_stream) are replaced with their socket equivalent (e.g. shell_socket).
  • The API is now mostly async, including the Kernel.start method.
  • There is a new blocking Kernel.stop (thread-safe) method.

davidbrochart avatar Jan 25 '23 09:01 davidbrochart

+1 on AnyIO, they seem to have all the missing primitives needed for doing asyncio+threaded work, seems like a solid project.

maartenbreddels avatar Jan 25 '23 09:01 maartenbreddels

As a trio user I'm +:100: for using AnyIO :heart:

dhirschfeld avatar Jan 25 '23 09:01 dhirschfeld

I must point out that using blocking portals in this manner will block the original event loop, preventing any other tasks from running until the blocking call is complete. What should also be taken into account is any Protocols and tasks launched from the nested loop, which will also stop functioning once the blocking call completes.

As a former Java programmer, I recall that Java used a similar approach to solve this problem, but there the nested event loop was able to process events from the parent loop.

agronholm avatar Jan 25 '23 12:01 agronholm

Thanks for the feedback @agronholm. Actually this PR doesn't use blocking portals. Instead, I'm running a new event loop in a separate thread (for the control thread). So there is no nested loops, just two separate (parallel) loops, each in their own thread. There shouldn't be any issue with that, right? Except if there is a global state in AnyIO's event loop, which I doubt.

davidbrochart avatar Jan 25 '23 13:01 davidbrochart

I'm running a new event loop in a separate thread (for the control thread). So there is no nested loops, just two separate (parallel) loops, each in their own thread. There shouldn't be any issue with that, right?

Yeah, sorry, it's a rather large and complex PR so I haven't read through it, so I just made assumptions based on your other posts. So are you creating one-off event loops for running async callables from sync functions? Or do you have a lingering, shared event loop thread running in the background? Either way, thread safety is the biggest problem. One of the main selling points of async is that you can reliably determine where you can mutate shared objects safely (without explicit locks) since all other code is suspended until the next yield point. Those guarantees go out the window when two simultaneously running event loops share memory.

Except if there is a global state in AnyIO's event loop, which I doubt.

Technically there is no such thing as an AnyIO event loop :smile:

agronholm avatar Jan 25 '23 13:01 agronholm

Or do you have a lingering, shared event loop thread running in the background?

Yes, the control thread runs for the whole duration of the kernel process, so we're not creating lots of event loops. The control channel running in its own thread was already there before this PR, and we're taking care of thread safety. We need parallelism as the control channel should work even when executing user blocking code. At least this PR should not break things with regards to that.

davidbrochart avatar Jan 25 '23 13:01 davidbrochart

Just a quick note to say that this seems significant enough to mark it for IPykernel 7.

ccordoba12 avatar Jan 25 '23 17:01 ccordoba12

Yes, I agree. It also requires pyzmq v25.0 and jupyter-client v8.0, so that will be a good occasion.

davidbrochart avatar Jan 25 '23 17:01 davidbrochart

I'd like to merge this PR and update downstream projects in other PRs. Any feedback?

davidbrochart avatar Feb 15 '23 07:02 davidbrochart

Let's move forward with this. I'll merge soon if there are no objections.

davidbrochart avatar Mar 10 '23 13:03 davidbrochart

I think at least the local tests should all be passing before merging.

blink1073 avatar Mar 10 '23 17:03 blink1073

I think at least the local tests should all be passing before merging.

There is a bit of a chicken and egg situation though with respect to the downstream projects that are tested in the CI.

SylvainCorlay avatar Jul 12 '23 14:07 SylvainCorlay

There is a bit of a chicken and egg situation though with respect to the downstream projects that are tested in the CI.

No I mean explicitly the unit tests in ipykernel

blink1073 avatar Sep 14 '23 16:09 blink1073

Thanks so much for working on this PR Steve!

davidbrochart avatar Feb 19 '24 13:02 davidbrochart

Okay, unit tests are passing (except Windows). I think we should wait for #1210 and #1212 to land and be released, before considering main to be ipykernel 7. In the mean time, I'll write some migration docs here and try to address this portion:

"traceback when interrupting blocking user code leaks one frame into the kernel source code".

blink1073 avatar Feb 19 '24 16:02 blink1073

The fact that async interrupt isn't working on Windows, and we are failing on interrupt-related tests on the jupyter_client downstream test, leads me to think there is something not quite working about the new interrupt logic (the section in do_execute with execution.interrupt = await to_thread.run_sync(self.shell_interrupt.get). I don't know enough about AnyIO yet to debug properly.

blink1073 avatar Feb 21 '24 04:02 blink1073

Yes, the AnyIO documentation says that "Windows does not natively support signals so do not rely on this in a cross platform application". This is about signal handlers in general, not even related to AnyIO, so I don't think we can have async cell interrupt support on Windows. But in current ipykernel this is not even possible, on any platform, right? So this is still an improvement.

davidbrochart avatar Feb 22 '24 08:02 davidbrochart

No, it did work previously, this test was not skipped on Windows: https://github.com/ipython/ipykernel/blob/eddd3e666a82ebec287168b0da7cfa03639a3772/tests/test_async.py#L32. Additionally, the Jupyter Client test needs to pass before we merge, since it is treating the kernel as a process, and nothing we do here should break that contract. Also note that this downstream test is running on Ubuntu.

blink1073 avatar Feb 24 '24 15:02 blink1073

I haven't looked at this test, but with a fresh install of JupyterLab I cannot interrupt a top-level await on Linux:

https://github.com/ipython/ipykernel/assets/4711805/9331be71-2a0d-4f2c-8d4c-5509e03446e1

The CancelledError arrives after the cell has finished executing.

davidbrochart avatar Feb 24 '24 17:02 davidbrochart

Indeed, the test only tests that there is a CancelledError in the reply, but it doesn't test that it arrives before the cell has finished executing.

davidbrochart avatar Feb 24 '24 17:02 davidbrochart

Great, we can skip that test on Windows, and concentrate on getting client to pass.

blink1073 avatar Feb 24 '24 21:02 blink1073

Aha! The issue is that we need to update SignalTestKernel in jupyter_client, which is directly subclassing BaseKernel. I'd like to be able to update that class to prove it it still working, but I can't yet figure out why the interrupt is not working the same way.

blink1073 avatar Feb 25 '24 19:02 blink1073

More progress, I see now that we have to handle self.kernel.shell_interrupt.get now in SignalTestKernel.do_execute.

blink1073 avatar Feb 26 '24 01:02 blink1073

FWIW, I don't think we can use anyio on anything other than asyncio until they support polling on FDs, and pyzmq will continue to require tornado to run on asyncio on Windows Proactor for the foreseeable future, until/unless someone figures out how to poll on edge-triggered FDs for these event loops that have decided to remove basic functionality on Windows.

minrk avatar Mar 01 '24 10:03 minrk

FWIW, I don't think we can use anyio on anything other than asyncio until they support polling on FDs,

Are you sure you linked to the correct issue? I'm having some trouble understanding the connection. What is lacking FD polling?

agronholm avatar Mar 01 '24 11:03 agronholm

Are you sure you linked to the correct issue

ha, not at all, sorry. Fixed link: https://github.com/agronholm/anyio/issues/386

minrk avatar Mar 01 '24 21:03 minrk

That issue has a simple workaround: calling socket.fromfd(). What do you need that can't use that?

agronholm avatar Mar 01 '24 21:03 agronholm

That issue has a simple workaround: calling socket.fromfd(). What do you need that can't use that?

Unfortunately, that doesn't help here. The socket in question cannot be read from, it can only be polled for readability. It is an edge-triggered FD, used for signaling events to process on zmq sockets. trio's wait_readable works for this using IOCTL_AFD_POLL. You can also see https://github.com/python-trio/trio/issues/52 for loads of background.

minrk avatar Mar 01 '24 22:03 minrk

I'm afraid I don't know how to proceed. It seems as if the changes from #1210 are incompatible with the changes in this PR. I am getting deadlocked on reading the control loop.

blink1073 avatar Mar 09 '24 11:03 blink1073

What's really confusing to me is why nest_asyncio would show up in the stack trace:

 async with asyncio.Lock():
  File "/Users/silvester/workspace/.venvs/ipykernel/lib/python3.10/site-packages/pluggy/_manager.py", line 80, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/Users/silvester/workspace/.venvs/ipykernel/lib/python3.10/site-packages/pluggy/_callers.py", line 39, in _multicall
    res = hook_impl.function(*args)
  File "/Users/silvester/workspace/.venvs/ipykernel/lib/python3.10/site-packages/_pytest/runner.py", line 179, in pytest_runtest_teardown
    item.session._setupstate.teardown_exact(nextitem)
  File "/Users/silvester/workspace/.venvs/ipykernel/lib/python3.10/site-packages/_pytest/runner.py", line 522, in teardown_exact
    fin()
  File "/Users/silvester/workspace/.venvs/ipykernel/lib/python3.10/site-packages/_pytest/fixtures.py", line 677, in <lambda>
    subrequest.node.addfinalizer(lambda: fixturedef.finish(request=subrequest))
  File "/Users/silvester/workspace/.venvs/ipykernel/lib/python3.10/site-packages/_pytest/fixtures.py", line 1022, in finish
    func()
  File "/Users/silvester/workspace/.venvs/ipykernel/lib/python3.10/site-packages/_pytest/fixtures.py", line 909, in _teardown_yield_fixture
    next(it)
  File "/Users/silvester/workspace/.venvs/ipykernel/lib/python3.10/site-packages/anyio/pytest_plugin.py", line 75, in wrapper
    yield from runner.run_asyncgen_fixture(func, kwargs)
  File "/Users/silvester/workspace/.venvs/ipykernel/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 1940, in run_asyncgen_fixture
    self.get_loop().run_until_complete(
  File "/Users/silvester/workspace/.venvs/ipykernel/lib/python3.10/site-packages/nest_asyncio.py", line 84, in run_until_complete
    self._run_once()
  File "/Users/silvester/workspace/.venvs/ipykernel/lib/python3.10/site-packages/nest_asyncio.py", line 107, in _run_once
    event_list = self._selector.select(timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/selectors.py", line 562, in select
    kev_list = self._selector.control(None, max_ev, timeout)

+++++++++++++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++++++++++++

blink1073 avatar Mar 09 '24 12:03 blink1073