nbclient
Prevent file descriptor leak in util.just_run
While working with papermill, I found that repeatedly running notebooks eventually bumps into the limit on the number of open files.
Following the stacktrace led me to the util.just_run function, which creates an asyncio loop in some cases but doesn't close it, leaving some eventpoll file descriptors open forever.
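For context, the leak boils down to a pattern like the one below. This is a minimal standalone reproduction, not nbclient's actual code; leaky_just_run is a hypothetical stand-in for the problematic path in util.just_run:

```python
import asyncio

import psutil


def leaky_just_run(coro):
    # Stand-in for the problematic path: when no loop is running, a new loop
    # is created for this call but never closed afterwards.
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        loop = asyncio.new_event_loop()  # opens eventpoll/selector FDs
    return loop.run_until_complete(coro)


async def noop():
    pass


proc = psutil.Process()
before = proc.num_fds()  # POSIX only; on Windows use num_handles()
for _ in range(10):
    leaky_just_run(noop())
print(proc.num_fds() - before)  # keeps growing instead of staying at 0
```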
This PR consists of two commits:
- The first commit adds a failing test showing that the current implementation leaves file descriptors open after each run. On my machine (unix selectors), it leaves 3 FDs per just_run call. It also explicitly declares psutil as a dependency; it was already a transitive dependency, but I'm promoting it to a direct one since I'm calling it directly.
- The second commit fixes the issue by stopping and closing the loop that was used, but only in cases where the method created the loop (see the sketch after this list).
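Roughly, the shape of the fix is the sketch below (not the exact diff; it assumes, like the current just_run, that an already running loop can be re-entered via nest_asyncio, and it only closes a loop that this call created itself):

```python
import asyncio

import nest_asyncio


def just_run(coro):
    """Run a coroutine to completion, closing the loop only if we created it."""
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        loop = None

    if loop is not None:
        # A loop is already running in this thread: its lifecycle is not ours
        # to manage, so don't close anything; just re-enter it.
        nest_asyncio.apply(loop)
        return loop.run_until_complete(coro)

    # No running loop: create one and make sure it is closed afterwards, so
    # the eventpoll/selector file descriptors it holds are released.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        return loop.run_until_complete(coro)
    finally:
        loop.close()
```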
Happy to discuss the approach, missing context, and next steps to fix this bug.
I see some type checks are failing. I'll probably need help to get around them. I've tried using MagicMock to swap out the loop object, but some code asserts that the loop object is an AbstractEventLoop, which makes it fail.
@davidbrochart I think this is in better shape now.
Trigger CI.
This is interesting. It seems that something in jupyter_client is now failing.
I've taken a look at the stacktrace, it might be that https://github.com/jupyter/jupyter_client/blob/3697a0d5e528be4c61fdfaaec0923acbfe0c1ea8/jupyter_client/utils.py#14 or https://github.com/jupyter/jupyter_client/blob/3697a0d5e528be4c61fdfaaec0923acbfe0c1ea8/jupyter_client/utils.py#18 is getting our closed loop, and trying to run on it.
It's also strange that the function I'm changing looks remarkably similar to the one in the stacktrace. Why is that? @davidbrochart I see that you've authored both of them (at least in part).
I'll do some debugging tomorrow or on Friday. If you have some context that might help, I'm all ears.
Also, sorry for not running the whole test suite and delaying the discovery of this bug until the CI run.
@davidbrochart there's some discussion in https://github.com/jupyter/jupyter_client/pull/772#issuecomment-1171026460.
I've tried hiding our loop from jupyter_client entirely by running asyncio.set_event_loop(None) after https://github.com/jupyter/nbclient/pull/237/files#diff-ce3eb5ea15b9f32af7e1dc032980287831900dc5538d389fc9875d7133c52220R63.
Unfortunately, https://github.com/jupyter/nbclient/blob/33b6e5fca02d6ede72643941fe0b8ec39d9c2437/nbclient/tests/test_client.py#L488 (and the test after it) then fails because kc.shutdown eventually tries to send something on the control channel, which expects a running loop.
I'm not 100% sure I can locate the bug. Is the bug in the test, which runs kc.shutdown outside of the loop?
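To make that attempt concrete, "hiding" the loop amounts to something like the following illustrative sketch (not the exact change I tried):

```python
import asyncio


def run_and_detach(coro):
    # Create a private loop, run the coroutine, then close the loop and detach
    # it from the thread, so a later asyncio.get_event_loop() inside
    # jupyter_client never receives our closed loop.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        return loop.run_until_complete(coro)
    finally:
        loop.close()
        # Detaching is what then breaks kc.shutdown in the tests: the control
        # channel still expects to find a usable current loop afterwards.
        asyncio.set_event_loop(None)
```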
I've spent more time investigating this, and I think this solution might be dangerous. It changes the "current loop" and then stops and closes it. jupyter_client expects that the loop returned by get_event_loop is not closed, and there might be other libraries that expect the same. Shipping this could break clients' processes in unexpected ways, and it might be very difficult for them to figure out why it happened.
There are several approaches we could take:
- At the end of just_run, if there wasn't a running loop, set it to none with set_event_loop(None). The drawback of that approach is that, prior to the just_run call, there might have been a non-running (but open!) loop attached to the running thread. In that case, just_run would create a new loop, overwrite the previous one, and nothing would close the old loop. That would leak FDs as well.
- We could try harder to get a loop to run our coroutine on, effectively doing the same thing as https://github.com/jupyter/jupyter_client/blob/3697a0d5e528be4c61fdfaaec0923acbfe0c1ea8/jupyter_client/utils.py#L18. If we get a non-running, non-closed loop, we'd reuse it (by running on it again). That solves the leaking problem and doesn't leave the system in a possibly broken state (a closed loop attached to the thread). A rough sketch of this option follows the list.
- We could import the jupyter_client.utils.run_sync method and use it directly, instead of reimplementing the same functionality.
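A rough sketch of the second option (mirroring the linked jupyter_client logic; the already-running-loop path that just_run handles elsewhere is omitted):

```python
import asyncio


def run_on_attached_or_new_loop(coro):
    # Prefer the loop already attached to the thread and only create (and
    # attach) a new one when there is none or it has been closed. Nothing is
    # closed here, so no closed loop is ever left attached to the thread, and
    # the previously attached loop is reused instead of being orphaned.
    try:
        loop = asyncio.get_event_loop()
        if loop.is_closed():
            raise RuntimeError("attached event loop is closed")
    except RuntimeError:
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
    return loop.run_until_complete(coro)
```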
Happy to hear your thoughts on these approaches, or new approaches entirely.
I like the idea of using jupyter_client.utils.run_sync here, and fixing it properly upstream.
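For reference, delegating to jupyter_client.utils.run_sync could look roughly like the sketch below. It assumes run_sync takes an async function and returns a synchronous wrapper (as in the linked revision), so the already-created coroutine object that just_run receives is wrapped in a small async function first:

```python
from jupyter_client.utils import run_sync


def just_run(coro):
    # Let jupyter_client own the event-loop bookkeeping instead of
    # reimplementing it in nbclient.
    async def _inner():
        return await coro

    return run_sync(_inner)()
```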