nbclient icon indicating copy to clipboard operation
nbclient copied to clipboard

Prevent file descriptor leak in util.just_run

Open mariokostelac opened this issue 3 years ago • 8 comments
trafficstars

While working with papermill I've found that repeated running of notebooks is eventually bumping into the limit of number of opened files.

Following he stacktrace led me to util.just_run function that creates asyncio loop in some cases, but doesn't close them, which leaves some eventpoll file descriptors opened forever.

This PR consists of two commits

  • The first commit creates a failing test, showing that the current implementation is leaving file descriptors opened after each run. On my machine (unix selectors), it leaves 3 FDs per just_run call. It also explicitly declares psutil as a dependency. It was already a transitive dependency, but I'm promoting to be explicit since I'm calling it directly.
  • The second commit fixes the issue by stopping and closing used loop, but just in cases where the method created the loop.

Happy to discuss the approach, missing context, and next steps to fix this bug.

mariokostelac avatar Jun 28 '22 15:06 mariokostelac

I see some type checks are failing. I'll probably need help to get around it. I've tried using MagicMock to swap loop object, but some code asserts that loop objects is AbstractEventLoop, which makes it fail.

mariokostelac avatar Jun 29 '22 06:06 mariokostelac

@davidbrochart I think this is in better shape now.

mariokostelac avatar Jun 29 '22 14:06 mariokostelac

Trigger CI.

davidbrochart avatar Jun 29 '22 14:06 davidbrochart

This is interesting. It seems that something in jupyter_client is now failing. I've taken a look at the stacktrace, it might be that https://github.com/jupyter/jupyter_client/blob/3697a0d5e528be4c61fdfaaec0923acbfe0c1ea8/jupyter_client/utils.py#14 or https://github.com/jupyter/jupyter_client/blob/3697a0d5e528be4c61fdfaaec0923acbfe0c1ea8/jupyter_client/utils.py#18 is getting our closed loop, and trying to run on it.

It's also strange that the function I'm changing looks remarkably similar to the one that's in the stacktrace. Why is that? @davidbrochart I see that you've authored both of them (at least parts).

I'll do some debugging tomorrow or on Friday. If you have some context that might help, I'm all ears.

Also sorry for not running the whole testsuite and delaying this bug discovery till CI run.

mariokostelac avatar Jun 29 '22 20:06 mariokostelac

@davidbrochart there's some discussion in https://github.com/jupyter/jupyter_client/pull/772#issuecomment-1171026460.

mariokostelac avatar Jun 30 '22 20:06 mariokostelac

I've tried hiding our loop from jupyter_client entirely by running asyncio.set_event_loop(None) after https://github.com/jupyter/nbclient/pull/237/files#diff-ce3eb5ea15b9f32af7e1dc032980287831900dc5538d389fc9875d7133c52220R63.

Unfortunately, https://github.com/jupyter/nbclient/blob/33b6e5fca02d6ede72643941fe0b8ec39d9c2437/nbclient/tests/test_client.py#L488 (and test after) then fail because kc.shutdown eventually tries to send something to control channel, which expect some running loop. I'm not 100% sure I can locate the bug. Is the bug in the test, running kc.shutdown outside of the loop?

mariokostelac avatar Jun 30 '22 20:06 mariokostelac

I've spent more time investigating this and I think this solution might be dangerous. It's changing the "current loop", and then stopping it. jupyter_client expects that the loop returned by get_event_loop is not closed, and there might be other libraries that expect the same. Shipping it could break client's processes in unexpected ways, and it might be very difficult for them to figure out why it happened.

There are several approaches we could take:

  1. At the end of just_run, if there wasn't a running loop, set it to none with set_event_loop(None). Drawback of that approach is that, prior to just_run call, there might had been a non-running (but open!) loop attached to the running thread. In that case, just_run would create a new loop, overwrite the previous loop, and nothing would close the old loop. That'd leak FDs as well.
  2. We could try harder to get some loop to run our coro, effectively do the same thing as https://github.com/jupyter/jupyter_client/blob/3697a0d5e528be4c61fdfaaec0923acbfe0c1ea8/jupyter_client/utils.py#L18. In cases of getting a non-running, non-closed loop, we'd reuse it (by running again). That solves the leaking problem, and doesn't leave the system in some possibly broken state (closed loop attached to the thread).
  3. We could import jupyter_client.utils.run_sync method and use it directly, instead of reimplementing the same functionality.

Happy to hear your thoughts on these approaches, or new approaches entirely.

mariokostelac avatar Jul 01 '22 13:07 mariokostelac

I like the idea of using jupyter_client.utils.run_sync here, and fixing it properly upstream.

blink1073 avatar Jul 28 '22 15:07 blink1073