Role of MultiKernelManager.shutdown_all

Open vidartf opened this issue 4 years ago • 6 comments

Working with the MultiKernelManager.shutdown_all I'm observing the following differences compared to looping over all kernels and shutting them down individually:

  • It doesn't seem to clean up the "ports in use" tracking.
  • It doesn't seem to stop the kernel restarter.

Given that, I'm wondering: is shutdown_all meant to be an end-of-life call only, i.e. no new kernels are meant to be started after a call to shutdown_all? If so, let's put that in the docstring. If not, let's make sure it cleans up the kernels properly :)

Another observation: with the AsyncMultiKernelManager, it is possible to call shutdown_all while some kernels are still in the process of starting up. Currently, those kernels aren't considered for shutdown until after their startup has completed, which can cause them to not be cleaned up properly. If shutdown_all is meant as an end-of-life clean-up step, that should probably be rectified as well.

vidartf avatar Feb 02 '21 11:02 vidartf

Hi @vidartf - your observations are correct and need to be addressed. Given that I see shutdown_all() only called from notebook/jupyter_server stop logic, I can only surmise this method is meant to be an end-of-life call. That said, I think we should treat it as if the server is meant to continue running and it's really meant to clear out any existing kernels.

The points you list are correct. Because shutdown_all() bypasses both the MultiKernelManager's shutdown (where the relatively recently added port caching and its cleanup live) and KernelManager's shutdown (where the restarter is stopped), it seems like it should be coded to simply call the respective MultiKernelManager's shutdown() method. I would still recommend we use now=False since we still want to give kernels a shot at graceful termination - especially those hosted in remote environments. The risk is that they don't stop, but I've found shutdown(now=True) to be no more reliable.
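
A minimal sketch of what that could look like, assuming the existing shutdown_kernel() and list_kernel_ids() methods on MultiKernelManager (exact signatures vary between versions):

# Sketch only: route shutdown_all through the per-kernel shutdown path so
# port-cache cleanup and stopping the restarter happen for every kernel.
def shutdown_all(self, now=False):
    # list_kernel_ids() returns a copy, so the dict may shrink as we iterate
    for kernel_id in self.list_kernel_ids():
        # now=False gives each kernel a shot at graceful termination
        self.shutdown_kernel(kernel_id, now=now)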

In fact, I don't think request_shutdown() or finish_shutdown() should even exist on MultiKernelManager. They only lead to confusion and I'd vote for their deprecation. They were added to make shutdown_all more "parallel": a first pass requests the shutdown, then a second pass terminates the kernel. When there are a number of kernels, this may give each kernel a bit more time to prepare for its shutdown. However, that additional time is not provided in the normal/individual shutdown sequence, and one could argue both forms of shutdown should be consistent.
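
For context, the two-pass shape described above looks roughly like this (an outline of the current behavior, not the exact source):

# Outline of the existing two-pass pattern under discussion:
def shutdown_all(self, now=False):
    kids = self.list_kernel_ids()
    for kernel_id in kids:
        self.request_shutdown(kernel_id)  # pass 1: ask each kernel to exit
    for kernel_id in kids:
        self.finish_shutdown(kernel_id)   # pass 2: wait, force-kill if needed
        self.remove_kernel(kernel_id)
    # note what never runs on this path: the port-cache cleanup and the
    # restarter stop that live in the per-kernel shutdown methods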

With the AsyncMultiKernelManager, it is possible to call shutdown_all while some kernels are still in the process of starting up.

Yeah, that may be a tougher nut to crack since shutdown_all would need to lock out start requests as well as detect long-starting kernels (that even another pass through the list might not catch). I don't think this scenario is unique to AsyncMultiKernelManager - it should exist with MultiKernelManager as well - correct? (I suppose this is a function of how signals are handled in Python, since shutdown_all() is essentially called from a SIGINT handler.)

kevin-bates avatar Feb 02 '21 15:02 kevin-bates

I'll address the points above separately, but here's an observation related to #613 about tests hanging on Windows.

  • I have been seeing hanging pytest processes when working with the tests. Even Ctrl+C fails, so Ctrl+Break is needed to kill it (SIGBREAK rather than SIGINT, I think?).
  • It seems to occur when there are leftover kernels at shutdown (I think only for the async case, but maybe not).
  • I notice that if the tests fail, the shutdown calls might not be made.
  • As far as I have been able to determine, the point of "hang" is the destroy() call for the ZMQ context. The docs are very explicit about making sure all ports are closed before calling that (a small illustration follows below).
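
A standalone illustration of that pyzmq caveat (the endpoint and socket type here are arbitrary):

# Context.term()/destroy() blocks while sockets are still open or are
# lingering on unsent messages - which looks like a hard hang in a test run.
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.PUSH)
sock.connect("tcp://127.0.0.1:5555")  # connect is async; no peer needed yet
sock.send(b"queued")                  # message sits in the socket's queue
sock.close(linger=0)                  # close first, discarding the queue...
ctx.term()                            # ...so termination returns promptly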

vidartf avatar Feb 02 '21 16:02 vidartf

[...] it seems like it should be coded to simply call the respective MultiKernelManager's shutdown() method.

I think the sync version was coded this way to allow all the kernels to shut down in parallel (as you mentioned), making the shutdown quicker. Switching the async version to the per-kernel shutdown() pattern should be less of a concern (it can still run the shutdowns concurrently), but we might want the sync version to keep the current pattern? That said, keeping the two patterns similar probably also has value. Tough call.

With the AsyncMultiKernelManager, it is possible to call shutdown_all while some kernels are still in the process of starting up.

Yeah, that may be a tougher nut to crack since shutdown_all would need to lock out start requests as well as detect long-starting kernels (that even another pass through the list might not catch).

I think it would be sufficient for the AsyncMKM to keep a collection of futures for the kernels that are not yet started, and then await those in shutdown_all:

- await km.start_kernel(**kwargs)
+ # wrap the coroutine in a Task so shutdown_all can also await it
+ fut = asyncio.ensure_future(km.start_kernel(**kwargs))
+ self._starting_kernels[kernel_id] = fut
+ await fut
+ del self._starting_kernels[kernel_id]
  self._kernels[kernel_id] = km
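
The complementary half would then live in shutdown_all - a sketch building on the hypothetical _starting_kernels mapping above (on the async class shutdown_kernel is a coroutine, and asyncio is assumed to be imported):

# Sketch only: make shutdown_all aware of kernels that are still starting.
async def shutdown_all(self, now=False):
    # first, wait for in-flight starts so those kernels land in self._kernels
    # (return_exceptions=True keeps a failed start from aborting the shutdown)
    if self._starting_kernels:
        await asyncio.gather(*self._starting_kernels.values(),
                             return_exceptions=True)
    # then shut every known kernel down through the normal per-kernel path
    await asyncio.gather(
        *(self.shutdown_kernel(kid, now=now)
          for kid in self.list_kernel_ids()))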

I don't think this scenario is unique to AsyncMultiKernelManager - it should exist with MultiKernelManager as well - correct? (I suppose this is a function of how signals are handled in Python, since shutdown_all() is essentially called from a SIGINT handler.)

If you were threaded, possibly, but otherwise, it would be unique to Async:

import asyncio
from jupyter_client.multikernelmanager import AsyncMultiKernelManager

async def my_construed_example():
    km = AsyncMultiKernelManager()
    asyncio.create_task(km.start_kernel(...))  # scheduled, not yet awaited
    await km.shutdown_all()  # may run while the kernel is still starting

For sync, it would be impossible to enter the same path, as self._kernels will always be up to date when shutdown_all is called.

vidartf avatar Feb 02 '21 17:02 vidartf

shutdown_all would need to lock out start requests

This would only be true if shutdown_all was explicitly an end-of-life call, and you seemed to suggest it should not be.

vidartf avatar Feb 02 '21 17:02 vidartf

This would only be true if shutdown_all was explicitly an end-of-life call, and you seemed to suggest it should not be.

I'm only saying I don't think jupyter_client should dictate that this be an EOL call. The one known use case (counting Notebook and jupyter_server as one) is as an EOL call, but should we treat it as such? I feel this should be an application-level decision; we should focus on stopping all kernels we know about. (I like your idea of having the AsyncMKM track the starting kernels!)

I think the sync version was coded this way to allow all the kernels to shut down in parallel (as you mentioned), making the shutdown quicker. Switching the async version to the per-kernel shutdown() pattern should be less of a concern (it can still run the shutdowns concurrently), but we might want the sync version to keep the current pattern? That said, keeping the two patterns similar probably also has value.

Yes, the shutdown is quicker in that it probably shortens the wait during the finish_shutdown pass. However, given that shutdown_all is not expected to be highly performant, I'm not sure saving a few hundred milliseconds is worth the maintenance burden this introduces.

For sync, it would be impossible to enter the same path, as self._kernels will always be up to date when shutdown_all is called.

Is this true if the signal handler were to run after the kernel had started but before its id is in the list? Perhaps I'm missing the nuances of Python and signal handlers (frankly, I probably am, given it's been just a few years with this), but it seems possible to have a kernel running whose id is not yet reflected in the list during a synchronous shutdown_all invocation, making a second pass necessary.

At any rate, I think we need to decide the best course of action here relative to {A}MKM request and finish shutdown.

My vote would be to deprecate the methods and have {A}MKM.shutdown_all call {A}MKM.shutdown_kernel for each kernel in the list; for the AMKM, also wait for starting kernels and then shut each of those down (as you suggested). I think this beats the alternative of attempting to use the {A}MKM request/finish methods and the reorganization that implies.
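
If the deprecation route wins out, a warning shim would probably suffice - a hypothetical sketch, not settled API (finish_shutdown would get the same treatment):

# Hypothetical deprecation shim for the MKM-level pass-through method
import warnings

def request_shutdown(self, kernel_id, restart=False):
    warnings.warn(
        "MultiKernelManager.request_shutdown is deprecated; "
        "use shutdown_kernel()/shutdown_all() instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    self.get_kernel(kernel_id).request_shutdown(restart=restart)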

I'd also like to deprecate the non-async MKM and KM classes, leaving only the async versions, but that's a different discussion.

kevin-bates avatar Feb 02 '21 18:02 kevin-bates

My vote would be to deprecate the methods and have {A}MKM.shutdown_all call {A}MKM.shutdown_kernel for each kernel in the list; for the AMKM, also wait for starting kernels and then shut each of those down (as you suggested). I think this beats the alternative of attempting to use the {A}MKM request/finish methods and the reorganization that implies.

I'll try to get a PR with these changes for the AMKM this week.

vidartf avatar Feb 04 '21 11:02 vidartf