dask-jobqueue
Make job submission asynchronous
I have noticed that execution of commands (e.g., condor_submit for the condor backend) appears to be synchronous. In fact, there's a small note about this in the code itself:
https://github.com/dask/dask-jobqueue/blob/master/dask_jobqueue/core.py#L305
We've started to notice this particularly on very busy batch schedulers. For example, when dask labextension (https://github.com/dask/dask-labextension) is used in a Jupyter notebook, it will spawn the Dask scheduler inside the jupyter hub process (I think I got this terminology right?) and not the notebook. Because it's in the hub itself, if dask jobqueue is non-responsive then the entire UI freezes (as no I/O is done in the event loop). This triggers user complaints of "Jupyter stops working when we use Dask".
The impact of the blocking behavior can be easily seen by replacing the submit executable with a shell script that does a sleep 20 before invoking the real submit executable.
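Just to illustrate that test setup, a wrapper along these lines is enough to reproduce the freeze (a throwaway sketch; the condor_submit.real path is only a placeholder for wherever the real executable gets moved):

#!/usr/bin/env python3
# Illustrative stand-in for condor_submit: add an artificial delay, then
# hand off to the real submit executable (placeholder path).
import os
import sys
import time

time.sleep(20)  # simulate a very busy scheduler
os.execv("/usr/bin/condor_submit.real", ["condor_submit"] + sys.argv[1:])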
For what it's worth, this should be pretty trivial to make async using asyncio.create_subprocess_exec. Its API is nearly identical to that of subprocess.Popen.
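A rough sketch of what that could look like for _submit_job (untested, and assuming the same submit_command and script_filename handling as the current blocking code):

import asyncio
import shlex


async def _submit_job(self, script_filename):
    # Build the same command line the blocking implementation would run.
    cmd = shlex.split(self.submit_command) + [script_filename]
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    # communicate() awaits process exit without blocking the event loop.
    out, err = await proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(
            f"Command exited with code {proc.returncode}:\n{err.decode()}"
        )
    return out.decode()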
Or, to avoid having to change anything else, keep using the _call method with something like:

async def _submit_job(self, script_filename):
    return await asyncio.get_running_loop().run_in_executor(None, self._call, ...)
@bbockelm I tested both solutions proposed by @mivade in our environment, but I still see some non-responsiveness (I will investigate).
it will spawn the Dask scheduler inside the jupyter hub process (I think I got this terminology right?)
For the terminology: it's in the JupyterLab UI process (the notebook server) that dask-labextension runs, not in the kernel (another process where the code gets executed, e.g. the notebook cells).
Because it's in the hub itself, if dask jobqueue is non-responsive then the entire UI freezes (as no I/O is done in the event loop)
So if I understand correctly, the condor_submit command is taking time to run, and so it blocks the whole JupyterLab UI through dask-labextension.
An easy solution is to stop using the JupyterLab extension for starting Dask clusters for the time being 😄. I understand this can be seen as a regression for users... For my part, I've never used dask-labextension to launch Dask clusters on our job scheduling system; I always do it inside a notebook cell (so in the kernel). I only use the extension to watch my computations.
I also think that this might be a Condor issue (job submission should be almost immediate in job queueing systems), or maybe something that could be handled in dask-labextension?
Anyway, if you find a simple way to make things asynchronous here, that would be welcome too!
Related to #567
I am experiencing a similar issue as described here. In fact, my workers actually exit because the main process is hanging for so long, all because it's busy waiting for condor_submit to exit.
The suggestion from @mivade of using run_in_executor fixes the issue for me, and submission is now amazingly fast. The exact code in core.py for _submit_job looks like this with the fix:
async def _submit_job(self, script_filename):
    return await asyncio.get_running_loop().run_in_executor(
        None, self._call, shlex.split(self.submit_command) + [script_filename]
    )
Would love to see this changed!
Well, I know almost nothing about asyncio. I think we should make dask-jobqueue more compatible with it, but I'm also not sure whether we can get there just by adding small changes like this. Can we?
cc @jacobtomlinson.
@jrueb it would be great to see a PR with this change. If self._call hangs for a long time with blocking IO it makes sense to run it in an executor.
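As a standalone illustration (not dask-jobqueue code): a blocking call routed through run_in_executor runs in a worker thread, so other coroutines on the same event loop keep making progress.

import asyncio
import subprocess
import time


def blocking_submit():
    # Stand-in for a slow condor_submit (needs a Unix-like sleep command).
    subprocess.run(["sleep", "5"], check=True)


async def heartbeat():
    # Keeps printing while the blocking call runs in the thread pool.
    for _ in range(5):
        print("event loop still responsive at", time.strftime("%X"))
        await asyncio.sleep(1)


async def main():
    loop = asyncio.get_running_loop()
    await asyncio.gather(
        loop.run_in_executor(None, blocking_submit),  # off the event loop
        heartbeat(),
    )


asyncio.run(main())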
Okay, I will look into it and make a PR once I have a satisfying solution. It will also be interesting to see why the last PR for this was never finished.