dask-jobqueue binder

Open lesteve opened this issue 6 years ago • 6 comments

The idea is to have a binder setup with a toy cluster so that people can play with dask-jobqueue a bit without having to set it up on their cluster.

Our SLURM CI setup uses a single Dockerfile, so maybe this image could be used to set up a binder.

Binder allows you to use a Dockerfile: https://github.com/binder-examples/minimal-dockerfile

Questions:

  • How does this idea work in practice? Is 1-2 GB of RAM enough for a toy cluster?
  • Would it work better on binder.pangeo.io (there seems to be more RAM available there)?

If this idea works, we could think about what kind of notebooks to create (related to #253).

lesteve avatar May 24 '19 07:05 lesteve

I am going to try to do this and see how far I can push it.

lesteve avatar May 24 '19 08:05 lesteve

I have some proof of concept here: https://github.com/lesteve/test-binder

Here is the binder link: https://mybinder.org/v2/gh/lesteve/test-binder/master

For now there is a single notebook simple.ipynb. Comments more than welcome @willirath @guillaumeeb!

Full disclosure: I have seen some sporadic problems with the processes supervised by supervisord (mostly mysqld not starting correctly, for some reason I have not yet figured out). I think we can probably find a workaround for this.

lesteve avatar May 28 '19 12:05 lesteve

Thanks @lesteve! This is nice!

I had trouble getting the binder to start: I needed to launch it 4 times, and I don't know why. After that, the mysqld daemon had not started, but thanks to your first cell I could start it easily.

I think the idea works, and the RAM may not be a limitation for some simple examples. There may be more RAM on the Pangeo binder, but I am not sure this will make a big difference if we don't use separate pods for the workers.

The first question that came to my mind is: how is using SLURMCluster different from using LocalCluster? That's the beauty of Dask: just replace LocalCluster with SLURMCluster and the rest of the code stays the same. So what specific example can we set up for dask-jobqueue?

  • Is LocalCluster able to use adaptive logic?
  • Should we show the different args specific to a job queuing system, like local-directory or the memory resources?
  • Should we add some HPC-like example: a Monte Carlo simulation, like a Pi computation?
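
One way the Monte Carlo suggestion could look is sketched below. It is an illustration only, not code from the test-binder repository: `count_hits`, `estimate_pi`, and `make_cluster` are hypothetical helper names, and the `cores`, `memory`, and worker counts are assumed toy values. The computation takes any `Client`, so the only part that differs between `LocalCluster` and `SLURMCluster` is how the cluster is built.

```python
import random


def count_hits(n_samples, seed):
    """Count random points of the unit square that land inside the quarter circle."""
    rng = random.Random(seed)  # one independent, reproducible stream per task
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(client, n_tasks=8, n_samples=100_000):
    """Estimate pi by running independent batches on the workers behind `client`."""
    futures = client.map(count_hits, [n_samples] * n_tasks, range(n_tasks))
    total_hits = sum(client.gather(futures))
    return 4.0 * total_hits / (n_tasks * n_samples)


def make_cluster(kind="local"):
    """Build a toy cluster. Imports are deferred so only the chosen backend is needed."""
    if kind == "slurm":
        from dask_jobqueue import SLURMCluster

        cluster = SLURMCluster(cores=1, memory="500MB")  # assumed toy resources
        cluster.scale(jobs=2)  # submit two worker jobs to the scheduler
        return cluster

    from dask.distributed import LocalCluster

    return LocalCluster(n_workers=2, threads_per_worker=1)


# Usage (requires dask.distributed, plus dask-jobqueue and SLURM for kind="slurm"):
#
#     from dask.distributed import Client
#     with make_cluster("local") as cluster, Client(cluster) as client:
#         print(estimate_pi(client))
```

Switching `"local"` to `"slurm"` in the usage lines is the only change needed to run the same estimate through the job queue.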

guillaumeeb avatar Jun 08 '19 20:06 guillaumeeb

I had another look at this, I tweaked it a bit, and it looks like this is working better than my last attempt (not sure why ...). So maybe worth revisiting?

https://mybinder.org/v2/gh/lesteve/test-binder/master?filepath=simple.ipynb

For me the main point would be a quick intro into dask-jobqueue:

  • creating the cluster + client
  • cluster.scale
  • simple example of Dask Dataframe, delayed, and futures
  • cluster.job_script
  • look at the logs created by the workers
  • mention the dashboard
  • mention the different parameters for tweaking the submission script: queue, walltime, job_extra, env_extra, etc.
  • refer to Dask documentation for more details on Dask, mentioning that SLURMCluster and LocalCluster can be replaced by each other
  • refer users to the documentation of their own cluster for more details
  • maybe more stuff that I have missed
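
A rough sketch of how the first few items of this outline might translate into notebook code follows. It is not the content of simple.ipynb: `toy_slurm_options` and `quick_intro` are hypothetical names, the partition name `"normal"` and all resource values are assumptions, and `quick_intro` only works where dask-jobqueue is installed and a SLURM scheduler is reachable. The `job_extra` and `env_extra` parameters mentioned above are left out because later dask-jobqueue releases renamed them, so check the version you have installed before using them.

```python
def toy_slurm_options():
    """Keyword arguments for SLURMCluster. Every value here is an illustrative assumption."""
    return {
        "cores": 1,
        "memory": "500MB",
        "queue": "normal",          # hypothetical partition name
        "walltime": "00:10:00",     # per-job time limit
        "local_directory": "/tmp",  # where workers write temporary files
    }


def quick_intro(options):
    """Walk through the outline: cluster + client, job script, scaling, delayed, futures."""
    # Deferred imports: this function needs dask, distributed, and dask-jobqueue.
    import dask
    from dask.distributed import Client
    from dask_jobqueue import SLURMCluster

    cluster = SLURMCluster(**options)
    client = Client(cluster)

    print(cluster.job_script())    # the submission script generated from `options`
    print(cluster.dashboard_link)  # URL of the Dask dashboard

    cluster.scale(jobs=2)          # ask the scheduler for two worker jobs

    # dask.delayed: build a small task graph, then compute it on the workers
    squares = [dask.delayed(pow)(i, 2) for i in range(10)]
    print(dask.delayed(sum)(squares).compute())

    # futures: submit a single function call and wait for its result
    print(client.submit(sum, range(10)).result())

    client.close()
    cluster.close()


# Usage on a machine with SLURM available:
#
#     quick_intro(toy_slurm_options())
```

Keeping the options in a plain dictionary makes it easy to show readers which values they would need to adapt to their own cluster.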

Comments more than welcome!

lesteve avatar Feb 28 '20 16:02 lesteve

That's great news! I'll have a look.

willirath avatar Mar 02 '20 10:03 willirath

Looks a lot more stable. Scaling the cluster up and down doesn't seem to break the Slurm scheduler anymore.

willirath avatar Mar 02 '20 10:03 willirath