
Launch scheduler remotely or on the local machine?

Open mrocklin opened this issue 9 years ago • 9 comments

Do we want to launch the dask-scheduler process on the cluster or do we want to keep it local?

Keeping the scheduler local simplifies things, but makes it harder for multiple users to share the same scheduler.

Launching the scheduler on the cluster is quite doable, but we'll need to learn on which machine the scheduler launched. I know how to do this through the SGE interface, but it's not clear to me that it is exposed through DRMAA. See this stackoverflow question. Alternatively we could pass the location of the scheduler back through some other means. This could be through a file on NFS (do we want to assume the presence of a shared filesystem?) or by setting up a tiny TCP Server to which the scheduler connects with its information.
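The "tiny TCP server" option could be sketched roughly as follows. This is a hypothetical illustration, not anything in dask-drmaa: the client process listens on a port, and the scheduler job, once launched on some cluster node, connects back and reports its own address.

```python
import socket
import threading


def wait_for_scheduler_address(port=0):
    """Listen on a port; return (bound_port, getter).

    getter(timeout) blocks until the scheduler job connects and sends
    its address as a single short message, then returns that address.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("", port))  # port=0 lets the OS pick a free port
    srv.listen(1)
    result = {}

    def _accept():
        conn, _ = srv.accept()
        result["address"] = conn.recv(1024).decode()
        conn.close()
        srv.close()

    t = threading.Thread(target=_accept, daemon=True)
    t.start()

    def getter(timeout=None):
        t.join(timeout)
        return result.get("address")

    return srv.getsockname()[1], getter


def report_scheduler_address(host, port, address):
    """Run inside the scheduler job: phone home with the scheduler address."""
    with socket.create_connection((host, port)) as conn:
        conn.sendall(address.encode())
```

The callback port would be passed to the scheduler job via its environment or command line; this sidesteps both the DRMAA hostname question and the shared-filesystem assumption.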

The current implementation just launches the scheduler on the user's machine. Barring suggestions to the contrary my intention is to move forward with this approach until there is an obvious issue.

mrocklin avatar Oct 13 '16 17:10 mrocklin

This is a good question!

This is a personal preference, but in my mind, I think it's entirely reasonable to keep it local, which would usually mean a login node in a traditional HPC environment. If you were to launch the scheduler as a separate task, it would have some advantages as you note, but would be restricted in other ways (maximum walltime, etc.) Additionally, even though you could get a hostname, it's not actually guaranteed the hostname would get you the preferred network interface name (for instance, $hostname vs $hostname-ib0 or $hostname-10g or something like that.)

davidr avatar Oct 17 '16 15:10 davidr

On every HPC cluster I've ever used, users were restricted from running long-running background tasks on login nodes. On some clusters, firewalls prevent connections to outside machines.

richardotis avatar Oct 21 '16 21:10 richardotis

In our research group (~20 people) we use a PBS cluster and all do the following:

We use ipyparallel to start an ipcontroller on the headnode and ipengines on the nodes. Then we use a JupyterHub on a different machine and connect the Client over ssh (with some custom scripts).

I am currently trying to figure out how we can use dask in our workflow. This package seems like something we could definitely use.

The ideal for us would be something like (running on the remote machine):

from dask_drmaa import DRMAACluster
cluster = DRMAACluster(ssh='sshserver')

I am very willing to help you to test this :)

basnijholt avatar Jan 16 '17 13:01 basnijholt

@basnijholt a simple way to use this would be the following:

ssh loginnode
dask-drmaa 20  # launch scheduler and 20 workers

mrocklin avatar Jan 20 '17 22:01 mrocklin

I know how to start the scheduler and the workers. The issue is to connect the client to the remote cluster.

I tried to tunnel the scheduler port, but that isn't working.

basnijholt avatar Jan 21 '17 11:01 basnijholt

Generally I think that trying to help people get around local network policies is probably out of Dask's scope. I'm open to suggestions if you think there are general solutions to this, but I suspect that the right solution is "talk to your network administrator".

mrocklin avatar Jan 21 '17 12:01 mrocklin

I don't think our network needs anything special. I am able to tunnel ports and connect to an IPython Parallel cluster for example.

Maybe I did something incorrectly, but I only tunneled the scheduler port to the machine on which I run a notebook. Do I need to do something more, like manually transfer files?
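For reference, forwarding just the scheduler port would look something like the following. The hostname `loginnode` is a placeholder, and 8786 is assumed to be the scheduler's port (the dask default):

```shell
# Forward the scheduler port from the login node to the local machine.
# Run this on the machine where the notebook runs; leave it open.
ssh -N -L 8786:localhost:8786 loginnode

# Then connect locally, e.g. in a notebook:
#   from distributed import Client
#   client = Client('tcp://localhost:8786')
```

Note this only reaches the scheduler; if workers sit on a network the local machine cannot route to, results still flow through the scheduler, but direct worker connections will fail.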

basnijholt avatar Jan 22 '17 22:01 basnijholt

Workers will need to be able to run dask-worker. Dask-drmaa generally assumes that your worker machines have the same software environment as your login node. This conversation is drifting from the original topic of this issue; if you have further questions I recommend opening a new issue.

mrocklin avatar Jan 23 '17 23:01 mrocklin

Figured I'd leave this note in case it helps someone even though this issue seems dormant.

FWIW, I have been successfully using dask-drmaa by doing the following:

  1. Logging in to the login node.
  2. Launching an interactive job.
  3. Starting the Jupyter notebook.
  4. Opening a two hop ssh tunnel to the notebook.
  5. Launching dask-drmaa as one of the steps in the notebook.
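The steps above can be sketched as follows. Hostnames (`loginnode`, `computenode`), the notebook port, and the scheduler flags are all placeholders; the interactive-job command varies by scheduler (shown here for PBS/SGE-style `qsub`):

```shell
# 1-2. Log in and start an interactive job on a compute node.
ssh loginnode
qsub -I

# 3. On the compute node, start the notebook server.
jupyter notebook --no-browser --port 8888

# 4. From the local machine, open a two-hop tunnel to the notebook
#    (-J uses the login node as a jump host; OpenSSH 7.3+).
ssh -N -L 8888:localhost:8888 -J loginnode computenode

# 5. Inside the notebook, launch the cluster:
#   from dask_drmaa import DRMAACluster
#   cluster = DRMAACluster()
```

Because the notebook itself runs inside the cluster, the scheduler and client share a network, and only the notebook port needs forwarding.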

Did the same thing with ipyparallel previously and that also worked quite well.

jakirkham avatar Feb 24 '18 00:02 jakirkham