
Start CUDAWorker without a Nanny

Open · jacobtomlinson opened this issue 3 years ago · 15 comments

Currently dask-cuda-worker/CUDAWorker create one Nanny object per GPU.

In some HPC use cases Dask workers must be run without a Nanny: the Nanny starts the Worker in a second process via multiprocessing, and part of starting that second process causes MPI_Init() to be called a second time, which PMI doesn't like.

Historically in dask-mpi we have provided a --no-nanny flag which uses the Worker class instead of the Nanny class. Recently we changed that to a --worker-class flag, which lets you specify Worker explicitly but also lets you specify CUDAWorker instead. However, CUDAWorker is really more of a CUDANanny, so we run into the same problems we had before.

It would be useful if we could configure CUDAWorker to create one Worker per GPU instead of one Nanny to avoid this.

jacobtomlinson avatar Sep 24 '21 14:09 jacobtomlinson

The problem with that is that we need to set appropriate environment variables (primarily CUDA_VISIBLE_DEVICES, but sometimes UCX variables too) before launching any process, and that is the main reason we need the Nanny today. One of the limitations is that you can't spawn a Python process and specify which environment variables are passed to it; instead it inherits the parent process's environment variables. We had a long discussion about that in https://github.com/dask/distributed/issues/3682 .

Now, assuming dask-mpi (or whatever is launching a new CUDA worker; I'm intentionally saying "CUDA worker" to avoid confusion with the current Nanny-based CUDAWorker) set every environment variable itself prior to spawning the process, and assuming I'm not forgetting any other details, then maybe it would be possible to have CUDA workers not depend on Nanny. However, this is probably a cumbersome approach: the launcher would need to set everything up exactly as CUDAWorker does today, which is quite difficult to reproduce without breaking some of the various corner cases we handle.

Unfortunately, I don't know enough about MPI and dask-mpi to say whether we have a potentially better solution at the moment, but I'm reasonably confident that what I wrote above holds true and that there isn't a much simpler path we can take.

pentschev avatar Sep 24 '21 14:09 pentschev
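A minimal sketch of the inheritance behavior described above, assuming Python's multiprocessing with the spawn start method: the only lever the launching process has is to mutate its own environment before spawning, which is essentially what the Nanny does for CUDA_VISIBLE_DEVICES today.

import multiprocessing as mp
import os

def report():
    # The spawned child inherits the parent's environment as it was at spawn
    # time; there is no argument for passing a per-child environment.
    print("child sees CUDA_VISIBLE_DEVICES =",
          os.environ.get("CUDA_VISIBLE_DEVICES"))

if __name__ == "__main__":
    # Set the variable in the parent *before* spawning so the child sees it.
    os.environ["CUDA_VISIBLE_DEVICES"] = "1"
    mp.set_start_method("spawn")
    p = mp.Process(target=report)
    p.start()
    p.join()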

If removing the Nanny altogether isn't a good idea, just getting it to not do MPI_Init() in dask-mpi (but letting the Worker do it) might be OK?

rcthomas avatar Sep 24 '21 20:09 rcthomas

> If removing the Nanny altogether isn't a good idea, just getting it to not do MPI_Init() in dask-mpi (but letting the Worker do it) might be OK?

Assuming this is possible (I don't know much about dask-mpi, and I can't find a direct call to MPI_Init()), then yes, we could do exactly what Dask-CUDA does to delay the creation of the CUDA context, something that happens as part of the Nanny preload.

pentschev avatar Sep 24 '21 20:09 pentschev

My assumption here is that MPI_Init is called at import time for mpi4py. So when the worker process starts via multiprocessing, it imports the same dependencies again, resulting in the second call. But I could be wrong.

jacobtomlinson avatar Sep 27 '21 12:09 jacobtomlinson
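That assumption holds in the default configuration: importing the MPI submodule triggers MPI initialization. mpi4py also exposes a runtime option to turn that off; a minimal sketch of that knob (an alternative not raised in the thread):

import mpi4py

# Disable automatic MPI_Init() at import time; initialization then has to be
# done explicitly, in whichever process should own MPI.
mpi4py.rc.initialize = False
mpi4py.rc.finalize = False

from mpi4py import MPI  # no MPI_Init() happens here anymore

# ... later, only in the process that should initialize MPI:
MPI.Init()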

> My assumption here is that MPI_Init is called at import time for mpi4py. So when the worker process starts via multiprocessing, it imports the same dependencies again, resulting in the second call. But I could be wrong.

If that's the case, then we have to delay that import. We do something similar to that in Distributed with UCX, allowing us to parse proper Dask configs for UCX, and only then import/initialize UCX.

pentschev avatar Sep 27 '21 14:09 pentschev
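A minimal sketch of the delayed-import pattern being described, assuming mpi4py is the module whose import triggers MPI_Init(): keep the import out of module scope so initialization only happens in the process that first asks for it.

_comm = None

def get_comm():
    # Importing mpi4py here, rather than at module top level, means MPI_Init()
    # only runs in the process that actually calls this function.
    global _comm
    if _comm is None:
        from mpi4py import MPI
        _comm = MPI.COMM_WORLD
    return _comm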

@jacobtomlinson perhaps what we should do is experiment with modifying dask-mpi so that nanny processes do the MPI_Init() but worker processes don't (the opposite of what I suggested before). I think this is more right than what I suggested before, since the nanny may in principle at some point restart the worker, and I don't think we can trust MPI implementations to always know what to do with that.

The modification to dask-mpi would be to have the nanny set an environment variable like DASK_MPI_NO_MPI_INIT=1 and to wrap the from mpi4py import MPI statement (the thing that actually ends up initializing MPI) in a check for that variable. The variable would just be for communicating from the nanny to the spawned worker process that it shouldn't do MPI_Init(). Normally it's unset, so the nanny initializes MPI, but the worker (which sees it set) doesn't.

Unless I'm mistaken this could happen all inside dask-mpi.

rcthomas avatar Sep 28 '21 03:09 rcthomas
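A sketch of the guard being proposed, assuming DASK_MPI_NO_MPI_INIT is the (hypothetical) variable the nanny would set before spawning the worker process:

import os

# DASK_MPI_NO_MPI_INIT is hypothetical: the nanny would set it before spawning
# the worker so that only the nanny ends up calling MPI_Init().
if not os.environ.get("DASK_MPI_NO_MPI_INIT"):
    from mpi4py import MPI  # the import (and MPI_Init()) happens only here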

@rcthomas do you know where the call to MPI_Init() happens?

I am worried that we can't control the subsequent import of mpi4py. I don't think it happens because the worker imports it explicitly; I think it happens because, when the process is forked, all imported packages are imported again.

jacobtomlinson avatar Sep 28 '21 09:09 jacobtomlinson

Assuming the multiprocessing start method is spawn and not fork here... And this may be a little bit sloppy but...

import os
import sys

def f(name):
    print('hello', name)

# With the spawn start method, the child re-executes this module's top-level
# code. It inherits NO_INIT from the parent's environment and therefore takes
# this branch, never importing mpi4py (so MPI_Init() is not called twice).
if os.environ.get("NO_INIT"):
    print(f"child={os.getpid()}")
    assert "mpi4py" not in sys.modules
else:
    # Parent branch: initialize MPI here, then mark any spawned children so
    # they skip initialization.
    print(f"parent={os.getpid()}")
    from mpi4py import MPI
    import multiprocessing as mp
    mp.set_start_method('spawn')

    # Communicate to the child (via the inherited environment) that it must
    # not import mpi4py.
    os.environ["NO_INIT"] = "1"

    comm = MPI.COMM_WORLD
    rank = comm.rank
    size = comm.size
    node = MPI.Get_processor_name()

    p = mp.Process(target=f, args=(node,))
    p.start()
    p.join()

    assert "mpi4py" in sys.modules

This worked for me, but maybe I'm looking at this wrong. If I threw an mpi4py import into the child branch, I got the PMI collision error. Understanding how multiprocessing's spawn start method works might clarify things.

rcthomas avatar Sep 28 '21 14:09 rcthomas

By default Distributed/Dask-CUDA will spawn new processes, but forking is also possible and generally we should avoid it. The problem with imports is having them at the top of the "current script": each new process will indeed start by importing everything there, which causes this kind of issue. The solution is generally to delay the import so it happens only in the scope of the child process. The sample above seems like a good start.

Now, speaking specifically about Dask-CUDA, what we can do is put the import, along with any other MPI-related tasks, in some function that we pass as a preload to the Nanny; it will then run in each new worker process (almost) before anything else. One thing I know runs even before the preload is communications, which I once tried to reorder in https://github.com/dask/distributed/pull/3193 but ended up not doing because of potential use cases that depend on that exact ordering. If MPI is supposed to interact directly with communications, then we may need to figure out how to order things properly.

pentschev avatar Sep 28 '21 15:09 pentschev
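A minimal sketch of what such a preload could look like, assuming Distributed's standard preload hook (a module defining dask_setup) and that it is passed via the worker's preload mechanism; the module name and the mpi_comm attribute are hypothetical:

# mpi_preload.py (hypothetical), passed e.g. as
#   dask-cuda-worker <scheduler-address> --preload mpi_preload.py

def dask_setup(worker):
    # The import is delayed until the preload runs inside the freshly spawned
    # worker process, so MPI_Init() happens there and only there.
    from mpi4py import MPI
    worker.mpi_comm = MPI.COMM_WORLD  # hypothetical attribute for later use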

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Nov 23 '21 20:11 github-actions[bot]

Since dask-scheduler + dask-cuda-worker has ended up working for all the scales where I think it matters right now, dask-mpi has dropped down the list of things I am worried about. I do like the convenience and responsiveness of the cluster we get with dask-mpi better, but that's the main thing we'd be getting here. It would be very nice if this could be considered for the future.

rcthomas avatar Nov 24 '21 15:11 rcthomas

If we had the no-nanny option, at least for the CLI, users could then build workers manually:

CUDA_VISIBLE_DEVICES=0 UCX_OPTION_FOO dask-cuda-worker ucx:// --no-nanny
CUDA_VISIBLE_DEVICES=1 UCX_OPTION_FOO dask-cuda-worker ucx:// --no-nanny

quasiben avatar Jan 24 '22 18:01 quasiben

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] avatar May 24 '22 21:05 github-actions[bot]