dask-cuda
Start CUDAWorker without a Nanny
Currently dask-cuda-worker/CUDAWorker creates one Nanny object per GPU. In some HPC use cases Dask workers must be run without a Nanny, because the Worker then runs via multiprocessing and part of starting the second process causes MPI_Init() to be called a second time, which PMI doesn't like.
Historically dask-mpi has provided a --no-nanny flag, which uses the Worker class instead of the Nanny class. Recently we replaced that with a --worker-class flag, which lets you specify Worker explicitly but also allows you to specify CUDAWorker instead. However, CUDAWorker is really more of a CUDANanny, so we run into the same problems we had before.
It would be useful if we could configure CUDAWorker to create one Worker per GPU instead of one Nanny, to avoid this.

The problem is that we need to set appropriate environment variables (primarily CUDA_VISIBLE_DEVICES, but sometimes UCX variables too) before launching any process, and that is the main reason Nanny is needed today. One limitation is that you can't spawn a Python process while specifying which environment variables are passed to it; instead it inherits the parent process's environment variables. We had a long discussion about that in https://github.com/dask/distributed/issues/3682 .
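To illustrate that limitation, here is a minimal standalone sketch (not Dask-CUDA code): multiprocessing, unlike subprocess, has no way to pass a custom environment to the child, so the only option is to mutate os.environ in the parent before spawning.

```python
# Minimal sketch of the inheritance behaviour described above: the spawned
# child simply picks up whatever os.environ contained when it was started.
import multiprocessing as mp
import os

def report():
    print("child sees CUDA_VISIBLE_DEVICES =",
          os.environ.get("CUDA_VISIBLE_DEVICES"))

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # must happen before start()
    p = ctx.Process(target=report)
    p.start()
    p.join()
```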
Now, assuming we could ensure that dask-mpi, or whoever is launching a new CUDA worker (intentionally saying "CUDA worker" here to avoid confusion with the current Nanny-based CUDAWorker), sets every environment variable itself prior to spawning the process (also assuming I'm not forgetting any other details), then maybe it would be possible to have CUDA workers not depend on Nanny. However, this is probably a cumbersome approach, as the launcher would need to ensure everything is set exactly as CUDAWorker does today, which is already quite difficult to reproduce without breaking some of the corner cases we currently handle.

Unfortunately, I don't know enough about MPI and dask-mpi to tell whether there is a better solution available at the moment, but I'm reasonably confident that what I wrote above holds and there isn't a much simpler path we can take.
If removing the Nanny altogether isn't a good idea, just getting it to not do MPI_Init() in dask-mpi (but letting the Worker do it) might be OK?
> If removing the Nanny altogether isn't a good idea, just getting it to not do MPI_Init() in dask-mpi (but letting the Worker do it) might be OK?
Assuming this is possible (I don't know much about dask-mpi and I can't find a direct call to MPI_Init()), then yes, we could do exactly what Dask-CUDA does to delay the creation of the CUDA context, something that happens as part of the Nanny preload.
My assumption here is that MPI_Init is called at import time for mpi4py. So when the worker process starts via multiprocessing it imports the same dependencies, which results in the second call. But I could be wrong.
> My assumption here is that MPI_Init is called at import time for mpi4py. So when the worker process starts via multiprocessing it imports the same dependencies, which results in the second call. But I could be wrong.
If that's the case, then we have to delay that import. We do something similar in Distributed with UCX: we first parse the relevant Dask configs for UCX, and only then import/initialize UCX.
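A minimal sketch of that delayed-import pattern (the names here are illustrative, not the actual Distributed UCX code):

```python
# Nothing MPI-related is imported at module level, so merely importing this
# module in a freshly spawned process has no side effects; MPI_Init only
# happens on first use.
_mpi = None

def get_mpi():
    global _mpi
    if _mpi is None:
        from mpi4py import MPI  # MPI_Init() runs here, once, on demand
        _mpi = MPI
    return _mpi
```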
@jacobtomlinson perhaps what we should do is experiment with modifying dask-mpi so that nanny processes do the MPI_Init() but worker processes don't (the opposite of what I suggested before). I think this is more correct than my earlier suggestion, since the nanny may in principle restart the worker at some point, and I don't think we can trust MPI implementations to always know what to do with that.
The modification to dask-mpi would be to have the nanny set an environment variable like DASK_MPI_NO_MPI_INIT=1 and wrap the from mpi4py import MPI (the thing that actually ends up initializing MPI) in a check for that variable. This variable would just communicate from the nanny to the spawned worker process that it shouldn't do MPI_Init(). Normally it's unset, so the nanny initializes MPI but the worker doesn't.
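A rough sketch of what that guard could look like (the variable name and its placement are only this thread's proposal, not something dask-mpi ships today):

```python
import os

if os.environ.get("DASK_MPI_NO_MPI_INIT"):
    # Spawned worker process: the nanny set the flag, so skip MPI entirely.
    MPI = None
else:
    # Nanny (or any directly launched process): importing mpi4py is what
    # actually triggers MPI_Init().
    from mpi4py import MPI
    # Make sure any process we spawn from here does not initialize MPI again.
    os.environ["DASK_MPI_NO_MPI_INIT"] = "1"
```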
Unless I'm mistaken, this could all happen inside dask-mpi.
@rcthomas do you know where the call to MPI_Init() happens?
I am worried that we can't control the subsequent import of mpi4py. I don't think it happens because the worker imports it; I think it happens because, when the process is forked, all imported packages are imported again.
Assuming the multiprocessing start method is spawn and not fork here... And this may be a little bit sloppy, but...
```python
import os
import sys

def f(name):
    print('hello', name)

if os.environ.get("NO_INIT"):
    # Spawned child: it re-imports this module, but must not import mpi4py.
    print(f"child={os.getpid()}")
    assert "mpi4py" not in sys.modules
else:
    # Parent: import mpi4py (which calls MPI_Init) and spawn the child with
    # NO_INIT set so the child skips that import.
    print(f"parent={os.getpid()}")
    from mpi4py import MPI
    import multiprocessing as mp
    mp.set_start_method('spawn')
    os.environ["NO_INIT"] = "1"
    comm = MPI.COMM_WORLD
    rank = comm.rank
    size = comm.size
    node = MPI.Get_processor_name()
    p = mp.Process(target=f, args=(node,))
    p.start()
    p.join()
    assert "mpi4py" in sys.modules
```
This worked for me, but maybe I'm looking at this wrong. If I threw in an mpi4py import in the child branch, I got the PMI collision error. Understanding how the spawn start method of multiprocessing works might clarify things.
By default Distributed/Dask-CUDA will spawn new processes, but forking is also possible and should generally be avoided. The problem with imports is having them at the top of the "current script": each new process will start by importing everything there, which causes this kind of issue. The general solution is to delay the import to the scope of the child process only. The sample above seems like a good start.
Now, talking specifically about Dask-CUDA, what we can do is put the import, along with any other MPI-related tasks, in some function that we pass as preload to the Nanny; it will then run on each new worker process (almost) before anything else. One thing I know runs before even preload is communications, which I once tried to reorder in https://github.com/dask/distributed/pull/3193 but ended up not doing because of potential use cases that depend on that exact ordering. If MPI is supposed to interact directly with communications, then we may need to figure out how to order things properly.
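As a rough illustration, a preload script along these lines could defer the mpi4py import until the worker process exists (the file name and what it does with the communicator are assumptions for illustration, not something Dask-CUDA or dask-mpi provides):

```python
# mpi_preload.py -- hypothetical preload script, passed e.g. via
# --preload mpi_preload.py
def dask_setup(worker):
    # Runs inside the freshly spawned worker process before it starts
    # accepting tasks, so this is the first time mpi4py is imported there
    # and MPI_Init() is called in that process.
    from mpi4py import MPI
    worker.mpi_comm = MPI.COMM_WORLD
```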
Since dask-scheduler + dask-cuda-worker has ended up working for all the scales where I think it matters right now, dask-mpi has dropped down the list of things I am worried about. I do like the convenience and responsiveness of the cluster we get with dask-mpi better, but that's the main thing we'd be getting here. It would be very nice if this could be considered for the future.
If we had the no-nanny option, at least for the CLI, users could then build workers manually:

```bash
CUDA_VISIBLE_DEVICES=0 UCX_OPTION_FOO dask-cuda-worker ucx:// --no-nanny
CUDA_VISIBLE_DEVICES=1 UCX_OPTION_FOO dask-cuda-worker ucx:// --no-nanny
```