ipyparallel
More complex MPI split communicator layouts?
At the moment we are manually launching a cluster with `ipcontroller` listening on a TCP/IP-over-Infiniband interface, and then launching 16 engines with, say, `mpiexec -n 16 ipengine`. All works very nicely, and ZeroMQ across Infiniband seems more than adequate for orchestrating the cluster.
So we end up with 16 engines sharing a single MPI context. Within each engine we use `MPI_COMM_SELF` to run simulations, so effectively we have 16 single-process workers. This works really nicely, but we are now getting to the stage where we would like to run larger simulations that require a communicator larger than `MPI_COMM_SELF`. I do understand how to split up `MPI_COMM_WORLD` using `mpi4py`, but I don't quite see a way to have, for example, only 2 `ipengine` processes running, each with its own 8-process sub-communicator. Clearly, splitting the communicators without reducing the number of `ipengine`s running is going to end in trouble.
Any pointers on an elegant way to achieve this?
To clarify, I do have good reason to keep all workers within the same global MPI context, so the easy option of two separate `mpiexec -n 8 ipengine` launches, each then using its own `MPI_COMM_WORLD`, is an OK solution but not a great one.
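For reference, the rank arithmetic behind such a split can be sketched in plain Python. The group size of 8 and the helper name `split_params` are assumptions for illustration; the actual `Split` call (commented out) needs an MPI runtime under `mpiexec`:

```python
def split_params(rank, group_size=8):
    """Return (color, key) for MPI.Comm.Split:
    ranks 0-7 get color 0, ranks 8-15 get color 1, etc."""
    return rank // group_size, rank % group_size

# Under mpiexec this would become:
# from mpi4py import MPI
# color, key = split_params(MPI.COMM_WORLD.rank)
# subcomm = MPI.COMM_WORLD.Split(color, key)

if __name__ == "__main__":
    for rank in range(16):
        print(rank, split_params(rank))
```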
My only idea for the sub-communicators without starting up `ipengine` on every process is to use a custom startup script:

```python
#!/usr/bin/env python
from mpi4py import MPI

world = MPI.COMM_WORLD

# set up sub-communicators/groups
world_group = world.Get_group()
N = 8  # number of processes per group
my_root = world.rank - (world.rank % N)
group = world_group.Incl(list(range(my_root, my_root + N)))

if world.rank % N == 0:  # group.rank == 0 should also work
    # start ipengine on group roots
    from ipyparallel.apps.ipengineapp import launch_new_instance
    launch_new_instance()
else:
    # start workers everywhere else (placeholder, implementation up to you)
    launch_worker_process()
```
Then use `mpiexec -n 16 ipengine-or-worker` instead of `ipengine`.
I'm not sure how close that gets you to your goal.
I think that's nearly it, thanks so much! I have one more question: what's the easiest way of implementing `launch_worker_process()`? I just want to have my MPI processes sitting there doing nothing until the `ipengine` starts receiving tasks. Do I need to implement my own subclass of `IPEngineApp`?
I was hoping you would know that part :). The thing I don't know is how best to issue commands from your engine to its child workers so that they are properly coordinated when your Client only has a direct handle on the engine nodes. They shouldn't need to be Engines themselves, but they are going to need to execute at least the right MPI commands. How those are issued may depend on the application.
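One possible (untested) shape for such a worker is to block on a broadcast from its group root and dispatch whatever command arrives. The names `worker_loop` and `handlers` are hypothetical, and the MPI hookup is shown only in comments, assuming a split sub-communicator already exists:

```python
def worker_loop(recv_cmd, handlers):
    """Block until a command arrives, dispatch it, repeat until shutdown.

    recv_cmd: callable returning (name, args) or None for shutdown.
              In MPI this would be: lambda: subcomm.bcast(None, root=0)
    handlers: dict mapping command names to callables.
    """
    results = []
    while True:
        cmd = recv_cmd()
        if cmd is None:  # shutdown sentinel broadcast by the group root
            break
        name, args = cmd
        results.append(handlers[name](*args))
    return results

# Simulated run (no MPI needed): two commands, then shutdown.
if __name__ == "__main__":
    cmds = iter([("add", (1, 2)), ("add", (3, 4)), None])
    print(worker_loop(lambda: next(cmds), {"add": lambda a, b: a + b}))
```

The engine-side counterpart would broadcast the same `(name, args)` tuples on the sub-communicator so the group executes collective MPI calls in lock-step, but as noted, how commands are issued may depend on the application.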
How much do you know a priori about what the workers are going to do? It may make sense to instantiate more of the shared objects prior to launching the engines and workers.
If the workers should execute the same code as their engine, then perhaps leaving everything as an engine and creating sub-views would be the easiest way to go. This notebook gives an example of creating sub-views based on the Groups, so you can do lock-step semi-global operations on subsets of engines.
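A sketch of building such sub-views, assuming a running 16-engine cluster. The helper `group_ids` is hypothetical; only the pure grouping logic runs outside a cluster, and the ipyparallel hookup is shown in comments:

```python
def group_ids(rank_by_engine, group_size=8):
    """Group engine ids into sub-groups of `group_size` consecutive
    MPI ranks, ordered by rank within each group.

    rank_by_engine: dict mapping engine id -> MPI rank
    """
    groups = {}
    for eid, rank in sorted(rank_by_engine.items(), key=lambda kv: kv[1]):
        groups.setdefault(rank // group_size, []).append(eid)
    return [groups[color] for color in sorted(groups)]

# With a running cluster (assumption, not executed here):
# import ipyparallel as ipp
# rc = ipp.Client()
# def get_rank():
#     from mpi4py import MPI
#     return MPI.COMM_WORLD.rank
# ranks = dict(zip(rc.ids, rc[:].apply_sync(get_rank)))
# subviews = [rc[ids] for ids in group_ids(ranks)]

if __name__ == "__main__":
    # engine ids need not match MPI ranks, so group by queried rank
    print(group_ids({0: 3, 1: 0, 2: 9, 3: 12}, group_size=8))
```

Each sub-view can then run lock-step operations (e.g. collective MPI calls) on just its group of engines.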
Yes in my case the workers will execute the same code as their engines so perhaps taking a view on all engines makes most sense. I don't need to do any 'coordination' from the engines to the workers, I think it should all just come out naturally from the split MPI communicator.
I'll give your second suggestion a go, thanks again for all of your help.
What are the possible issues with running MPI/`ipengine` on every process, with 3 nodes and 32 cores on each node?