ipyparallel icon indicating copy to clipboard operation
ipyparallel copied to clipboard

More complex MPI split communicator layouts?

Open jhale opened this issue 8 years ago • 6 comments

At the moment we are manually launching a cluster with ipcontroller listening on an Infiniband over TCP/IP interface, and then launching 16 ipengine using say mpiexec -n 16 ipengine. All works very nicely and ZeroMQ across Infiniband seems more than adequate for orchestrating the cluster.

So we end up with 16 engines sharing a single MPI context. Within each engine we use MPI_COMM_SELF to run simulations so effectively we have 16 single process workers. This works really nicely, but we are now getting to the stage where we would like to run larger simulations that require a communicator larger than MPI_COMM_SELF. I do understand how to split up MPI_COMM_WORLD using mpi4py, but I don't quite see a way to only have, for example, only 2 ipengine running each with its own 8 process sub-communicator. Clearly doing the splitting the communicators without reducing the number of ipengines running is going to end with trouble.

Any pointers on an elegant way to achieve this?

jhale avatar Sep 05 '16 08:09 jhale

To clarify, I do have good reason to keep all workers within the same global MPI context, so the easy option of two mpiexec -n 8 ipengine and then using MPI_COMM_WORLD is an OK solution but not a great one.

jhale avatar Sep 05 '16 09:09 jhale

My only idea for the subcommunicators without starting up ipengine on every process is to use a custom startup script:

#!/usr/bin/env python
from mpi4py import MPI
world = MPI.COMM_WORLD
# setup sub comms/groups
world_group = world.Get_group()
N = 8 # number of processes per group
my_root = world.rank - (world.rank % N)
group = world_group.Incl(range(my_root, my_root + N))

if world.rank % N == 0: # group.rank == 0 should also work
    # start ipengine on group roots
    from ipyparallel.apps.ipengineapp import launch_new_instance
    launch_new_instance()
else:
    # start workers everywhere else
    launch_worker_process()

Then use

mpiexec -n 16 ipengine-or-worker

instead of ipengine.

I'm not sure how close that gets you to your goal.

minrk avatar Sep 05 '16 09:09 minrk

I think that's nearly it, thanks so much! I have one more question: what's the easiest way of implementing launch_worker_process()? I just want to have my MPI processes sitting there doing nothing until the ipengine starts receiving tasks. Do I need to implement my own subclass of IPEngineApp?

jhale avatar Sep 05 '16 11:09 jhale

I was hoping you would know that part :). The thing I don't know is how best to issue commands from your engine to its child workers so that they are properly coordinated when your Client only has a direct handle on the engine nodes. They shouldn't need to be Engines themselves, but they are going to need to execute at least the right MPI commands. How those are issued may depend on the application.

How much to you know a priori about what the worker is going to do? It may make sense to instantiate more of the shared objects prior to launching the engine and workers.

If the workers should execute the same code as their engine, then perhaps leaving everything as an engine and creating sub-views would be the easiest way to go. This notebook gives an example of creating sub-views based on the Groups, so you can do lock-step semi-global operations on subsets of engines.

minrk avatar Sep 05 '16 12:09 minrk

Yes in my case the workers will execute the same code as their engines so perhaps taking a view on all engines makes most sense. I don't need to do any 'coordination' from the engines to the workers, I think it should all just come out naturally from the split MPI communicator.

I'll give your second suggestion a go, thanks again for all of your help.

jhale avatar Sep 05 '16 12:09 jhale

What are possible issues running mpi/ipengines on each processes with 3 nodes and 32 cores on each node ?

evolvingfridge avatar Feb 25 '17 17:02 evolvingfridge