Integration with ClusterManagers.jl
Just a quick question: will there be some way to use MPI as a transport for the julia parallel stuff (as enabled by this package) in combination with, for example, the SLURM cluster manager from ClusterManagers.jl?
I guess that in general, the capability of MPI as a ClusterManager should be separate from MPI as transport.
Yes, but the current design doesn't look like that, right?
No, you are right: the transport layer and the cluster manager layer should be decoupled. We talked about this with @amitmurthy recently, and it looks like everyone is in agreement. Hopefully the parallel stuff can be made more modular in upcoming releases (> v0.4).
What we are looking for here, I think, is:
- slurm to only allocate compute nodes (and not launch any julia workers)
- write node names to a hostfile
- use mpirun to launch an MPI job on these nodes
We probably don't even need the cluster manager if the MPI job only has MPI calls (i.e., no julia parallel API calls).
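For that MPI-only case, the script handed to mpirun can be plain MPI.jl code with no julia parallel machinery at all. A minimal sketch (the file name and launch line are just placeholders, and the exact hostfile flag depends on the MPI implementation):

```julia
# hello_mpi.jl -- launched with something like:
#   mpirun -np 16 --hostfile hosts.txt julia hello_mpi.jl
using MPI

MPI.Init()
comm  = MPI.COMM_WORLD
rank  = MPI.Comm_rank(comm)
nproc = MPI.Comm_size(comm)

println("rank $rank of $nproc running on $(gethostname())")

MPI.Barrier(comm)
MPI.Finalize()
```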
I haven't used SLURM, but a cursory round of googling brought up https://computing.llnl.gov/linux/slurm/salloc.html, which could be used for the allocation; the command could be a bash script that does the following on each node:
- append node information to a hostfile on a shared filesystem
- once all the entries are added, one of the nodes, say the one with the lowest IP address, executes mpirun with this hostfile as input and julia as the command. If you are using julia parallel calls (not just the MPI API), run julia using MPIManager from MPI.jl to call mpirun and launch the workers (see the sketch after this list).
- other nodes wait for the job to finish, say by testing for the presence of a "job-complete" file, again on the shared filesystem.
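Roughly, that MPIManager path would look like the sketch below; the worker count and the pmap body are placeholders, and note that in current package layouts MPIManager lives in MPIClusterManagers.jl rather than in MPI.jl itself:

```julia
using Distributed
using MPIClusterManagers   # MPIManager shipped inside MPI.jl when this thread was written

# MPIManager invokes mpirun itself and turns the MPI ranks into ordinary julia workers
manager = MPIManager(np = 4)
addprocs(manager)

# from here on, the usual julia parallel API runs on the MPI-launched workers
results = pmap(x -> x^2, 1:10)
println(results)

rmprocs(workers())   # shut the MPI workers down when finished
```

As far as I understand, in this default mode MPI is only used among the workers; the master still talks to them over the regular socket transport, which is why MPI-as-transport is a separate question.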
I have to admit I'm really new to this and don't really understand all the aspects yet. I'm using a Berkeley cluster and following these guidelines. Essentially I wrote a .sh script with the SLURM info at the top as comments and an mpirun julia somescript.jl in it. I then called sbatch with the name of that script as an argument. SLURM then magically allocated nodes and executed that script.
That seemed to work for direct MPI calls, but I have too little understanding of how the whole scenario with MPI as a transport layer for the julia parallel stuff actually works.
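For reference, the MPI-as-transport scenario is the one where mpirun (inside the sbatch script) starts every julia process and the julia parallel API is then routed over MPI instead of TCP. A rough sketch of the julia side, using the names from the current MPIClusterManagers.jl API (which grew out of MPI.jl's cluster manager code); treat the file name and launch line as placeholders:

```julia
# transport_demo.jl -- run from the batch script with something like:
#   mpirun -np 16 julia transport_demo.jl
using Distributed
using MPIClusterManagers

# rank 0 returns here as the julia master; all other ranks become workers
# and serve requests over MPI instead of TCP
manager = MPIClusterManagers.start_main_loop(MPI_TRANSPORT_ALL)

# only the master executes this part
results = pmap(x -> x^2, 1:10)
println(results)

# shut the worker ranks down once the parallel work is finished
MPIClusterManagers.stop_main_loop(manager)
```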
This is a bit old, but I am interested in this and have a lot of experience with SLURM (the cluster at work, which I support for users, uses it). Modern MPI implementations are aware of the cluster manager. In a default MPI install, without linking in the cluster manager's Process Management Interface (i.e., libpmi) libraries, MPI can still directly pull things such as the number of tasks and the allocated hosts from the cluster manager without any extra work. There are also more advanced features in newer versions of libraries like MPICH, MVAPICH, and OpenMPI, which require linking in libpmi or libpmi2 when building the MPI library. This is explored in depth here.