MPI_Comm_spawn() giving unreachable errors across nodes
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
4.0.4
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Built from source tarball using gcc
Please describe the system on which you are running
- Operating system/version: RHEL 7.9
- Computer hardware: Intel Xeon E5-2670
- Network type: Infiniband
Details of the problem
I have a master/worker setup in which the master spawns some number of workers using MPI_Comm_spawn. In practice, the number of workers is dynamic and partially determined by the master at runtime, so I can't just launch all the jobs via mpirun.
master.cpp
#include <mpi.h>

void doStuff() {
    // does stuff
}

int main(int argc, char *argv[]) {
    int NUM_JOBS = 10;
    MPI_Init(&argc, &argv);

    MPI_Info mpi_info;
    MPI_Info_create(&mpi_info);
    MPI_Info_set(mpi_info, "hostfile", "nodefile");

    MPI_Comm child_comm;
    MPI_Comm_spawn("worker", MPI_ARGV_NULL, NUM_JOBS, mpi_info, 0,
                   MPI_COMM_SELF, &child_comm, MPI_ERRCODES_IGNORE);
    doStuff();
    MPI_Finalize();
    return 0;
}
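For what it's worth, the failure looks the same when I collect the per-process error codes instead of passing MPI_ERRCODES_IGNORE. This variant is just a sketch of how I've been checking; the errcodes array and the MPI_ERRORS_RETURN error handler are my additions, not part of the original repro:

```cpp
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int NUM_JOBS = 100;
    MPI_Init(&argc, &argv);

    MPI_Info mpi_info;
    MPI_Info_create(&mpi_info);
    MPI_Info_set(mpi_info, "hostfile", "nodefile");

    // Return errors instead of aborting, so the spawn result can be inspected.
    MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN);

    int *errcodes = (int *)malloc(NUM_JOBS * sizeof(int));
    MPI_Comm child_comm;
    int rc = MPI_Comm_spawn("worker", MPI_ARGV_NULL, NUM_JOBS, mpi_info, 0,
                            MPI_COMM_SELF, &child_comm, errcodes);
    if (rc != MPI_SUCCESS) {
        // Report which of the spawned ranks failed to start.
        for (int i = 0; i < NUM_JOBS; i++) {
            if (errcodes[i] != MPI_SUCCESS)
                fprintf(stderr, "spawn of worker %d failed: %d\n", i, errcodes[i]);
        }
    }
    free(errcodes);
    MPI_Info_free(&mpi_info);
    MPI_Finalize();
    return 0;
}
```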
worker.cpp
#include <mpi.h>

void doStuff() {
    // does stuff
}

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    doStuff();
    MPI_Finalize();
    return 0;
}
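In my real worker the first thing doStuff() does is talk back to the master over the inter-communicator. A minimal sketch of that handshake (the MPI_Barrier stands in for the actual traffic; it's illustrative, not my real payload):

```cpp
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    // Get the inter-communicator back to the master that spawned us.
    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);
    if (parent == MPI_COMM_NULL) {
        fprintf(stderr, "worker was not started via MPI_Comm_spawn\n");
    } else {
        // Example cross-communicator synchronization with the master.
        MPI_Barrier(parent);
    }

    MPI_Finalize();
    return 0;
}
```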
The system has 24 cores per node, and I have 5 nodes assigned to me. I've created a hostfile with 101 lines, so I could theoretically run a master and 100 workers.
Running 101 workers without a master works perfectly fine: mpirun launches all of the processes across the nodes, although without a master they just sit idle.
export OMPI_MCA_btl_openib_allow_ib=1
export OMPI_MCA_btl_openib_if_include="mlx4_0:1"
mpirun -n 101 --mca btl tcp,openib,self --hostfile nodefile worker
Running a master with 10 workers (as hard coded in the example above) also works fine. Note that because I have 24 cores per node, all 11 of these processes are on the same node.
export OMPI_MCA_btl_openib_allow_ib=1
export OMPI_MCA_btl_openib_if_include="mlx4_0:1"
mpirun -n 1 --mca btl tcp,openib,self --hostfile nodefile master
However, if I up NUM_JOBS in master.cpp to 100 and run the same script, it fails. I get an error message like this:
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[53646,2],88]) is on host: mymachine04
Process 2 ([[53636,1],0]) is on host: unknown!
BTLs attempted: self openib tcp
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[mymachine04:22480] [[53646,2],88] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
...
[mymachine01:124495] 76 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
[mymachine01:124495] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[mymachine01:124495] 76 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[mymachine01:124495] 76 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
That "ORTE_ERROR_LOG: Unreachable" line is repeated a total of 77 times, once for each of the 77 jobs on other nodes.
Note that I can freely ssh between nodes, and, as mentioned above, launching 101 processes directly with mpirun works fine. Any ideas why they wouldn't be able to communicate across nodes when I use MPI_Comm_spawn?
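If it helps, I can re-run with BTL verbosity turned up. This is how I've been capturing which transports each process selects (the log filename is just my choice):

```shell
export OMPI_MCA_btl_openib_allow_ib=1
export OMPI_MCA_btl_openib_if_include="mlx4_0:1"
# btl_base_verbose makes each process log which BTL modules it opens
# and which peers it decides are reachable over each one.
mpirun -n 1 --mca btl tcp,openib,self --mca btl_base_verbose 100 \
       --hostfile nodefile master 2>&1 | tee spawn_debug.log
```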