
`srun` with OpenMPI 5.0.3 unexpectedly launches MPI jobs as singletons without an error

Open · GayatriManda opened this issue 3 weeks ago • 1 comment

As an OpenMPI user, I noticed unexpected behavior when running MPI programs with Slurm’s srun.

  • Environment: module load OpenMPI/5.0.3
  • What happens:

Using mpirun (works as expected)

Hello from proc 0 of 4
Hello from proc 1 of 4
Hello from proc 2 of 4
Hello from proc 3 of 4

Using srun --mpi=pmi2

No PMIx server was reachable, but a PMI1/2 was detected.
If srun is being used to launch application, 4 singletons will be started.
Hello from proc 0 of 1
Hello from proc 0 of 1
Hello from proc 0 of 1
Hello from proc 0 of 1

Using plain srun (without explicitly mentioning --mpi)

Hello from proc 0 of 1
Hello from proc 0 of 1
Hello from proc 0 of 1
Hello from proc 0 of 1
  • Why this is a problem: With --mpi=pmi2, OpenMPI at least prints a runtime message before falling back to singleton mode. With plain srun, the same fallback happens but no warning is shown. This is very misleading for a user: the job looks like a normal MPI run, but every process starts as a singleton (rank 0 of 1), so no communication happens and resources are wasted.

  • What I would expect: It would be more helpful if OpenMPI issued an error or warning whenever it cannot connect to PMI/PMIx under srun, rather than silently launching singletons, and showed the same warning even when --mpi= is not explicitly specified.

This would prevent us from unintentionally running incorrect MPI jobs.
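Until something like that exists, a user-side guard can catch the silent fallback. The following is only a minimal sketch, not anything OpenMPI provides: it assumes the application is a Python program using mpi4py, and that Slurm exports SLURM_NTASKS into each task's environment. The helper names (expected_tasks, is_unexpected_singleton, guard_against_singletons) are hypothetical.

```python
import os
import sys


def expected_tasks(env):
    """Return the task count advertised by Slurm, or None if unavailable.

    Assumption: SLURM_NTASKS is present in the task environment under srun.
    """
    raw = env.get("SLURM_NTASKS")
    if raw is None:
        return None
    try:
        return int(raw)
    except ValueError:
        return None


def is_unexpected_singleton(world_size, env=None):
    """True when Slurm started several tasks but MPI sees a world of size 1."""
    env = os.environ if env is None else env
    ntasks = expected_tasks(env)
    return ntasks is not None and ntasks > 1 and world_size == 1


def guard_against_singletons():
    """Exit with an error instead of silently running as rank 0 of 1.

    Assumption: mpi4py is installed; it is imported lazily so the pure
    helpers above can be used without an MPI installation.
    """
    from mpi4py import MPI

    size = MPI.COMM_WORLD.Get_size()
    if is_unexpected_singleton(size):
        sys.stderr.write(
            "MPI world size is 1 but SLURM_NTASKS is %s: "
            "the job was probably launched as singletons.\n"
            % os.environ.get("SLURM_NTASKS")
        )
        sys.exit(1)
```

Calling guard_against_singletons() once at program start would make each of the four singleton processes above exit with a non-zero status rather than print "Hello from proc 0 of 1". A single-task job (SLURM_NTASKS=1) or a run outside Slurm is left alone.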

GayatriManda, Nov 19 '25 21:11