Getting stuck on MPI_Finalize() when using ULFM

Open rcoacci opened this issue 2 years ago • 3 comments

As discussed in the ULFM mailing list: https://groups.google.com/g/ulfm/c/2VRCwoEyj0M/m/0Dsf8OvZAAAJ

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Main branch at 68395556ce

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Installed from git clone of main.

Please describe the system on which you are running

  • Operating system/version: CentOS 7
  • Network type: TCP over IB/Ethernet.

Details of the problem

I'm currently trying ULFM from the Open MPI main branch (specifically commit 68395556ce). On a single node everything works fine, but as soon as I add another node, the living processes get stuck in MPI_Finalize().

The test program I'm using is a variant (with more printf's basically) of https://github.com/ICLDisco/ulfm-testing/blob/master/tutorial/02.err_handler.c.

The cluster in question is a production/development cluster that has InfiniBand, GPUs, and Ethernet. I didn't enable UCX in the Open MPI install (but left CUDA, as seen in ompi_info.txt), and it seems to be using TCP without problems. I tried forcing it to use the Ethernet interface (via btl_tcp_if_include) but got the same results. I'm running it through the cluster's Slurm installation (unfortunately it's 20.11.9; as you probably know, that's hard to change on a production cluster) using sbatch and the following mpirun command line:

mpirun --with-ft ulfm --display-comm --display-comm-finalize err_handler

The --display-comm parameters assure me that it's using TCP for communication between nodes.

After some more testing I found out that disabling shared memory (with --mca btl ^sm) makes the living processes exit (no one gets stuck at MPI_Finalize()) but the job never finishes and prted/prterun/srun processes keep running depending on the node.

So it seems there are maybe two issues here: one related to the sm component, and the other related to slurm/prted/prterun.
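For reference, the variants described above can be written out as a small dry-run sketch. It only prints the mpirun command lines and launches nothing. The `build_cmd` helper and the variant names are hypothetical, and `eth0` is a placeholder interface name (an assumption, not the cluster's real interface); the mpirun options themselves are the ones quoted in this report.

```shell
#!/bin/sh
# Dry-run sketch: prints the mpirun command line for each variant described
# above instead of launching it.
# ASSUMPTION: eth0 is only a placeholder for the real Ethernet interface.
IFACE="${IFACE:-eth0}"

# Options common to all variants (taken from the report).
BASE="mpirun --with-ft ulfm --display-comm --display-comm-finalize"

# build_cmd VARIANT: echo the full command line for one variant
# (hypothetical helper, for illustration only).
build_cmd() {
    case "$1" in
        default) echo "$BASE err_handler" ;;
        eth)     echo "$BASE --mca btl_tcp_if_include $IFACE err_handler" ;;
        nosm)    echo "$BASE --mca btl ^sm err_handler" ;;
        *)       echo "unknown variant: $1" >&2; return 1 ;;
    esac
}

for v in default eth nosm; do
    build_cmd "$v"
done
```

With `default` and `eth` the living processes hang in MPI_Finalize(); with `nosm` they exit, but the job itself never finishes.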

rcoacci, Feb 10 '23 18:02

Attaching the ompi_info output: ompi_info.txt

rcoacci, Feb 10 '23 18:02

FYI, the SC22 programs all work with these options, but only on ONE NODE:

OPTIONS="--with-ft ulfm --map-by :oversubscribe --mca btl tcp,self"

Could it be that ULFM only works on one node?

That wasn't my understanding, but I can't find any examples anywhere of anyone using ULFM on more than one node...

Perhaps someone from that project would take a look at this...

daa4453, Mar 18 '24 21:03

I should add that I tried --mca pml ob1 as well, with no change in behavior.

I also tried with 5.0.2 and the nightly source drop from 3-14, which I think includes the PMIx/PRRTE update.

daa4453, Mar 19 '24 17:03