Up the submodule pointers for PMIx and PRRTE
Test against OMPI CI
Thanks for the PR. I'm running AWS CI.
This PR failed AWS internal CI. Seeing a lot of failures:
mpirun --wdir . -n 72 --hostfile hostfile --map-by ppr:36:node --timeout 1800 -x PATH mpi-benchmarks-IMB-v2021.7/IMB-MPI1 Scatterv -npmin 72 -iter 200 -time 20 -mem 1 2>&1 | tee node2-ppn36.txt
INFO root:utils.py:507 mpirun output:
--------------------------------------------------------------------------
It looks like MPI runtime init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during RTE init; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
num local peers
--> Returned "Bad parameter" (-5) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and MPI will try to terminate your MPI job as well)
[ip-172-31-22-21:73523] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
.....
I'm running it again.
No immediate ideas - it is working fine for me, and obviously passed all your standard CI's. I'd have to know more about the particular setup to provide suggestions on what you might try.
@wenduwan Any update on this? I'm still unable to reproduce any problems.
I ran our tests again with many failures, as shown above. I haven't had a chance to look into it yet.
A quick glance shows that building with --enable-debug makes those failures go away.
Finally I got some time to look into this.
The issue happens on 2 nodes during MPI_Init
...
[ip-172-31-4-77.us-west-2.compute.internal:26954] mca: base: components_open: component tcp open function successful
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: initializing btl component self
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: init of component self returned success
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: initializing btl component smcuda
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: init of component smcuda returned failure
[ip-172-31-4-77.us-west-2.compute.internal:26954] mca: base: close: component smcuda closed
[ip-172-31-4-77.us-west-2.compute.internal:26954] mca: base: close: unloading component smcuda
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: initializing btl component ofi
[ip-172-31-12-182.us-west-2.compute.internal:33452] select: init of component ofi returned success
[ip-172-31-12-182.us-west-2.compute.internal:33452] select: initializing btl component sm
[ip-172-31-12-182.us-west-2.compute.internal:33452] select: init of component sm returned failure
[ip-172-31-12-182.us-west-2.compute.internal:33452] mca: base: close: component sm closed
[ip-172-31-12-182.us-west-2.compute.internal:33452] mca: base: close: unloading component sm
[ip-172-31-12-182.us-west-2.compute.internal:33452] select: initializing btl component tcp
[ip-172-31-12-182.us-west-2.compute.internal:33452] btl:tcp: 0xebbde0: if eth0 kidx 2 cnt 0 addr 172.31.12.182 IPv4 bw 100 lt 100
[ip-172-31-12-182.us-west-2.compute.internal:33452] btl: tcp: exchange: 0 2 IPv4 172.31.12.182
[ip-172-31-12-182.us-west-2.compute.internal:33452] select: init of component tcp returned success
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: init of component ofi returned success
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: initializing btl component sm
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: init of component sm returned success
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: initializing btl component tcp
[ip-172-31-4-77.us-west-2.compute.internal:26954] btl:tcp: 0x26138f0: if eth0 kidx 2 cnt 0 addr 172.31.4.77 IPv4 bw 100 lt 100
[ip-172-31-4-77.us-west-2.compute.internal:26954] btl: tcp: exchange: 0 2 IPv4 172.31.4.77
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: init of component tcp returned success
[ip-172-31-12-182.us-west-2.compute.internal:33452] mca: bml: Using self btl for send to [[54950,1],0] on node ip-172-31-12-182
[ip-172-31-4-77.us-west-2.compute.internal:26954] [[54950,1],1] selected pml ob1, but peer [[54950,1],0] on unknown selected pml <trash>
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another. This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used. Your MPI job will now abort.
You may wish to try to narrow down the problem;
* Check the output of ompi_info to see which BTL/MTL plugins are
available.
* Run your application with MPI_THREAD_SINGLE.
* Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
if using MTL-based communications) to see exactly which
communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_mpi_init: ompi_mpi_instance_init failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
[ip-172-31-4-77:00000] *** An error occurred in MPI_Init
[ip-172-31-4-77:00000] *** reported by process [3601203201,1]
[ip-172-31-4-77:00000] *** on a NULL communicator
[ip-172-31-4-77:00000] *** Unknown error
[ip-172-31-4-77:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-4-77:00000] *** and MPI will try to terminate your MPI job as well)
@wenduwan Pushed the latest state of the master branches; there have been a number of fixes since this was originally created.
Converted to "draft" to ensure nobody merges this by mistake
Unfortunately the same tests are still failing...
Afraid you aren't giving me much to work off of here 😞 I did see this from your printout above:
[ip-172-31-4-77.us-west-2.compute.internal:26954] [[54950,1],1] selected pml ob1, but peer [[54950,1],0] on unknown selected pml <trash>
You mentioned that --enable-debug made the problems go away? If so, that is suspiciously like what we see when someone mixes debug with non-debug libraries across nodes. The pack/unpack pairing gets off and things go haywire.
@rhc54 In our CI we build the applications separately against the debug and non-debug MPI builds, so this shouldn't be an issue.
@hppritcha I wonder if someone on your side could quickly verify this PR with OMB/IMB on 2 nodes?
You don't need to run a bloody OMB test when things are failing in MPI_Init - just run MPI "hello". I can reproduce it with --disable-debug, but the error appears to be in the pml "checker" logic. Might be the first thing pulled up from the modex, so I'll take a look there.
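For reference, a "hello" reproducer for an MPI_Init-time failure like this needs nothing more than the sketch below (a generic example, not a file from the AWS CI suite):
/* hello.c: minimal reproducer. The failures above happen inside MPI_Init,
 * so nothing beyond init/finalize is needed to trigger them. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
Launching it across the same two nodes (e.g. mpicc hello.c -o hello; mpirun -n 2 --hostfile hostfile ./hello) should exercise the same modex/pml-selection path the benchmark was tripping over.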
Okay, I tracked it down and fixed it. Hopefully okay now!
Thank you Ralph. The test has passed.
Hooray! Anyone have an idea on what mpi4py is complaining about?
probably related to #12384
I've got the alternative "spawn" code working, but the MPI message between the parent and a child process seems to be hanging or isn't getting through. I checked the modex recv calls and the btl/tcp connection info is all getting correctly transferred between all the procs (both parent and child). So I'm a little stumped.
Is there a simple way to trace the MPI send/recv procedure to see where the hang might be? Since the modex gets completed, I'm thinking that maybe the communicator isn't being fully constructed (since I eliminated the "nextcid" code), or perhaps the communicator doesn't have consistent ordering of procs in it.
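For context, the pattern being exercised is essentially a minimal comm_spawn ping, roughly the sketch below (a generic illustration assuming one spawned child, not the actual mpi4py test source): the parent spawns a child job and sends a single message across the intercommunicator, and child rank 0 posts the matching receive, which is the call that hangs.
/* spawn_ping.c: generic sketch of the connect/accept path under discussion.
 * The same binary acts as parent or child depending on whether it has a
 * parent intercommunicator. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, intercomm;
    int rank, msg = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Parent side: spawn one child, then send to child rank 0. */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
        if (rank == 0) {
            MPI_Send(&msg, 1, MPI_INT, 0, 0, intercomm);
        }
        MPI_Comm_disconnect(&intercomm);
    } else {
        /* Child side: rank 0 waits for the parent's message. In the failing
         * case this MPI_Recv never completes. */
        if (rank == 0) {
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, parent, MPI_STATUS_IGNORE);
            printf("child got %d from parent\n", msg);
        }
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}
As for tracing, the usual generic knobs are btl_base_verbose and pml_base_verbose; they show which endpoints and components are in play, though not necessarily why a particular message never matches.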
where is the alternate spawn code? I can take a look at this later this week to see about what's going wrong.
It is in the topic/dpm branch of my OMPI fork: https://github.com/rhc54/ompi. Here is the commit message:
Add a second method for doing connect/accept
The "old" method relies on PMIx publish/lookup followed by
a call to PMIx_Connect. It then does a "next cid" method
to get the next communicator ID, which has multiple algorithms
including one that calls PMIx_Group.
Simplify this by using PMIx_Group_construct in place of
PMIx_Connect, and have it return the next communicator ID.
This is more scalable and reliable than the prior method.
Retain the "old" method for now as this is new code. Create
a new MCA param "OMPI_MCA_dpm_enable_new_method" to switch
between the two approaches. Default it to "true" for now
for ease of debugging.
NOTE: this includes an update to the submodule pointers
for PMIx and PRRTE to obtain the required updates to
those code bases.
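For anyone testing both code paths, flipping the switch from inside a test program could look like the sketch below. This assumes the parameter behaves like any other MCA variable, i.e. it is picked up from the process environment during init; passing "--mca dpm_enable_new_method false" on the mpirun command line should be equivalent.
/* toggle_dpm_method.c: sketch of selecting the "old" connect/accept path.
 * The parameter name comes from the commit message above; the new
 * PMIx_Group_construct-based path is the default ("true") in this branch. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* Must be set before MPI_Init so the MCA variable system sees it. */
    setenv("OMPI_MCA_dpm_enable_new_method", "false", 1);

    MPI_Init(&argc, &argv);
    /* ... connect/accept or MPI_Comm_spawn test goes here ... */
    MPI_Finalize();
    return 0;
}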
Everything works fine, but the child rank=0 hangs in the MPI_Recv call waiting to get the message from the parent. I can't find the cause of the hang. I've printed out the contents of the communicators and they look fine, and I've checked that we aren't waiting for connection endpoints (at least, I'm not seeing it).
Help is appreciated! Minus the message, this runs through thousands of comm_spawn loops without a problem.
Updated the submodule pointers to track PMIx/PRRTE master changes
@hppritcha The branch has been renamed topic/dpm2 to avoid conflict with another pre-existing branch. Sorry for the confusion.
Just FYI: when running Lisandro's test on a single node, I get the following error message on the first iteration:
--------------------------------------------------------------------------
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.
This attempted connection will be ignored; your MPI job may or may not
continue properly.
Local host: rhc-node01
PID: 500
--------------------------------------------------------------------------
The rest of the iterations run silently. The communicator local and remote groups both look correct, so I'm not sure where OMPI is looking for the peer.
@hppritcha Current status: I have fixed a couple of bugs in the group construct operation regarding modex info storage, and I can now run Lisandro's test to completion. However, if I increase maxnp to greater than 1, then the MPI_Barrier hangs.
So it appears there is still some communication issue once the child job is larger than 1 process. No error or warning messages are being printed, so I'm not sure where to start looking.
Let me know when you have time to look at this and I'll be happy to assist.
BTW: I updated the PMIx/PRRTE submodule pointers so they include the required support for the new dpm connect/accept/spawn method
@hppritcha I believe I know the cause of this last problem and am going to work on it. Meantime, I have opened a PR with the current status so we can see how it performs in CI: https://github.com/open-mpi/ompi/pull/12398