Up the submodule pointers for PMIx and PRRTE
Test against OMPI CI
Thanks for the PR. I'm running AWS CI.
This PR failed AWS internal CI. Seeing a lot of failures:
mpirun --wdir . -n 72 --hostfile hostfile --map-by ppr:36:node --timeout 1800 -x PATH mpi-benchmarks-IMB-v2021.7/IMB-MPI1 Scatterv -npmin 72 -iter 200 -time 20 -mem 1 2>&1 | tee node2-ppn36.txt
INFO root:utils.py:507 mpirun output:
--------------------------------------------------------------------------
It looks like MPI runtime init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during RTE init; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
num local peers
--> Returned "Bad parameter" (-5) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and MPI will try to terminate your MPI job as well)
[ip-172-31-22-21:73523] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
.....
I'm running it again.
No immediate ideas - it is working fine for me, and obviously passed all your standard CI's. I'd have to know more about the particular setup to provide suggestions on what you might try.
@wenduwan Any update on this? I'm still unable to reproduce any problems.
I ran our tests again with many failures, as shown above. I haven't had a chance to look into it yet.
A quick glance shows that building with --enable-debug makes those failures go away.
Finally I got some time to look into this.
The issue happens on 2 nodes during MPI_Init
...
[ip-172-31-4-77.us-west-2.compute.internal:26954] mca: base: components_open: component tcp open function successful
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: initializing btl component self
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: init of component self returned success
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: initializing btl component smcuda
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: init of component smcuda returned failure
[ip-172-31-4-77.us-west-2.compute.internal:26954] mca: base: close: component smcuda closed
[ip-172-31-4-77.us-west-2.compute.internal:26954] mca: base: close: unloading component smcuda
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: initializing btl component ofi
[ip-172-31-12-182.us-west-2.compute.internal:33452] select: init of component ofi returned success
[ip-172-31-12-182.us-west-2.compute.internal:33452] select: initializing btl component sm
[ip-172-31-12-182.us-west-2.compute.internal:33452] select: init of component sm returned failure
[ip-172-31-12-182.us-west-2.compute.internal:33452] mca: base: close: component sm closed
[ip-172-31-12-182.us-west-2.compute.internal:33452] mca: base: close: unloading component sm
[ip-172-31-12-182.us-west-2.compute.internal:33452] select: initializing btl component tcp
[ip-172-31-12-182.us-west-2.compute.internal:33452] btl:tcp: 0xebbde0: if eth0 kidx 2 cnt 0 addr 172.31.12.182 IPv4 bw 100 lt 100
[ip-172-31-12-182.us-west-2.compute.internal:33452] btl: tcp: exchange: 0 2 IPv4 172.31.12.182
[ip-172-31-12-182.us-west-2.compute.internal:33452] select: init of component tcp returned success
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: init of component ofi returned success
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: initializing btl component sm
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: init of component sm returned success
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: initializing btl component tcp
[ip-172-31-4-77.us-west-2.compute.internal:26954] btl:tcp: 0x26138f0: if eth0 kidx 2 cnt 0 addr 172.31.4.77 IPv4 bw 100 lt 100
[ip-172-31-4-77.us-west-2.compute.internal:26954] btl: tcp: exchange: 0 2 IPv4 172.31.4.77
[ip-172-31-4-77.us-west-2.compute.internal:26954] select: init of component tcp returned success
[ip-172-31-12-182.us-west-2.compute.internal:33452] mca: bml: Using self btl for send to [[54950,1],0] on node ip-172-31-12-182
[ip-172-31-4-77.us-west-2.compute.internal:26954] [[54950,1],1] selected pml ob1, but peer [[54950,1],0] on unknown selected pml <trash>
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another. This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used. Your MPI job will now abort.
You may wish to try to narrow down the problem;
* Check the output of ompi_info to see which BTL/MTL plugins are
available.
* Run your application with MPI_THREAD_SINGLE.
* Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
if using MTL-based communications) to see exactly which
communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_mpi_init: ompi_mpi_instance_init failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
[ip-172-31-4-77:00000] *** An error occurred in MPI_Init
[ip-172-31-4-77:00000] *** reported by process [3601203201,1]
[ip-172-31-4-77:00000] *** on a NULL communicator
[ip-172-31-4-77:00000] *** Unknown error
[ip-172-31-4-77:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-4-77:00000] *** and MPI will try to terminate your MPI job as well)
@wenduwan Pushed the latest state of the master branches; there have been a number of fixes since this was originally created.
Converted to "draft" to ensure nobody merges this by mistake
Unfortunately the same tests are still failing...
Afraid you aren't giving me much to work off of here 😞 I did see this from your printout above:
[ip-172-31-4-77.us-west-2.compute.internal:26954] [[54950,1],1] selected pml ob1, but peer [[54950,1],0] on unknown selected pml <trash>
You mentioned that --enable-debug made the problems go away? If so, that is suspiciously like what we see when someone mixes debug with non-debug libraries across nodes. The pack/unpack pairing gets off and things go haywire.
@rhc54 In our CI we build the applications separately against the debug and non-debug MPI builds, so this shouldn't be an issue.
@hppritcha I wonder if someone on your side could quickly verify this PR with OMB/IMB on 2 nodes?
You don't need to run a bloody OMB test when things are failing in MPI_Init - just run MPI "hello". I can reproduce it with --disable-debug, but the error appears to be in the pml "checker" logic. Might be the first thing pulled up from the modex, so I'll take a look there.
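For reference, a "hello" reproducer for an MPI_Init-time failure like this needs nothing more than the sketch below (a generic example, not a file from the AWS CI suite):
/* hello.c: minimal reproducer. The failures above happen inside MPI_Init,
 * so nothing beyond init/finalize is needed to trigger them. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
Launching it across the same two nodes (e.g. mpicc hello.c -o hello; mpirun -n 2 --hostfile hostfile ./hello) should exercise the same modex/pml-selection path the benchmark was tripping over.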
Okay, I tracked it down and fixed it. Hopefully okay now!
Thank you Ralph. The test has passed.
Hooray! Anyone have an idea on what mpi4py is complaining about?
probably related to #12384
I've got the alternative "spawn" code working, but the MPI message between the parent and a child process seems to be hanging or isn't getting through. I checked the modex recv calls and the btl/tcp connection info is all getting correctly transferred between all the procs (both parent and child). So I'm a little stumped.
Is there a simple way to trace the MPI send/recv procedure to see where the hang might be? Since the modex gets completed, I'm thinking that maybe the communicator isn't being fully constructed (since I eliminated the "nextcid" code), or perhaps the communicator doesn't have consistent ordering of procs in it.
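For context, the pattern being exercised is essentially a minimal comm_spawn ping, roughly the sketch below (a generic illustration assuming one spawned child, not the actual mpi4py test source): the parent spawns a child job and sends a single message across the intercommunicator, and child rank 0 posts the matching receive, which is the call that hangs.
/* spawn_ping.c: generic sketch of the connect/accept path under discussion.
 * The same binary acts as parent or child depending on whether it has a
 * parent intercommunicator. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, intercomm;
    int rank, msg = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Parent side: spawn one child, then send to child rank 0. */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
        if (rank == 0) {
            MPI_Send(&msg, 1, MPI_INT, 0, 0, intercomm);
        }
        MPI_Comm_disconnect(&intercomm);
    } else {
        /* Child side: rank 0 waits for the parent's message. In the failing
         * case this MPI_Recv never completes. */
        if (rank == 0) {
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, parent, MPI_STATUS_IGNORE);
            printf("child got %d from parent\n", msg);
        }
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}
As for tracing, the usual generic knobs are btl_base_verbose and pml_base_verbose; they show which endpoints and components are in play, though not necessarily why a particular message never matches.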
where is the alternate spawn code? I can take a look at this later this week to see about what's going wrong.
It is in the topic/dpm branch of my OMPI fork: https://github.com/rhc54/ompi. Here is the commit message:
Add a second method for doing connect/accept
The "old" method relies on PMIx publish/lookup followed by
a call to PMIx_Connect. It then does a "next cid" method
to get the next communicator ID, which has multiple algorithms
including one that calls PMIx_Group.
Simplify this by using PMIx_Group_construct in place of
PMIx_Connect, and have it return the next communicator ID.
This is more scalable and reliable than the prior method.
Retain the "old" method for now as this is new code. Create
a new MCA param "OMPI_MCA_dpm_enable_new_method" to switch
between the two approaches. Default it to "true" for now
for ease of debugging.
NOTE: this includes an update to the submodule pointers
for PMIx and PRRTE to obtain the required updates to
those code bases.
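For anyone testing both code paths, flipping the switch from inside a test program could look like the sketch below. This assumes the parameter behaves like any other MCA variable, i.e. it is picked up from the process environment during init; passing "--mca dpm_enable_new_method false" on the mpirun command line should be equivalent.
/* toggle_dpm_method.c: sketch of selecting the "old" connect/accept path.
 * The parameter name comes from the commit message above; the new
 * PMIx_Group_construct-based path is the default ("true") in this branch. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* Must be set before MPI_Init so the MCA variable system sees it. */
    setenv("OMPI_MCA_dpm_enable_new_method", "false", 1);

    MPI_Init(&argc, &argv);
    /* ... connect/accept or MPI_Comm_spawn test goes here ... */
    MPI_Finalize();
    return 0;
}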
Everything works fine, but the child rank=0 hangs in the MPI_Recv call waiting to get the message from the parent. I can't find the cause of the hang. I've printed out the contents of the communicators and they look fine, and I've checked that we aren't waiting for connection endpoints (at least, I'm not seeing it).
Help is appreciated! Minus the message, this runs through thousands of comm_spawn loops without a problem.
Updated the submodule pointers to track PMIx/PRRTE master changes
@hppritcha The branch has been renamed topic/dpm2 to avoid conflict with another pre-existing branch. Sorry for the confusion.
Just FYI: when running Lisandro's test on a single node, I get the following error message on the first iteration:
--------------------------------------------------------------------------
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.
This attempted connection will be ignored; your MPI job may or may not
continue properly.
Local host: rhc-node01
PID: 500
--------------------------------------------------------------------------
The rest of the iterations run silently. The communicator local and remote groups both look correct, so I'm not sure where OMPI is looking for the peer.
@hppritcha Current status: I have fixed a couple of bugs in the group construct operation regarding modex info storage, and I can now run Lisandro's test to completion. However, if I increase maxnp to greater than 1, then the MPI_Barrier hangs.
So it appears there is still some communication issue once the child job is larger than 1 process. No error or warning messages are being printed, so I'm not sure where to start looking.
Let me know when you have time to look at this and I'll be happy to assist.
BTW: I updated the PMIx/PRRTE submodule pointers so they include the required support for the new dpm connect/accept/spawn method
@hppritcha I believe I know the cause of this last problem and am going to work on it. Meantime, I have opened a PR with the current status so we can see how it performs in CI: https://github.com/open-mpi/ompi/pull/12398