3rd-party:pmix advance sha to e32e0179bc
advance sha to e32e0179bc.
related to #12532
A test PR. I could not reproduce the mpi4py failures when doing standalone testing of the PMIx sha advance part of #12532.
@wenduwan could you check this PR with AWS internal CI?
Thank you. I'm running our CI.
@hppritcha Sorry I forgot to report back. Our CI passed. 🙂
Hmm... so why is this PR passing CI and not #12532 ?
@hppritcha Just curious - can you push again to run the mpi4py tests?
You mean on #12532? I wouldn't be surprised if it still fails, since you also switched back to PRRTE upstream master.
@hppritcha Wow... first time ever that all mpi4py spawn tests seem to work. @wenduwan Is this submod pointer update expected to land in branch v5.0.x?
Is this submod pointer update expected to land in branch v5.0.x?
@dalcinl We have to figure something out for 5.0.x. The release branch has to track an upstream release/release candidate.
I think it's passing the spawn tests because the 3rd commit disables use of the PMIx gds shmem2 component.
That seems difficult to believe - can you remove that commit and prove that statement? Would be good to know if true as that is (as noted) a commit with significant negative consequences.
Undo the effect of the third commit and see what happens...
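For what it's worth, the same code path can usually be exercised without a rebuild by steering PMIx to the hash gds component at runtime. A minimal sketch in Python, assuming the installed PMIx honors its standard PMIX_MCA_gds environment variable and that setting it before mpi4py initializes MPI is early enough for the client side (the daemons would still need it exported in the launch environment):

```python
# Sketch of steering the PMIx gds framework away from shmem2 at runtime
# instead of disabling it at build time. Assumes the installed PMIx honors
# the PMIX_MCA_gds environment variable and that setting it here, before
# mpi4py initializes MPI, is early enough for the client side.
import os
os.environ.setdefault("PMIX_MCA_gds", "hash")   # select only the hash component

from mpi4py import MPI   # import after setting the variable so PMIx sees it
if MPI.COMM_WORLD.Get_rank() == 0:
    print("running with PMIX_MCA_gds =", os.environ["PMIX_MCA_gds"])
```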
Thanks @hppritcha ! Looks like there is some connection, but I'm darned if I can understand it:
testCommSpawn (test_spawn.TestSpawnMultipleSelf.testCommSpawn) ... [fv-az1272-211:00000] *** An error occurred in Socket closed
[fv-az1272-211:00000] *** reported by process [2538078209,3]
[fv-az1272-211:00000] *** on a NULL communicator
[fv-az1272-211:00000] *** Unknown error
[fv-az1272-211:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[fv-az1272-211:00000] *** and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: fv-az1272-211
Local PID: 3624
Peer host: unknown
Looks like the issue is in the MPI layer??
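For context, the failing TestSpawnMultipleSelf case exercises roughly the following parent/child pattern over MPI_COMM_SELF (a simplified sketch, not the actual mpi4py test code; the child invocation and the barrier/disconnect sequence are illustrative assumptions about where the abort surfaces):

```python
# Simplified sketch of the dynamic-process pattern the failing spawn tests
# exercise. Run the parent as a singleton under the launcher, e.g.
# "mpiexec -n 1 python this_script.py".
import sys
from mpi4py import MPI

def parent():
    cmds = [sys.executable] * 2            # two groups of children
    args = [[__file__, "child"]] * 2
    maxprocs = [1, 2]
    inter = MPI.COMM_SELF.Spawn_multiple(cmds, args, maxprocs)
    inter.Barrier()        # the CI aborts ("Socket closed") surface around
    inter.Disconnect()     # this handshake between parent and children

def child():
    inter = MPI.Comm.Get_parent()
    inter.Barrier()
    inter.Disconnect()

if __name__ == "__main__":
    child() if "child" in sys.argv[1:] else parent()
```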
Ohhhh...that's interesting! It failed this time, even with the shmem disabled. Looks like some kind of race condition is in play. Sigh - would be nice if we had a definitive answer. This doesn't contain the corresponding PRRTE fix for the modex, so that might be part of it. Might have to wait for the grpcomm rework before we have a real answer.
Ran CI 10 times with 95d8c10 without a failure in the spawn methods. Added another commit to pretty up the code change, but it should not impact its behavior.
Does everything still pass if you remove the hash restriction?
Ah good - the PMIx shmem vs hash has nothing to do with it, which is pretty much as expected. So it looks like the problem has been in the async exit from the function, meaning that messages could be sent before the receiver was ready for them. Thanks for tracking it down @hppritcha !
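As a schematic illustration of that kind of race (plain Python sockets, not the actual PRRTE/grpcomm code): if a process signals readiness before its receive path is actually set up, a fast peer can try to connect and send first and fail, which is the sort of thing that surfaces as the "Socket closed" aborts above:

```python
# Schematic sketch of "readiness announced before the receiver is ready".
# Not PRRTE code; just the generic failure mode described in the thread.
import socket
import threading
import time

ready = threading.Event()
PORT = 56789  # arbitrary port for the sketch

def receiver():
    ready.set()                    # BUG: readiness signalled before listening
    time.sleep(0.5)                # the "async exit": setup completes later
    srv = socket.socket()
    srv.bind(("127.0.0.1", PORT))
    srv.listen(1)
    srv.settimeout(2.0)
    try:
        conn, _ = srv.accept()
        print("receiver got:", conn.recv(64))
        conn.close()
    except socket.timeout:
        print("receiver: nobody connected (peer already gave up)")
    finally:
        srv.close()

def sender():
    ready.wait()                   # peer believes the receiver is ready
    try:
        with socket.create_connection(("127.0.0.1", PORT), timeout=1.0) as s:
            s.sendall(b"early message")
    except OSError as exc:
        print("sender failed:", exc)   # analogue of the "Socket closed" abort

t = threading.Thread(target=receiver)
t.start()
sender()
t.join()
```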
Okay, looks like the PMIx gds was a red herring. I'll clean this PR back down to just the single sha update commit. First I'll rebase, though, after #12588 is merged into main.