
3rd-party:pmix advance sha to e32e0179bc

Open hppritcha opened this issue 1 year ago • 14 comments

advance sha to e32e0179bc.

related to #12532
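
For anyone reproducing this locally, the submodule bump amounts to something like the sketch below (the 3rd-party/openpmix path is an assumption about where the PMIx submodule lives in this tree):

# check out the target PMIx commit inside the submodule, then record the new pointer
cd 3rd-party/openpmix
git fetch origin
git checkout e32e0179bc
cd ../..
git add 3rd-party/openpmix
git commit -m "3rd-party/openpmix: advance sha to e32e0179bc"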

hppritcha avatar May 21 '24 18:05 hppritcha

A test PR. I could not reproduce the mpi4py failures when doing standalone testing of the pmix sha advance part of #12532.
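
For context, the failing mpi4py tests exercise dynamic process spawning; a rough standalone check of the same pattern (not the actual mpi4py suite invocation; the embedded child command is ad hoc) looks something like:

mpiexec -n 1 python -c "
from mpi4py import MPI
import sys
# parent spawns two children on MPI_COMM_SELF, then both sides disconnect
child = MPI.COMM_SELF.Spawn(sys.executable,
                            args=['-c', 'from mpi4py import MPI; MPI.Comm.Get_parent().Disconnect()'],
                            maxprocs=2)
child.Disconnect()
"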

hppritcha avatar May 21 '24 18:05 hppritcha

@wenduwan could you check this PR with AWS internal CI?

hppritcha avatar May 21 '24 18:05 hppritcha

Thank you. I'm running our CI.

wenduwan avatar May 21 '24 18:05 wenduwan

@hppritcha Sorry I forgot to report back. Our CI passed. 🙂

wenduwan avatar May 23 '24 15:05 wenduwan

Hmm... so why is this PR passing CI and not #12532 ?

hppritcha avatar May 23 '24 15:05 hppritcha

@hppritcha Just curious - can you push again to run the mpi4py tests?

wenduwan avatar May 23 '24 16:05 wenduwan

You mean on #12532? I wouldn't be surprised if it still fails, since you also switched back to prrte upstream master.

hppritcha avatar May 23 '24 16:05 hppritcha

@hppritcha Wow... first time ever that all mpi4py spawn tests seem to work. @wenduwan Is this submod pointer update expected to land in branch v5.0.x?

dalcinl avatar May 24 '24 14:05 dalcinl

Is this submod pointer update expected to land in branch v5.0.x?

@dalcinl We have to figure something out for 5.0.x. The release branch has to track an upstream release/release candidate.

wenduwan avatar May 24 '24 14:05 wenduwan

I think it's passing the spawn tests because the 3rd commit disables use of the PMIX gds shmem2 component.
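
(As a rough runtime approximation of what that commit does, rather than what it actually changes, one could steer PMIx away from shmem via an MCA parameter; the parameter and component names below are assumptions:)

# force the PMIx gds framework to the hash component only (hypothetical knob)
export PMIX_MCA_gds=hash
# or exclude just the shmem2 component
export PMIX_MCA_gds=^shmem2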

hppritcha avatar May 24 '24 17:05 hppritcha

I think it's passing the spawn tests because the 3rd commit disables use of the PMIX gds shmem2 component.

That seems difficult to believe - can you remove that commit and prove that statement? Would be good to know if true as that is (as noted) a commit with significant negative consequences.

rhc54 avatar May 24 '24 18:05 rhc54

Undo the effect of the third commit and see what happens...

hppritcha avatar May 24 '24 21:05 hppritcha

Thanks @hppritcha ! Looks like there is some connection, but I'm darned if I can understand it:

testCommSpawn (test_spawn.TestSpawnMultipleSelf.testCommSpawn) ... [fv-az1272-211:00000] *** An error occurred in Socket closed
[fv-az1272-211:00000] *** reported by process [2538078209,3]
[fv-az1272-211:00000] *** on a NULL communicator
[fv-az1272-211:00000] *** Unknown error
[fv-az1272-211:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[fv-az1272-211:00000] ***    and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.

  Local host: fv-az1272-211
  Local PID:  3624
  Peer host:  unknown

Looks like the issue is in the MPI layer??

rhc54 avatar May 24 '24 22:05 rhc54

Ohhhh...that's interesting! It failed this time, even with the shmem disabled. Looks like some kind of race condition is in play. Sigh - would be nice if we had a definitive answer. This doesn't contain the corresponding PRRTE fix for the modex, so that might be part of it. Might have to wait for the grpcomm rework before we have a real answer.

rhc54 avatar May 24 '24 22:05 rhc54

Ran CI 10 times with 95d8c10 without a failure in the spawn tests. Added another commit to pretty up the code change, but it should not impact its behavior.

hppritcha avatar May 28 '24 21:05 hppritcha

Does everything still pass if you remove the hash restriction?

rhc54 avatar May 28 '24 21:05 rhc54

Ah good - the PMIx shmem vs hash has nothing to do with it, which is pretty much as expected. So it looks like the problem has been in the async exit from the function, meaning that messages could be sent before the receiver was ready for them. Thanks for tracking it down @hppritcha !

rhc54 avatar May 29 '24 18:05 rhc54

Okay, looks like the pmix gds was a red herring. I'll clean this PR back down to just the single sha update commit. First a rebase though, after #12588 is merged into main.

hppritcha avatar May 29 '24 20:05 hppritcha