
mpi4py: Run mpi4py tests with a relocated Open MPI installation

Open dalcinl opened this issue 1 year ago • 11 comments

Addressing https://github.com/open-mpi/ompi/issues/12349#issuecomment-2090883149.

dalcinl avatar May 06 '24 17:05 dalcinl

Hello! The Git Commit Checker CI bot found a few problems with this PR:

1ca576e9: mpi4py: Run mpi4py tests with a relocated Open MPI...

  • check_signed_off: does not contain a valid Signed-off-by line

1163918c: mpi4py: Support for workflow_dispatch trigger

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

github-actions[bot] avatar May 06 '24 17:05 github-actions[bot]

@wenduwan An obvious problem is that Spawn is still broken for other reasons. Even if you add the label to run the spawn tests, I believe these will fail before we reach the point where relocation is tested.

dalcinl avatar May 06 '24 18:05 dalcinl

bot:aws:retest

wenduwan avatar May 06 '24 19:05 wenduwan

@wenduwan An obvious problem is that Spawn is still broken for other reasons. Even if you add the label to run the spawn tests, I believe these will fail before we reach the point where relocation is tested.

I'm unsure what error you are referring to here. I tested singleton spawn, and that worked fine. I then ran your loop spawn program for 500 iterations, testing it with 1-3 processes and spawning 1-3 processes - and that worked fine. I did have to allow oversubscription when combining 3 procs with spawning 3 procs, or else it would run out of slots after 3 iterations, so it looks like there might be a race condition on recovering resources. Still, with that caveat, spawn is working.

Is there something else that is wrong?

rhc54 avatar May 06 '24 22:05 rhc54

I tested singleton spawn,

Indeed, the spawn tests, when run in isolation from the other tests, seem to be working fine. Did you finally merge these changes related to not using accept/connect and move to something more robust?

Is there something else that is wrong?

Yes, look at the failing mpi4py/run_spawn/mpi4py-tests results [link]. After running all the spawn tests, it looks like ompi's internal state ends up broken somehow, and attempting to create an intercommunicator fails with MPI_ERR_INTERN.

dalcinl avatar May 07 '24 06:05 dalcinl

@rhc54 The following command line is the minimal set of tests to trigger the issue:

env MPI4PY_TEST_SPAWN=1 mpiexec -n 2 python test/main.py -v test_comm test_spawn test_ulfm

I'm seeing the following lines in the output:

[kw61149:364081] [[389,1],1] selected pml ob1, but peer [[389,1],0] on kw61149 selected pml ��
[kw61149:364081] OPAL ERROR: Unreachable in file ../../../ompi-main/ompi/communicator/comm.c at line 2372
[kw61149:364081] 0: Error in ompi_comm_get_rprocs
setUpClass (test_ulfm.TestULFMInter) ... ERROR

dalcinl avatar May 07 '24 07:05 dalcinl

Did you finally merge these changes related to not using accept/connect and move to something more robust?

Afraid not - been off doing other things, but am returning to these issues now. No timeline for completion.

[kw61149:364081] [[389,1],1] selected pml ob1, but peer [[389,1],0] on kw61149 selected pml ��

Odd - I'm pretty sure we fixed that one. It took me some time and help from a user that could reproduce it to track it down, and the fix went into PMIx master two weeks ago.

Checking the state of OMPI's PMIx submodule, it is (a) out-of-date by more than a month, and therefore missing that fix, and (b) in some odd detached state. It's almost like someone cherry-picked some change into it rather than updating the submodule pointer?
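For anyone wanting to check this locally, something along these lines should show (and fix) the submodule state; a sketch only, and the 3rd-party/openpmix path and branch name are my assumptions about the repo layout:

```shell
# Sketch: inspect and update the PMIx submodule pointer in an ompi clone.
cd /path/to/ompi || exit 0                  # adjust to your checkout
git submodule status 3rd-party/openpmix     # a '+' prefix means the checkout differs from the recorded commit
git -C 3rd-party/openpmix log --oneline -3  # what is actually checked out
# Point the submodule at current PMIx master and record the new pointer:
git -C 3rd-party/openpmix fetch origin
git -C 3rd-party/openpmix checkout origin/master
git add 3rd-party/openpmix
git commit -s -m "Update PMIx submodule pointer"
```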

Anyway, the problem is indeed fixed - just not over there for some reason.

rhc54 avatar May 07 '24 12:05 rhc54

I'll push a submodule update to this PR shortly.

janjust avatar May 07 '24 15:05 janjust

@janjust do you have bandwidth to update this PR? I can do it otherwise.

wenduwan avatar May 08 '24 18:05 wenduwan

Turns out I don't have permission to push to the PR. Opened #12532 and will rebase this one.

wenduwan avatar May 08 '24 18:05 wenduwan

Howard leapfrogged me with https://github.com/open-mpi/ompi/pull/12565; we will likely merge that instead.

wenduwan avatar May 23 '24 20:05 wenduwan

@dalcinl Could you please rebase the PR?

Hats off to @hppritcha for chasing down the problem(s).

wenduwan avatar May 30 '24 19:05 wenduwan

@wenduwan If everything goes right, then maybe we should get rid of all the stuff related to running spawn tests conditionally? What do you think? Should we do it now in this PR, or would you rather prefer I submit another one?

dalcinl avatar May 31 '24 10:05 dalcinl

Yes, you are right. We are doing it in https://github.com/open-mpi/ompi/pull/12591

wenduwan avatar May 31 '24 15:05 wenduwan

You have helped a lot. I will rebase #12591 after merging this PR.

wenduwan avatar May 31 '24 15:05 wenduwan