mpi4py: Run mpi4py tests with a relocated Open MPI installation
Addressing https://github.com/open-mpi/ompi/issues/12349#issuecomment-2090883149.
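For context, this kind of relocation test can be sketched roughly as follows (paths are illustrative, not necessarily what the workflow does): install Open MPI into one prefix, move the installation tree, point OPAL_PREFIX at the new location, and run the mpi4py test suite against it.
# Install into the original prefix, then move the whole tree (illustrative paths)
./configure --prefix=/opt/ompi-orig && make -j install
mv /opt/ompi-orig /opt/ompi-moved
# Tell Open MPI about the new location and run the mpi4py tests against it
export OPAL_PREFIX=/opt/ompi-moved
export PATH="$OPAL_PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$OPAL_PREFIX/lib:$LD_LIBRARY_PATH"
mpiexec -n 2 python test/main.py -v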
Hello! The Git Commit Checker CI bot found a few problems with this PR:
1ca576e9: mpi4py: Run mpi4py tests with a relocated Open MPI...
- check_signed_off: does not contain a valid Signed-off-by line
1163918c: mpi4py: Support for workflow_dispatch trigger
- check_signed_off: does not contain a valid Signed-off-by line
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
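For anyone hitting the same check: the missing Signed-off-by lines can be added by re-signing the two offending commits and force-pushing the branch, for example (branch name illustrative):
# Sign off the last two commits and update the PR branch
git rebase --signoff HEAD~2
git push --force-with-lease origin <pr-branch>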
@wenduwan An obvious problem is that Spawn is still broken for other reasons. Even if you add the label to run the spawn tests, I believe these will fail before we reach the point where relocation is tested.
bot:aws:retest
@wenduwan An obvious problem is that Spawn is still broken for other reasons. Even if you add the label to run the spawn tests, I believe these will fail before we reach the point where relocation is tested.
I'm unsure what error you are referring to here. I tested singleton spawn, and that worked fine. I then ran your loop-spawn program for 500 iterations, testing it with 1-3 processes and spawning 1-3 processes, and that also worked fine. I did have to allow oversubscription when combining 3 procs with spawning 3 procs, or else it would run out of slots after 3 iterations, so it looks like there might be a race condition in recovering resources. Still, with that caveat, spawn is working.
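For reference, a loop-spawn run of that shape can be launched along these lines; --oversubscribe is what lifts the slot limit that was otherwise hit after a few iterations (the program name is illustrative):
# 3 parent procs, each iteration spawning 3 children; allow oversubscription
mpiexec --oversubscribe -n 3 python loop_spawn.py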
Is there something else that is wrong?
I tested singleton spawn,
Indeed, the spawn tests seem to work fine when run in isolation from the other tests. Did you finally merge those changes related to not using accept/connect and moving to something more robust?
Is there something else that is wrong?
Yes, look at the failing mpi4py/run_spawn/mpi4py-tests results [link]. After running all the spawn tests, it looks like ompi's internal state somehow ends up broken, and attempting to create an intercommunicator fails with MPI_ERR_INTERN.
@rhc54 The following command line is the minimal set of tests to trigger the issue:
env MPI4PY_TEST_SPAWN=1 mpiexec -n 2 python test/main.py -v test_comm test_spawn test_ulfm
I'm seeing the following lines in the output:
[kw61149:364081] [[389,1],1] selected pml ob1, but peer [[389,1],0] on kw61149 selected pml ��
[kw61149:364081] OPAL ERROR: Unreachable in file ../../../ompi-main/ompi/communicator/comm.c at line 2372
[kw61149:364081] 0: Error in ompi_comm_get_rprocs
setUpClass (test_ulfm.TestULFMInter) ... ERROR
Did you finally merge these changes related to not using accept/connect and move to something more robust?
Afraid not - been off doing other things, but am returning to these issues now. No timeline for completion.
[kw61149:364081] [[389,1],1] selected pml ob1, but peer [[389,1],0] on kw61149 selected pml ��
Odd - I'm pretty sure we fixed that one. It took me some time and help from a user that could reproduce it to track it down, and the fix went into PMIx master two weeks ago.
Checking the state of OMPI's PMIx submodule, it is (a) out-of-date by more than a month, and therefore missing that fix, and (b) in some odd detached state. It's almost like someone cherry-picked some change into it rather than updating the submodule pointer?
Anyway, the problem is indeed fixed - just not over there for some reason.
I'll push a submodule update to this PR shortly.
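For reference, a submodule bump of that sort typically looks something like the following, assuming the PMIx submodule lives under 3rd-party/openpmix (the commit hash is a placeholder):
# Point the PMIx submodule at the commit carrying the fix
cd 3rd-party/openpmix
git fetch origin && git checkout <commit-with-fix>
cd ../..
# Record the new submodule pointer in the superproject
git add 3rd-party/openpmix
git commit -s -m "Update PMIx submodule pointer"
git push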
@janjust do you have bandwidth to update this PR? I can do it otherwise.
Turns out I don't have permission to push to the PR. Opened #12532 and will rebase this one.
Howard leapfrogged me with https://github.com/open-mpi/ompi/pull/12565; we will likely merge that instead.
@dalcinl Could you please rebase the PR?
Hats off to @hppritcha for chasing down the problem(s).
@wenduwan If everything goes right, then maybe we should get rid of all the stuff related to running the spawn tests conditionally? What do you think? Should we do it now in this PR, or would you prefer that I submit another one?
Yes, you are right. We are doing it in https://github.com/open-mpi/ompi/pull/12591.
You have helped a lot. I will rebase #12591 after merging this PR.