sessions: add support for ucx and more
Greatly simplify support for MPI_Comm_create_from_group and MPI_Intercomm_create_from_groups by removing the need to support the 128-bit excid notion.
Instead, make use of a PMIx capability - PMIX_GROUP_LOCAL_CID and the notion of PMIX_GROUP_INFO - introduced in Open PMIx 4.1.3. This allows us to piggy-back the local cid selected for the new communicator on the PMIx_Group_construct operation. With this approach, much of the complex active-message-style machinery implemented in the OB1 PML to support excids can be avoided.
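Roughly, the idea looks like this at the PMIx level (an illustrative sketch only, not the code in the diff - the helper name and the exact directive layout are made up for the example):

```c
/* Rough sketch (not the code in this PR): each rank picks a local cid for
 * the new communicator and attaches it to the group construct, so no
 * separate cid exchange is needed afterwards. */
#include <stdbool.h>
#include <pmix.h>

static int construct_group_with_local_cid(const char *grp_id,
                                          const pmix_proc_t *procs, size_t nprocs,
                                          size_t my_local_cid)
{
    pmix_info_t *directives = NULL;
    pmix_info_t *results = NULL;
    size_t nresults = 0;
    bool assign = true;
    pmix_status_t rc;

    PMIX_INFO_CREATE(directives, 2);
    /* ask the RTE to assign a context ID to the group ... */
    PMIX_INFO_LOAD(&directives[0], PMIX_GROUP_ASSIGN_CONTEXT_ID, &assign, PMIX_BOOL);
    /* ... and piggy-back the cid this rank selected for the communicator */
    PMIX_INFO_LOAD(&directives[1], PMIX_GROUP_LOCAL_CID, &my_local_cid, PMIX_SIZE);

    rc = PMIx_Group_construct(grp_id, procs, nprocs,
                              directives, 2, &results, &nresults);

    /* results carry the group context ID plus the exchanged modex data */
    PMIX_INFO_FREE(directives, 2);
    if (NULL != results) {
        PMIX_INFO_FREE(results, nresults);
    }
    return (PMIX_SUCCESS == rc) ? 0 : 1;
}
```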
Infrastructure for debugging communicator management routines was also introduced, along with a new MCA parameter - mpi_comm_verbose.
Related to #12566
@rhc54 could you check the PMIx usage here?
Running AWS CI
hmmm that coredump is coming from prrte -
Core was generated by `prterun --map-by :OVERSUBSCRIBE -np 4 python3 ./main.py -v -f'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000ffffb0a9c7b8 in prte_grpcomm_base_get_tracker (sig=0xffffa83e1ae0, create=true) at base/grpcomm_base_stubs.c:313
313 if (coll->dmns[n] == PRTE_PROC_MY_NAME->rank) {
[Current thread is 1 (Thread 0xffffb0c28f40 (LWP 3683323))]
(gdb) bt
#0 0x0000ffffb0a9c7b8 in prte_grpcomm_base_get_tracker (sig=0xffffa83e1ae0, create=true) at base/grpcomm_base_stubs.c:313
#1 0x0000ffffb0a9bac4 in allgather_stub (fd=-1, args=4, cbdata=0xffffa80df400) at base/grpcomm_base_stubs.c:159
#2 0x0000ffffb04ee628 in event_process_active_single_queue () from /lib64/libevent_core-2.1.so.6
#3 0x0000ffffb04ef290 in event_base_loop () from /lib64/libevent_core-2.1.so.6
#4 0x0000000000408fd4 in main (argc=9, argv=0xfffff697ac38) at prte.c:1297
(gdb) list
308
309 /* see if I am in the array of participants - note that I may
310 * be in the rollup tree even though I'm not participating
311 * in the collective itself */
312 for (n = 0; n < coll->ndmns; n++) {
313 if (coll->dmns[n] == PRTE_PROC_MY_NAME->rank) {
314 coll->nexpected++;
315 break;
316 }
317 }
(gdb) print n
$1 = 0
(gdb) print coll
$2 = (prte_grpcomm_coll_t *) 0x137b0da0
(gdb) print coll->ndmns
$3 = 1
(gdb) print coll->dmns
$4 = (pmix_rank_t *) 0x0
(gdb)
hmmm that coredump is coming from prrte -
Not surprising - we know that the group construct code in PRRTE master has problems. I've been working on it and am almost done with the fix, but am getting pulled in multiple directions right now. So it is sitting off to the side for now.
I'll try to look thru the PMIx stuff as time permits.
I did some digging into this, and it turns out that Open MPI is feeding a corrupted proc list into PMIx_Group_construct for at least one rank. That ends up causing problems in prte and ultimately the segfault. Not sure how PMIx could validate that kind of input.
You mean one proc in the array has a bad nspace? Yeah, I'm not sure how we check for a bad string, though we do know the max size of the nspace - so we should not be charging off into nowhere land. PMIX_CHECK_NSPACE has the protection for that much - could be we have a place in PRRTE that isn't using it.
A namespace sanity check in the PRRTE group construct path might have caught this, but I'm not sure.
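Something along these lines is what I mean (a minimal sketch, not PRRTE code; the helper name is made up):

```c
/* Illustrative sketch only: reject any participant whose nspace field is
 * empty or not properly terminated before handing the array to the group
 * construct machinery. */
#include <stdbool.h>
#include <string.h>
#include <pmix.h>

static bool procs_look_sane(const pmix_proc_t *procs, size_t nprocs)
{
    for (size_t n = 0; n < nprocs; n++) {
        /* the nspace field is bounded by PMIX_MAX_NSLEN, so strnlen cannot
         * run off the end even if the string was corrupted */
        size_t len = strnlen(procs[n].nspace, PMIX_MAX_NSLEN + 1);
        if (0 == len || len > PMIX_MAX_NSLEN) {
            return false;
        }
    }
    return true;
}
```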
FYI AWS CI passed for this PR
@wenduwan could you rerun AWS CI?
@wenduwan when you have a chance could you check changes per your comments?
@janjust please review - esp. UCX part when you get a chance
- we now have a critical component depending on software we have no control over
The dependency has been true for a very long time, actually - OMPI doesn't run at all without PMIx and hasn't for years. In particular for this PR, PMIx has been in the path for connect/accept for years, and the collective used here is the same as the collective currently in that path. The only difference being gradually introduced is to replace the multiple calls to PMIx functions combined with a bunch of MPI ops with a single call to the PMIx group op.
So you should eventually see a performance improvement, although it may take another PR step to get it as I withdrew the PR that consolidated to a single op. This just takes an initial step in that direction.
As for control - OMPI continues to have significant (in some opinions, undue) influence over PMIx. As noted elsewhere, the only real "disconnect" has been the lack of code contribution by OMPI members, which means that things like a faster "allgather" in PRRTE don't get implemented - not due to refusal, but simply due to lack of contribution.
Just for clarification - we only use the PMIx_Group_construct method for communicators being created using MPI_Comm_create_from_group and MPI_Intercomm_create_from_groups. Fortunately for us, the mpi4py testsuite has a number of tests which exercise this functionality.
I have seen the question of the size of the context ID raised several times, so let me provide a little perspective on it. In PMIx, the context ID for a group is specified to be size_t. Of course, the actual size of the value being returned is set by the RTE since it is solely responsible for assigning that value. In the case of PRRTE, the variable tracking context IDs has a uint32_t definition.
@hppritcha had previously (a long time ago, now) requested that PRRTE start assigning values at UINT32_MAX and count downwards from there. IIRC, this was to avoid conflict with OMPI's internal CID assignment code that started at 0 (for MPI_COMM_WORLD) and counted up.
I don't really understand all the bit manipulation going on here, so I am probably missing something - but it strikes me that if you are limiting the context ID to something less than 32-bits, and using the CID returned by PMIx, then you have a problem.
Okay sorry for not getting back to this thread sooner. Just very busy in my ompi time with other things at the moment.
There's a very important PML parameter that PML components are supposed to initialize - pml_max_contextid - before being used for sending/receiving communicator-based messages.
There's also a variant of this for the MTL components.
Anyway, this quantity is used to control the maximum value of c_index, which is the subject of the comments above.
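To make the constraint concrete, something like this (hypothetical names and values, not actual OMPI code):

```c
/* Tiny illustration of the constraint: whatever local index a process
 * selects for a new communicator has to fit under the limit the PML or
 * MTL advertises via pml_max_contextid. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t pml_max_contextid = (1u << 24) - 1;  /* e.g. a transport with 24 usable CID bits */
    uint32_t candidate_c_index = 42;              /* locally chosen index for the new communicator */

    if (candidate_c_index > pml_max_contextid) {
        fprintf(stderr, "c_index %u exceeds pml_max_contextid %u\n",
                candidate_c_index, pml_max_contextid);
        return 1;
    }
    printf("c_index %u is usable\n", candidate_c_index);
    return 0;
}
```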
Also, the PMIx group id is now only used to extract modex data, as can be seen in the code changes in this PR. The excid that Nathan developed (which used the pmix group id in a 128-bit extended context id) is not really used now except in OB1. Eventually we'll remove that, but not as part of 6.0.x. It was one of those ideas that's good in theory, but in the general case of external libraries providing the lower-level APIs for moving data around, it's not really maintainable. The excid approach is described in https://doi.org/10.1109/CLUSTER.2019.8891002
Sooo...you are telling me that I went thru all the trouble of adding the ability for PMIx and PRRTE to provide context IDs as part of the group operation that returns modex data - for nothing? You'd rather not only do the group op, but also do a kluge multi-stage operation to create the ID??
Huh? Look at the code changes in this PR for file ompi/communicator/comm_cid.c. There you'll see we do indeed use the pmix group context id to look up the local cid for the target receiver. We are now just looking up the receiver's local cid in a different way than Nathan's difficult-to-maintain (much less implement in PMLs besides OB1) out-of-band bootstrap method.
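In sketch form, that lookup goes roughly like this (illustrative only - comm_cid.c in the diff is the real thing, and the helper name here is made up):

```c
/* Illustrative sketch of the lookup pattern: given the group's context ID,
 * ask PMIx for the local cid the target peer selected for this communicator. */
#include <pmix.h>

static int get_peer_local_cid(const pmix_proc_t *peer, size_t group_ctxid,
                              size_t *peer_local_cid)
{
    pmix_info_t qualifier;
    pmix_value_t *val = NULL;
    pmix_status_t rc;

    /* qualify the Get with the group context ID so PMIx hands back the cid
     * the peer attached when this group was constructed */
    PMIX_INFO_LOAD(&qualifier, PMIX_GROUP_CONTEXT_ID, &group_ctxid, PMIX_SIZE);

    rc = PMIx_Get(peer, PMIX_GROUP_LOCAL_CID, &qualifier, 1, &val);
    PMIX_INFO_DESTRUCT(&qualifier);
    if (PMIX_SUCCESS != rc || NULL == val) {
        return 1;
    }

    *peer_local_cid = val->data.size;   /* PMIX_SIZE payload */
    PMIX_VALUE_RELEASE(val);
    return 0;
}
```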
Okay, @hppritcha - I confess to confusion. Your statement was:
The excid that Nathan developed (which used pmix group id in a 128 bit extended context id) is not really used now except in OB1. Eventually we'll remove that but not as part of 6.0.x. It was one of those ideas that's good in theory, but in the general case of external libraries providing the lower level APIs for moving data round, it's not really maintainable.
I was reacting to this statement as it implied (as I read it) that the use of PMIx to generate a context ID was something to be removed, since PMIx is the only "external library" involved (to the best of my knowledge). I confess I haven't read thru this PR to try and fully grok all the changes, so if I have misunderstood, I apologize. I do see the use of PMIx group ops - I just took it that you intended to remove those at some point post-6.0.
AFAIK, the group operation is working correctly in terms of generating IDs. I know there is a problem in PRRTE/PMIx master branches (and in their release branches) regarding correctly collecting all required info to be exchanged. I've been working on it, albeit slowly. Had to step away for awhile, but am now gradually attempting to restart it.
@janjust please review when you have a chance. I think I've addressed the various items noted by the NVIDIA reviewers.
Any objections or suggestions for changes?
@yosefe could you check now?
Not sure why the Read the Docs check failed.
@yosefe ping...
I want this to be clearly stated: this functionality fulfills the needs of two newly added MPI communicator creation functions (MPI_Comm_create_from_group and MPI_Intercomm_create_from_groups) by replacing an existing, AM-based implementation with a more generic, PMIx-based one. However, this new approach loses in scalability what it gains in generality: we are trading an RTT to retrieve the comm_id for each pair of peers that communicate for a full-scale allgather done not even in the context of OMPI but in PMIx.
Moreover, no performance results were provided to assess the cost of this change. We have all heard the argument that someone in the future might try to optimize this particular operation in PMIx. Irrelevant! As soon as this becomes the only way to create such communicators, we will be relying on an unoptimized PMIx capability that is far from the level of optimization we would expect from MPI-level collectives.
Errrr...just to be clear, the collective is in PRRTE and has absolutely nothing to do with PMIx. All PMIx does is locally collect the info and pass it to the local PRRTE daemon.
@yosefe @janjust could one of you check the changes I made per yosefe's request?
Actually, this PR doesn't replace the AM-based approach used in OB1 - that path is still there. This PR just enables the UCX PML to work with communicators created using MPI_Comm_create_from_group and MPI_Intercomm_create_from_groups. If at some point someone in the UCX community wants to go to the effort of adding an AM path in the UCX PML, that could be done.
Not from OB1, but it removes it from the OFI MTL.
Okay, I'll put the MTL code back in.