coll: add coll_group to collective interfaces
Pull Request Description
Make all (most) collective algorithms able to work within a subgroup.
- Replace subcomm with lightweight
MPIR_Subgroup MPIR_Subgroupdiffers fromMPIR_Groupas the latter does not live inside a communicator, thus overly complex and inefficient to use.- Communicator is a very complex object that takes on all kinds of tricks such as attributes, hints, cache, context id ... Maintaining sub-communicator compounds these complexities and are expensive to maintain. The goal of this PR is to eventually remove sub-communicators.
- This PR will provide a few examples.
- Use
MPIR_SUBGROUP_NONEin place ofcoll_groupargument will provide backward collective semantics, i.e. the whole communicator collective.
typedef struct MPIR_Subgroup {
int size;
int rank;
int *proc_table;
} MPIR_Subgroup;
mpi_errno = MPIR_Bcast_impl(buffer, count, datatype, root, comm, coll_group, errflag);
^^^^^^^^^^
- Replacing subcomm usage with subgroup
MPIR_Bcast(buf, count, datatype, root, node_comm, MPIR_ERR_NONE);
become
MPIR_Bcast(buf, count, datatype, root, comm, MPIR_SUBGROUP_NODE, MPIR_ERR_NONE);
One of the goals of this PR is to make all mpir- layer intra-collectives coll_group aware.
- inter-collectives only work with
MPIR_SUBGROUP_NONE(no group inter collectives) - compositional algorithms (i.e.
_smp) only work withMPIR_SUBGROUP_NONE, algo selection need make sure not to create recursive compositional situation - All non-compositional intra algorithms should work with all coll_group. So it should not directly access (size,rank)-related fields in the communicator structure. [skip warnings]
Author Checklist
- [x] Provide Description Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
- [x] Commits Follow Good Practice
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short descriptionCommit message explains what's in the commit. - [x] Passes All Tests Whitespace checker. Warnings test. Additional tests via comments.
- [x] Contribution Agreement For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your companies PR approval manager.
test:mpich/ch4/most test:mpich/ch3/most
Only ch4-ofi-shm 42 failures in allgather2
test:mpich/ch4/most test:mpich/ch3/most
test:mpich/ch4/most test:mpich/ch3/most
1 failure - ch4-ucx-external - [coll.01376 - ./coll/reduce 10 MPIR_CVAR_REDUCE_POSIX_INTRA_ALGORITHM=release_gather MPIR_CVAR_COLL_SHM_LIMIT_PER_NODE=131072 MPIR_CVAR_REDUCE_INTRANODE_BUFFER_TOTAL_SIZE=32768 MPIR_CVAR_REDUCE_INTRANODE_NUM_CELLS=4 MPIR_CVAR_REDUCE_INTRANODE_TREE_KVAL=8 MPIR_CVAR_REDUCE_INTRANODE_TREE_TYPE=knomial_2](https://jenkins-pmrs.cels.anl.gov/job/mpich-review-ch4-ucx/3401/jenkins_configure=external,label=ubuntu22.04_review/testReport/junit/(root)/coll/01376_____coll_reduce_10__MPIR_CVAR_REDUCE_POSIX_INTRA_ALGORITHM_release_gather_MPIR_CVAR_COLL_SHM_LIMIT_PER_NODE_131072_MPIR_CVAR_REDUCE_INTRANODE_BUFFER_TOTAL_SIZE_32768_MPIR_CVAR_REDUCE_INTRANODE_NUM_CELLS_4_MPIR_CVAR_REDUCE_INTRANODE_TREE_KVAL_8_MPIR_CVAR_REDUCE_INTRANODE_TREE_TYPE_knomial_2/)
Try get a clean test:
test:mpich/ch4/ucx
MPIR_Subgroupdiffers fromMPIR_Groupas the latter does not live inside a communicator, thus overly complex and inefficient to use.
What is meant by MPIR_Group lives inside a communicator? Groups are independent objects, no? I am starting to wonder why we are adding the complexity of a whole new object and allocation scheme vs using what we have in MPI[R]_Group.
MPIR_Subgroupdiffers fromMPIR_Groupas the latter does not live inside a communicator, thus overly complex and inefficient to use.What is meant by
MPIR_Grouplives inside a communicator? Groups are independent objects, no? I am starting to wonder why we are adding the complexity of a whole new object and allocation scheme vs using what we have inMPI[R]_Group.
MPIR_Group is not associated with any communicator. Unlike MPIR_Subgroup, it's inside a communicator.
When we use MPIR_Group for communication -
- we need to find a communicator for its communication context
- we need translate the
lpidused inMPIR_Groupinto the ranks in that communicator - So it's more cumbersome to use, and easily can confuse developers who are not familiar.
- Also,
MPIR_Grouprepresents and is confined byMPI_Group. Thus it's less flexible to adapt to our internal usages.
test:mpich/ch4/most test:mpich/ch3/most
Only 2 timouts in ch4-ofi-default due to congestions:
datatype.01767 - ./datatype/large_type_sendrec 2 33
coll.00127 - ./coll/gather_big 8
test:mpich/pmi test:mpich/ch4/xpmem
test:mpich/ch4/most test:mpich/ch3/most
@zhenggb72 What is your suggested path forward?
@zhenggb72 What is your suggested path forward?
I don't have much to add. As long as this PR does not change the behavior of the common scenarios, whatever you do to the subgroup is up to you. I don't know much about the motivation and use case of subgroup, but maybe there are a few solutions: for subgroup, you can choose to skip CSEL, and go straight to MPIR auto or fallback algorithms, or you can choose not to prune the tree and use the global tree, if you don't want to prune it.
We are delaying this PR to 4.4. To help merging this PR, we'll need more performance measurement to quantify the memory and performance impact. Also get more input to ensure everyone is on board.
💡 Maybe we can simplify this PR if we are able to create lightweight communicator from a coll_group that shares the context_id from parent.