mpich icon indicating copy to clipboard operation
mpich copied to clipboard

coll: add coll_group to collective interfaces

Open hzhou opened this issue 1 year ago • 13 comments

Pull Request Description

Make all (most) collective algorithms able to work within a subgroup.

  • Replace subcomm with lightweight MPIR_Subgroup
  • MPIR_Subgroup differs from MPIR_Group as the latter does not live inside a communicator, thus overly complex and inefficient to use.
  • Communicator is a very complex object that takes on all kinds of tricks such as attributes, hints, cache, context id ... Maintaining sub-communicator compounds these complexities and are expensive to maintain. The goal of this PR is to eventually remove sub-communicators.
  • This PR will provide a few examples.
  • Use MPIR_SUBGROUP_NONE in place of coll_group argument will provide backward collective semantics, i.e. the whole communicator collective.
typedef struct MPIR_Subgroup {
    int size;
    int rank;
    int *proc_table;
} MPIR_Subgroup; 
mpi_errno = MPIR_Bcast_impl(buffer, count, datatype, root, comm, coll_group, errflag);
                                                                 ^^^^^^^^^^
  • Replacing subcomm usage with subgroup
MPIR_Bcast(buf, count, datatype, root, node_comm, MPIR_ERR_NONE);

become

MPIR_Bcast(buf, count, datatype, root, comm, MPIR_SUBGROUP_NODE, MPIR_ERR_NONE);

One of the goals of this PR is to make all mpir- layer intra-collectives coll_group aware.

  • inter-collectives only work with MPIR_SUBGROUP_NONE (no group inter collectives)
  • compositional algorithms (i.e. _smp) only work with MPIR_SUBGROUP_NONE, algo selection need make sure not to create recursive compositional situation
  • All non-compositional intra algorithms should work with all coll_group. So it should not directly access (size,rank)-related fields in the communicator structure. [skip warnings]

Author Checklist

  • [x] Provide Description Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • [x] Commits Follow Good Practice Commits are self-contained and do not do two things at once. Commit message is of the form: module: short description Commit message explains what's in the commit.
  • [x] Passes All Tests Whitespace checker. Warnings test. Additional tests via comments.
  • [x] Contribution Agreement For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your companies PR approval manager.

hzhou avatar Aug 16 '24 23:08 hzhou

test:mpich/ch4/most test:mpich/ch3/most

Only ch4-ofi-shm 42 failures in allgather2

hzhou avatar Aug 23 '24 03:08 hzhou

test:mpich/ch4/most test:mpich/ch3/most

hzhou avatar Aug 24 '24 22:08 hzhou

test:mpich/ch4/most test:mpich/ch3/most

1 failure - ch4-ucx-external - [coll.01376 - ./coll/reduce 10 MPIR_CVAR_REDUCE_POSIX_INTRA_ALGORITHM=release_gather MPIR_CVAR_COLL_SHM_LIMIT_PER_NODE=131072 MPIR_CVAR_REDUCE_INTRANODE_BUFFER_TOTAL_SIZE=32768 MPIR_CVAR_REDUCE_INTRANODE_NUM_CELLS=4 MPIR_CVAR_REDUCE_INTRANODE_TREE_KVAL=8 MPIR_CVAR_REDUCE_INTRANODE_TREE_TYPE=knomial_2](https://jenkins-pmrs.cels.anl.gov/job/mpich-review-ch4-ucx/3401/jenkins_configure=external,label=ubuntu22.04_review/testReport/junit/(root)/coll/01376_____coll_reduce_10__MPIR_CVAR_REDUCE_POSIX_INTRA_ALGORITHM_release_gather_MPIR_CVAR_COLL_SHM_LIMIT_PER_NODE_131072_MPIR_CVAR_REDUCE_INTRANODE_BUFFER_TOTAL_SIZE_32768_MPIR_CVAR_REDUCE_INTRANODE_NUM_CELLS_4_MPIR_CVAR_REDUCE_INTRANODE_TREE_KVAL_8_MPIR_CVAR_REDUCE_INTRANODE_TREE_TYPE_knomial_2/)

hzhou avatar Aug 25 '24 01:08 hzhou

Try get a clean test:

test:mpich/ch4/ucx

hzhou avatar Aug 26 '24 15:08 hzhou

  • MPIR_Subgroup differs from MPIR_Group as the latter does not live inside a communicator, thus overly complex and inefficient to use.

What is meant by MPIR_Group lives inside a communicator? Groups are independent objects, no? I am starting to wonder why we are adding the complexity of a whole new object and allocation scheme vs using what we have in MPI[R]_Group.

raffenet avatar Sep 06 '24 13:09 raffenet

  • MPIR_Subgroup differs from MPIR_Group as the latter does not live inside a communicator, thus overly complex and inefficient to use.

What is meant by MPIR_Group lives inside a communicator? Groups are independent objects, no? I am starting to wonder why we are adding the complexity of a whole new object and allocation scheme vs using what we have in MPI[R]_Group.

MPIR_Group is not associated with any communicator. Unlike MPIR_Subgroup, it's inside a communicator.

When we use MPIR_Group for communication -

  1. we need to find a communicator for its communication context
  2. we need translate the lpid used in MPIR_Group into the ranks in that communicator
  3. So it's more cumbersome to use, and easily can confuse developers who are not familiar.
  4. Also, MPIR_Group represents and is confined by MPI_Group. Thus it's less flexible to adapt to our internal usages.

hzhou avatar Sep 06 '24 14:09 hzhou

test:mpich/ch4/most test:mpich/ch3/most

Only 2 timouts in ch4-ofi-default due to congestions:

datatype.01767 - ./datatype/large_type_sendrec 2 33 
coll.00127 - ./coll/gather_big 8

hzhou avatar Sep 12 '24 23:09 hzhou

test:mpich/pmi test:mpich/ch4/xpmem

hzhou avatar Sep 14 '24 14:09 hzhou

test:mpich/ch4/most test:mpich/ch3/most

hzhou avatar Nov 03 '24 23:11 hzhou

@zhenggb72 What is your suggested path forward?

hzhou avatar Nov 11 '24 14:11 hzhou

@zhenggb72 What is your suggested path forward?

I don't have much to add. As long as this PR does not change the behavior of the common scenarios, whatever you do to the subgroup is up to you. I don't know much about the motivation and use case of subgroup, but maybe there are a few solutions: for subgroup, you can choose to skip CSEL, and go straight to MPIR auto or fallback algorithms, or you can choose not to prune the tree and use the global tree, if you don't want to prune it.

zhenggb72 avatar Nov 11 '24 15:11 zhenggb72

We are delaying this PR to 4.4. To help merging this PR, we'll need more performance measurement to quantify the memory and performance impact. Also get more input to ensure everyone is on board.

hzhou avatar Nov 11 '24 17:11 hzhou

💡 Maybe we can simplify this PR if we are able to create lightweight communicator from a coll_group that shares the context_id from parent.

hzhou avatar Mar 21 '25 21:03 hzhou