mpich coll: improve alltoallv

Pull Request Description

Add MPIR_CVAR_CH4_PROGRESS_THROTTLE. Q: should we always enable progress throttling?
The naive linear pairing holds the higher ranks until the lower ranks reach them. Rank N-1 is blocked at its first exchange until rank 0 is nearly finished.

Slightly improve the algorithm, especially for the high-PPN case: do pairwise exchanges within each node first, then finish the remaining inter-node exchanges with the naive pairing.

Also, the double loop followed by a rank check appears to be a roundabout way of writing a single loop.
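
To illustrate the point about the double loop, here is a minimal standalone sketch (no MPI calls). It assumes the existing code iterates over all pairs `(i, j)` with `i <= j` and acts only when the calling rank is `i` or `j`; the function names are hypothetical and not taken from the MPICH source. Under that assumption every rank visits its peers in the order 0, 1, ..., N-1, which is also why rank N-1 has to wait for rank 0 first.

```c
#include <assert.h>

/* Partner order for `rank` under the nested loop (assumed form of the
 * existing code). Writes the partners to out[], which must hold at least
 * nprocs entries, and returns how many were written. */
static int partners_double_loop(int rank, int nprocs, int *out)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    int n = 0;
    for (int i = 0; i < nprocs; i++) {
        for (int j = i; j < nprocs; j++) {
            if (rank == i) {
                out[n++] = j;   /* also covers the self exchange i == j */
            } else if (rank == j) {
                out[n++] = i;
            }
        }
    }
    return n;
}

/* The same partner order written as a single loop over peers. */
static int partners_single_loop(int rank, int nprocs, int *out)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    int n = 0;
    for (int peer = 0; peer < nprocs; peer++) {
        out[n++] = peer;        /* the order does not depend on rank */
    }
    return n;
}
```

Both functions produce the same sequence for every rank, so in this simplified model the nested loop adds iterations without changing the schedule.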

A better pairing selects sendrecv pairs by bit flipping. Each rank exchanges with itself first, then with its immediate neighbor, then with neighbors at increasing bit distances. If the ranks on each node are consecutive and the number of processes per node is a power of 2, this captures the node-first pairing of the previous algorithm.
[ ] The same optimization should apply to the linear pairwise algorithms in alltoall and alltoallw. The code looks like it needs refactoring. [skip warnings]
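
The bit-flipping pairing described above can be sketched as follows. This is an illustration under assumptions, not the code in this PR: it assumes the partner at each step is `rank ^ step`, that steps run from 0 up to the next power of two at or above the communicator size, and that a step is skipped when the flipped rank falls outside the communicator. The helper names `next_pof2` and `xor_partner` are hypothetical.

```c
#include <assert.h>

/* Smallest power of two that is >= n, for n >= 1. */
static int next_pof2(int n)
{
    assert(n >= 1);
    int p = 1;
    while (p < n) {
        p <<= 1;
    }
    return p;
}

/* Partner of `rank` at `step`, for step in [0, next_pof2(nprocs)).
 * Returns -1 when the flipped rank is outside the communicator, which
 * can only happen when nprocs is not a power of two. */
static int xor_partner(int rank, int step, int nprocs)
{
    assert(rank >= 0 && rank < nprocs);
    assert(step >= 0 && step < next_pof2(nprocs));
    int peer = rank ^ step;     /* step 0 is self, step 1 the neighbor */
    return (peer < nprocs) ? peer : -1;
}
```

Because XOR is its own inverse, both ranks in a pair compute each other at the same step, so no rank waits on a peer that is busy with a different partner. With 4 consecutive ranks per node, steps 0 to 3 only flip the two low bits, so `rank / 4` is unchanged and those exchanges stay on the node.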

Author Checklist

[x] Provide Description Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
[x] Commits Follow Good Practice Commits are self-contained and do not do two things at once. Commit message is of the form: module: short description Commit message explains what's in the commit.
[ ] Passes All Tests Whitespace checker. Warnings test. Additional tests via comments.
[x] Contribution Agreement For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.

Apr 04 '25 22:04 hzhou

test:mpich/ch3/most test:mpich/ch4/most

Apr 06 '25 02:04 hzhou

test:mpich/custom env: MPIR_CVAR_ALLTOALLV_PAIRWISE_NEW=1

test:mpich/ch3/most test:mpich/ch4/most

Oct 15 '25 16:10 hzhou

test:mpich/custom env: MPIR_CVAR_ALLTOALLV_PAIRWISE_NEW=1

test:mpich/ch3/most test:mpich/ch4/most

Oct 15 '25 19:10 hzhou

The tests passed

Oct 15 '25 22:10 hzhou