Add the acoll component
This PR introduces "acoll", a high-performance collective component that is optimized for communication within a single node of AMD EPYC CPUs. It mainly uses subcommunicators based on L3 cache or NUMA domains to reduce cross-cache or cross-NUMA accesses. The supported collectives include Bcast, Allreduce, Gather, Reduce, Barrier, and Allgather.
OSU micro-benchmarks were run on a 2-socket AMD EPYC 9654 96-Core Processor with 4 NUMA domains per socket (192 cores per node in total), on top of commit bb7ecde1d52d0c1f058d2585e9a1f9794df40684. Average percentage latency reduction over "tuned" across 32, 64, 96, 128, and 192 ranks, over message sizes from 8 bytes to 8 MB (varied in powers of 2):
- Allreduce: 37.7%
- Bcast: 39.4%
- Gather: 27.5%
Sample graphs (latency plots for Allreduce, Bcast, and Gather; images not reproduced here).
How does this compare with https://github.com/open-mpi/ompi/pull/10470?
We can discuss it at the meeting. Part of the goal of filing the PR was to give people the ability to have a look at it ahead of the meeting if they want/can.
Please rebase to current main to get rid of the mpi4py failure.
I tested the PR in AWS CI. I'm seeing assertion errors with --enable-debug:
# [ pairs: 18 ] [ window size: 64 ]
# Size MB/s Messages/s
osu_mbw_mr: coll_acoll_allgather.c:391: mca_coll_acoll_allgather: Assertion `subc->local_r_comm != NULL' failed.
You can try osu_mbw_mr.
@amd-nithyavs could you rebase this PR to see if that clears up the mpi4py CI failure?
> @amd-nithyavs could you rebase this PR to see if that clears up the mpi4py CI failure?
@hppritcha we did have issues after rebase. Have fixed the issues, will update the PR soon. Thanks.
> I tested the PR in AWS CI. I'm seeing assertion errors with --enable-debug:
> # [ pairs: 18 ] [ window size: 64 ]
> # Size MB/s Messages/s
> osu_mbw_mr: coll_acoll_allgather.c:391: mca_coll_acoll_allgather: Assertion `subc->local_r_comm != NULL' failed.
> You can try osu_mbw_mr.
The updated PR (yet to be pushed) will fix this issue. Thanks.
> I tested the PR in AWS CI. I'm seeing assertion errors with --enable-debug:
> # [ pairs: 18 ] [ window size: 64 ]
> # Size MB/s Messages/s
> osu_mbw_mr: coll_acoll_allgather.c:391: mca_coll_acoll_allgather: Assertion `subc->local_r_comm != NULL' failed.
> You can try osu_mbw_mr.
The issue is fixed in the updated PR.
> @amd-nithyavs could you rebase this PR to see if that clears up the mpi4py CI failure?
We have updated the PR; it passes the mpi4py tests.
Running AWS CI
@amd-nithyavs I noticed that the PR is currently split into 3 commits. Please squash them before merging.
Passed AWS CI. Note that we don't test with xpmem.
Hello! The Git Commit Checker CI bot found a few problems with this PR:
2f7c5e28: Merge latest of local ompiv5
- check_signed_off: does not contain a valid Signed-off-by line
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
@wenduwan We have rebased to the latest and squashed the commits.
@wenduwan we have addressed the comments and incorporated some big-count-related changes as well.
@bosilca @devreal could you please review the PR as well? Wenduo has also provided an initial set of comments which we have addressed.
Thanks, Manu
Please add the owner.txt file.
Ack
bot:retest