ompi icon indicating copy to clipboard operation
ompi copied to clipboard

Add the acoll component

Open amd-nithyavs opened this issue 1 year ago • 16 comments
trafficstars

This PR introduces "acoll", a high-performant collective component that is optimized for communications within a single node of AMD EPYC CPUs. It mainly uses subcommunicators based on l3cache or numa to reduce cross-cache or cross-numa accesses. The supported collectives include Bcast, Allreduce, Gather, Reduce, Barrier, Allgather.

OSU micro-benchmarks were run on 2-socket AMD EPYC 9654 96-Core Processor with 4 NUMA domains per socket, with a total of 192 cores per node, on top of commit bb7ecde1d52d0c1f058d2585e9a1f9794df40684. Average percentage latency reduction over "tuned" across 32, 64, 96, 128, 192 ranks over message sizes of 8 bytes to 8 MB (varied in powers of 2):

  • Allreduce: 37.7%
  • Bcast: 39.4%
  • Gather: 27.5%

Sample graphs:

Allreduce allreduce

Bcast bcast

Gather gather

amd-nithyavs avatar Apr 22 '24 17:04 amd-nithyavs

How this compares with https://github.com/open-mpi/ompi/pull/10470 ?

bosilca avatar Apr 22 '24 18:04 bosilca

How this compares with #10470 ?

We can discuss it at the meeting. Part of the goal of filing the pr was to give people the ability to have a look at it ahead of the meeting if they want/can.

edgargabriel avatar Apr 22 '24 18:04 edgargabriel

Please rebase to current main to get rid of the mpi4py failure.

bosilca avatar Apr 29 '24 20:04 bosilca

I tested the PR in AWS CI. I'm seeing assertion errors with --enable-debug

# [ pairs: 18 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
osu_mbw_mr: coll_acoll_allgather.c:391: mca_coll_acoll_allgather: Assertion `subc->local_r_comm != NULL' failed.

You can try osu_mbw_mr.

wenduwan avatar Apr 30 '24 19:04 wenduwan

@amd-nithyavs could you rebase this PR to see if that clears up the mpi4py CI failure?

hppritcha avatar May 01 '24 18:05 hppritcha

@amd-nithyavs could you rebase this PR to see if that clears up the mpi4py CI failure?

@hppritcha we did have issues after rebase. Have fixed the issues, will update the PR soon. Thanks.

mshanthagit avatar May 01 '24 18:05 mshanthagit

I tested the PR in AWS CI. I'm seeing assertion errors with --enable-debug

# [ pairs: 18 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
osu_mbw_mr: coll_acoll_allgather.c:391: mca_coll_acoll_allgather: Assertion `subc->local_r_comm != NULL' failed.

You can try osu_mbw_mr.

The updated PR (yet to be pushed) will fix this issue. Thanks.

mshanthagit avatar May 01 '24 18:05 mshanthagit

I tested the PR in AWS CI. I'm seeing assertion errors with --enable-debug

# [ pairs: 18 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
osu_mbw_mr: coll_acoll_allgather.c:391: mca_coll_acoll_allgather: Assertion `subc->local_r_comm != NULL' failed.

You can try osu_mbw_mr.

The issue is fixed in the updated PR.

amd-nithyavs avatar May 08 '24 04:05 amd-nithyavs

@amd-nithyavs could you rebase this PR to see if that clears up the mpi4py CI failure?

We have updated the PR, it passes the mpi4py tests.

amd-nithyavs avatar May 08 '24 04:05 amd-nithyavs

Running AWS CI

wenduwan avatar May 08 '24 13:05 wenduwan

@amd-nithyavs I noticed that the PR is currently split into 3 commits. Please squash them before merging.

wenduwan avatar May 08 '24 13:05 wenduwan

Passed AWS CI. Note that we don't test with xpmem.

wenduwan avatar May 08 '24 16:05 wenduwan

Hello! The Git Commit Checker CI bot found a few problems with this PR:

2f7c5e28: Merge latest of local ompiv5

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

github-actions[bot] avatar May 10 '24 05:05 github-actions[bot]

@wenduwan We have rebased to the latest and squashed the commits.

amd-nithyavs avatar May 10 '24 06:05 amd-nithyavs

@wenduwan we have addressed the comments and incorporated some big count related changes as well.

mshanthagit avatar May 21 '24 14:05 mshanthagit

@bosilca @devreal could you please review the PR as well? Wenduo has also provided an initial set of comments which we have addressed.

Thanks, Manu

mshanthagit avatar Jun 04 '24 16:06 mshanthagit

Please add the owner.txt file.

Ack

amd-nithyavs avatar Jul 16 '24 09:07 amd-nithyavs

bot:retest

edgargabriel avatar Jul 16 '24 15:07 edgargabriel