ch4/ofi: Add support for NIC assignment for SNC4 mode for Aurora

Open tarudoodi opened this issue 2 years ago • 3 comments

Pull Request Description

This PR adds a preferred NIC assignment for ranks mapped to different sub-NUMA nodes when the CPU is in SNC4 mode. The implementation is specific to the Aurora node layout. The PR also adds helper functions to identify the SNC4 nodes (reported as groups by hwloc) and to find the closest NICs for ranks on a specific SNC node.

The previous round-robin NIC assignment is preserved for non-SNC mode.
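
The selection logic described above can be illustrated with a minimal sketch. This is not the code from the PR: the function name `pick_preferred_nic`, the `nic_domain` array, and the convention of passing `-1` for non-SNC mode are all hypothetical, and the real implementation derives locality from hwloc and the Aurora node layout rather than from a precomputed table.

```c
#include <assert.h>

/* Hypothetical sketch, not MPICH code: choose a preferred NIC for a rank.
 *
 * nic_domain[i] is the sub-NUMA (SNC) domain that NIC i is closest to.
 *               This table is an assumed input; a real implementation
 *               would query a topology library such as hwloc.
 * rank_domain   is the SNC domain the rank is bound to, or -1 when the
 *               CPU is not in SNC mode.
 * local_rank    is the rank's index on the node (must be non-negative).
 *
 * Returns the chosen NIC index, or -1 if there are no NICs. */
int pick_preferred_nic(const int *nic_domain, int num_nics,
                       int rank_domain, int local_rank)
{
    assert(local_rank >= 0);

    if (num_nics <= 0)
        return -1;

    if (rank_domain >= 0) {
        /* Count the NICs that are closest to this rank's SNC domain. */
        int num_close = 0;
        for (int i = 0; i < num_nics; i++) {
            if (nic_domain[i] == rank_domain)
                num_close++;
        }

        if (num_close > 0) {
            /* Spread ranks sharing a domain across that domain's NICs. */
            int target = local_rank % num_close;
            for (int i = 0; i < num_nics; i++) {
                if (nic_domain[i] != rank_domain)
                    continue;
                if (target == 0)
                    return i;
                target--;
            }
        }
    }

    /* Non-SNC mode, or no NIC near this domain: plain round robin. */
    return local_rank % num_nics;
}
```

With a table such as `{0, 0, 1, 1}`, a rank in domain 1 is only ever given NIC 2 or NIC 3, while a rank passing `-1` for the domain cycles through all four NICs by local rank.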

Author Checklist

  • [x] Provide Description: Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • [x] Commits Follow Good Practice: Commits are self-contained and do not do two things at once. Commit message is of the form: `module: short description`. Commit message explains what's in the commit.
  • [x] Passes All Tests: Whitespace checker. Warnings test. Additional tests via comments.
  • [x] Contribution Agreement: For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.

tarudoodi avatar Oct 20 '23 22:10 tarudoodi

test:mpich/ch4/ofi

tarudoodi avatar Oct 20 '23 22:10 tarudoodi

@hzhou The tests passed.

tarudoodi avatar Oct 21 '23 05:10 tarudoodi

test:mpich/ch4/ofi

tarudoodi avatar Dec 15 '23 22:12 tarudoodi

test:mpich/ch4/ofi

hzhou avatar Mar 15 '24 04:03 hzhou

test:mpich/ch3/tcp

hzhou avatar Mar 15 '24 04:03 hzhou

Thanks for the review @hzhou! Latest testing passed prior to rebasing (ch4 testing, ch3 testing), so I think we can merge once the basic checks complete.

abrooks98 avatar Mar 19 '24 18:03 abrooks98

Merging the branch since the tests passed and it is rebased on top of the latest main.

tarudoodi avatar Mar 20 '24 19:03 tarudoodi