mpich icon indicating copy to clipboard operation
mpich copied to clipboard

ch4: enablement on platforms with 8+ nics

Open wenduwan opened this issue 2 years ago • 7 comments

Pull Request Description

Remove the 8-nic limit and enable MPICH to support high nic-count platforms

Ref: https://github.com/pmodels/mpich/issues/6688

Author Checklist

  • [x] Provide Description Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • [x] Commits Follow Good Practice Commits are self-contained and do not do two things at once. Commit message is of the form: module: short description Commit message explains what's in the commit.
  • [ ] Passes All Tests Whitespace checker. Warnings test. Additional tests via comments.
  • [x] Contribution Agreement For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your companies PR approval manager.

wenduwan avatar Sep 22 '23 02:09 wenduwan

@hzhou @raffenet Could you kindly review the change?

wenduwan avatar Sep 25 '23 17:09 wenduwan

tag @nitbhat @tarudoodi for review

hzhou avatar Oct 04 '23 01:10 hzhou

Converted change to draft for testing.

wenduwan avatar Oct 04 '23 20:10 wenduwan

@hzhou Please take a look at the update. I followed your suggestion to deprecate MPIDI_OFI_MAX_NICS. As a result I had to change a few static variables and data structures to make their size and memory dynamic. I tried to make sure the memory is alloced and freed properly.

For my testing, I ran the osu microbenchmarks on 2 different platforms with ./configure ... --enable-g=all

  1. hpc6a.48xlarge with 1 NIC
  2. p5.48xlarge with 32 NICs

I added prints and verified that the code selects the right nic for each rank, and MPIR_CVAR_CH4_OFI_MAX_NICS can be used to restrict NIC visibility.

Please let me know your thoughts.

wenduwan avatar Oct 09 '23 14:10 wenduwan

@wenduwan The PR looks ok to me. It keeps the usage that we had intended too. Please run the tests to make sure the testsuite passes.

tarudoodi avatar Oct 10 '23 23:10 tarudoodi

@wenduwan The PR looks ok to me. It keeps the usage that we had intended too. Please run the tests to make sure the testsuite passes.

@tarudoodi Thank you for looking! I have applied the whitespace diff and updated the PR.

wenduwan avatar Oct 13 '23 01:10 wenduwan

tag @nitbhat @tarudoodi for review

Sorry the delay in getting back on this. Reviewed, the changes look good.

Do the CI tests exercise multi-nic code? Otherwise, it'll be good to manually test on some major multi-nic platforms to ensure that nic assignment looks good. (Platforms could be: aws instances, Aurora, Infiniband systems with multiple nics).

nitbhat avatar Oct 17 '23 09:10 nitbhat

@hzhou Apologies for the inactivity. I brought up this change again to my leadership, and unfortunately we have decided to table the project for now due to resource contention. I will close the PR and reopen once we get a chance to re-evaluate the project.

wenduwan avatar Aug 13 '24 14:08 wenduwan