ch4: enablement on platforms with 8+ nics
Pull Request Description
Remove the 8-nic limit and enable MPICH to support high nic-count platforms
Ref: https://github.com/pmodels/mpich/issues/6688
Author Checklist
- [x] Provide Description Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
- [x] Commits Follow Good Practice
Commits are self-contained and do not do two things at once.
Commit message is of the form: module: short description
Commit message explains what's in the commit.
- [ ] Passes All Tests Whitespace checker. Warnings test. Additional tests via comments.
- [x] Contribution Agreement For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.
@hzhou @raffenet Could you kindly review the change?
tag @nitbhat @tarudoodi for review
Converted change to draft for testing.
@hzhou Please take a look at the update. I followed your suggestion to deprecate MPIDI_OFI_MAX_NICS. As a result, I had to change a few static variables and data structures so that their size is determined at runtime and their memory is allocated dynamically. I tried to make sure the memory is allocated and freed properly.
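For readers following along, the kind of change described above (a fixed-size array bounded by a compile-time constant becoming a heap-allocated table sized at runtime) can be sketched roughly as follows. This is only an illustration, not the code in this PR: `nic_info_t`, `nic_table_t`, `nic_table_init`, and `nic_table_free` are hypothetical names, and the real MPICH structures hold different fields.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-NIC record; not MPICH's actual structure. */
typedef struct {
    int id;
    int in_use;
} nic_info_t;

/* Hypothetical table standing in for a fixed `nic_info_t nics[8]` array. */
typedef struct {
    nic_info_t *nics;   /* heap-allocated, num_nics entries */
    int num_nics;
} nic_table_t;

/* Allocate room for num_nics entries. Returns 0 on success, -1 on failure. */
static int nic_table_init(nic_table_t *t, int num_nics)
{
    assert(t != NULL);
    t->nics = NULL;
    t->num_nics = 0;
    if (num_nics <= 0)
        return -1;
    /* calloc zero-initializes, so in_use starts at 0 for every entry. */
    t->nics = calloc((size_t) num_nics, sizeof(*t->nics));
    if (t->nics == NULL)
        return -1;
    for (int i = 0; i < num_nics; i++)
        t->nics[i].id = i;
    t->num_nics = num_nics;
    return 0;
}

/* Release the table and reset it so a second call is harmless. */
static void nic_table_free(nic_table_t *t)
{
    assert(t != NULL);
    free(t->nics);
    t->nics = NULL;
    t->num_nics = 0;
}
```

Pairing every init with a free that resets the pointer is one simple way to keep the allocation and release paths easy to audit, which is the concern raised in the comment above.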
For my testing, I ran the osu microbenchmarks on 2 different platforms with ./configure ... --enable-g=all
- hpc6a.48xlarge with 1 NIC
- p5.48xlarge with 32 NICs
I added debug prints and verified that the code selects the right NIC for each rank, and that MPIR_CVAR_CH4_OFI_MAX_NICS can be used to restrict NIC visibility.
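To make the check above concrete, here is a minimal sketch of the behavior being verified: a cap limits how many NICs are visible, and each node-local rank is mapped onto one of the visible NICs. Everything here is an assumption for illustration. `visible_nics` and `pick_nic` are hypothetical helpers, the round-robin mapping may not match the policy MPICH actually uses, and treating a cap of 0 or less as "no cap" is a choice made for this sketch rather than documented behavior of MPIR_CVAR_CH4_OFI_MAX_NICS.

```c
#include <assert.h>

/* Hypothetical: number of NICs visible after applying a user-supplied cap.
 * In this sketch a cap <= 0 means "no cap" (an assumption). */
static int visible_nics(int detected, int max_nics_cap)
{
    assert(detected > 0);
    if (max_nics_cap > 0 && max_nics_cap < detected)
        return max_nics_cap;
    return detected;
}

/* Hypothetical round-robin mapping of a node-local rank to a NIC index. */
static int pick_nic(int local_rank, int detected, int max_nics_cap)
{
    assert(local_rank >= 0);
    return local_rank % visible_nics(detected, max_nics_cap);
}
```

Under these assumptions, on a 32-NIC node `pick_nic(5, 32, 0)` gives NIC 5, while a cap of 4 makes `pick_nic(5, 32, 4)` give NIC 1. On a single-NIC node every rank maps to NIC 0 regardless of the cap.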
Please let me know your thoughts.
@wenduwan The PR looks ok to me. It keeps the usage that we had intended too. Please run the tests to make sure the testsuite passes.
@tarudoodi Thank you for looking! I have applied the whitespace diff and updated the PR.
Sorry for the delay in getting back on this. Reviewed, the changes look good.
Do the CI tests exercise the multi-NIC code? If not, it would be good to test manually on some major multi-NIC platforms to ensure that NIC assignment looks good. (Platforms could be: AWS instances, Aurora, InfiniBand systems with multiple NICs.)
@hzhou Apologies for the inactivity. I brought up this change again to my leadership, and unfortunately we have decided to table the project for now due to resource contention. I will close the PR and reopen once we get a chance to re-evaluate the project.