Added MPI test runner for executing unit/functional tests
Details
Work item: Internal
What were the changes?
A simple C++ testing framework for multi-process RCCL testing using MPI and Google Test integration.
Why were the changes made?
Testing RCCL at unit and functional levels requires multiple processes with GPU-to-GPU data transfer to properly validate collective operations (AllReduce, Broadcast), point-to-point communication, and even specific features such as transport mechanism, etc.
Features
- Multi-process distributed testing: Each MPI rank runs in its own process with dedicated GPU
- Automatic RCCL communicator management: Test-specific communicators for complete isolation between tests
- Multi-node support: Works seamlessly on single-node (shared memory) or multi-node (network) configurations
- Google Test integration: Familiar test patterns with TEST_F and assertion macro
- Added provision to include non-GTest (standalone) tests
How was the outcome achieved?
- MPITestBase: Base class providing test infrastructure (located in test/common/MPITestBase.hpp/cpp)
- MPIEnvironment: Global MPI environment for initialization and cleanup
Additional Documentation:
Read test/README.md for more information.
Approval Checklist
Do not approve until these items are satisfied.
- [ ] Verify the CHANGELOG has been updated, if
- there are any NCCL API version changes,
- any changes impact library users, and/or
- any changes impact any other ROCm library.
I think this one is pretty solid. I really like the documentation and the descriptions as well.
One unrelated observation, going over the tests in P2pMPITests.cpp that use the MPI test running: There are several places where we are cleaning up memory (as we should) but that's happening after ASSERT_EQ. If the assert were to fail, these clean up steps would not run. Do we need to revisit these and ensure that we have proper clean up irrespective of the termination of a test due to assertion failures?
I think this one is pretty solid. I really like the documentation and the descriptions as well.
One unrelated observation, going over the tests in P2pMPITests.cpp that use the MPI test running: There are several places where we are cleaning up memory (as we should) but that's happening after ASSERT_EQ. If the assert were to fail, these clean up steps would not run. Do we need to revisit these and ensure that we have proper clean up irrespective of the termination of a test due to assertion failures?
@venksubramd I have addressed your comment. Added resource guards to handle cleaning up of the memory.
@eidenyoshida this PR only adds unit tests, no effect on performance. Regression testing not required.
@eidenyoshida this PR only adds unit tests, no effect on performance. Regression testing not required.
@corey-derochie-amd this PR was randomly selected just for testing the regression detector pipeline.
Nice work, addresses my previous comments. Thanks.