rccl Added MPI test runner for executing unit/functional tests

Details

Work item: Internal

What were the changes?
A simple C++ testing framework for multi-process RCCL testing using MPI and Google Test integration.

Why were the changes made?
Testing RCCL at unit and functional levels requires multiple processes with GPU-to-GPU data transfer to properly validate collective operations (AllReduce, Broadcast), point-to-point communication, and even specific features such as transport mechanism, etc. Features

Multi-process distributed testing: Each MPI rank runs in its own process with dedicated GPU
Automatic RCCL communicator management: Test-specific communicators for complete isolation between tests
Multi-node support: Works seamlessly on single-node (shared memory) or multi-node (network) configurations
Google Test integration: Familiar test patterns with TEST_F and assertion macro
Added provision to include non-GTest (standalone) tests

How was the outcome achieved?

MPITestBase: Base class providing test infrastructure (located in test/common/MPITestBase.hpp/cpp)
MPIEnvironment: Global MPI environment for initialization and cleanup

Additional Documentation:
Read test/README.md for more information.

Approval Checklist

Do not approve until these items are satisfied.

[ ] Verify the CHANGELOG has been updated, if
- there are any NCCL API version changes,
- any changes impact library users, and/or
- any changes impact any other ROCm library.

Oct 21 '25 23:10 atulkulk

I think this one is pretty solid. I really like the documentation and the descriptions as well.

One unrelated observation, going over the tests in P2pMPITests.cpp that use the MPI test running: There are several places where we are cleaning up memory (as we should) but that's happening after ASSERT_EQ. If the assert were to fail, these clean up steps would not run. Do we need to revisit these and ensure that we have proper clean up irrespective of the termination of a test due to assertion failures?

Oct 24 '25 16:10 venksubramd

I think this one is pretty solid. I really like the documentation and the descriptions as well.

One unrelated observation, going over the tests in P2pMPITests.cpp that use the MPI test running: There are several places where we are cleaning up memory (as we should) but that's happening after ASSERT_EQ. If the assert were to fail, these clean up steps would not run. Do we need to revisit these and ensure that we have proper clean up irrespective of the termination of a test due to assertion failures?

@venksubramd I have addressed your comment. Added resource guards to handle cleaning up of the memory.

Oct 28 '25 15:10 atulkulk

@eidenyoshida this PR only adds unit tests, no effect on performance. Regression testing not required.