rccl icon indicating copy to clipboard operation
rccl copied to clipboard

Added MPI test runner for executing unit/functional tests

Open atulkulk opened this issue 2 months ago • 5 comments

Details

Work item: Internal

What were the changes?
A simple C++ testing framework for multi-process RCCL testing using MPI and Google Test integration.

Why were the changes made?
Testing RCCL at unit and functional levels requires multiple processes with GPU-to-GPU data transfer to properly validate collective operations (AllReduce, Broadcast), point-to-point communication, and even specific features such as transport mechanism, etc. Features

  • Multi-process distributed testing: Each MPI rank runs in its own process with dedicated GPU
  • Automatic RCCL communicator management: Test-specific communicators for complete isolation between tests
  • Multi-node support: Works seamlessly on single-node (shared memory) or multi-node (network) configurations
  • Google Test integration: Familiar test patterns with TEST_F and assertion macro
  • Added provision to include non-GTest (standalone) tests

How was the outcome achieved?

  • MPITestBase: Base class providing test infrastructure (located in test/common/MPITestBase.hpp/cpp)
  • MPIEnvironment: Global MPI environment for initialization and cleanup

Additional Documentation:
Read test/README.md for more information.

Approval Checklist

Do not approve until these items are satisfied.

  • [ ] Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

atulkulk avatar Oct 21 '25 23:10 atulkulk

I think this one is pretty solid. I really like the documentation and the descriptions as well.

One unrelated observation, going over the tests in P2pMPITests.cpp that use the MPI test running: There are several places where we are cleaning up memory (as we should) but that's happening after ASSERT_EQ. If the assert were to fail, these clean up steps would not run. Do we need to revisit these and ensure that we have proper clean up irrespective of the termination of a test due to assertion failures?

venksubramd avatar Oct 24 '25 16:10 venksubramd

I think this one is pretty solid. I really like the documentation and the descriptions as well.

One unrelated observation, going over the tests in P2pMPITests.cpp that use the MPI test running: There are several places where we are cleaning up memory (as we should) but that's happening after ASSERT_EQ. If the assert were to fail, these clean up steps would not run. Do we need to revisit these and ensure that we have proper clean up irrespective of the termination of a test due to assertion failures?

@venksubramd I have addressed your comment. Added resource guards to handle cleaning up of the memory.

atulkulk avatar Oct 28 '25 15:10 atulkulk

@eidenyoshida this PR only adds unit tests, no effect on performance. Regression testing not required.

corey-derochie-amd avatar Oct 31 '25 14:10 corey-derochie-amd

@eidenyoshida this PR only adds unit tests, no effect on performance. Regression testing not required.

@corey-derochie-amd this PR was randomly selected just for testing the regression detector pipeline.

nileshnegi avatar Oct 31 '25 15:10 nileshnegi

regression-detection run on commit c8b6353d420da73ad7390596872ec62ce7b2d369

Artifacts - Results

ROCmMathLibrariesBot avatar Oct 31 '25 16:10 ROCmMathLibrariesBot

regression-detection run on commit 6244f1d45b37b3456a120790a1b6cd9fc858227f

Artifacts - Results

ROCmMathLibrariesBot avatar Nov 06 '25 22:11 ROCmMathLibrariesBot

regression-detection run on commit df578f85f8dddd4b3c64555f0c46f0e1ab268da0

Artifacts - Results

ROCmMathLibrariesBot avatar Nov 06 '25 23:11 ROCmMathLibrariesBot

regression-detection run on commit c999a1fc069876179476cc4149a17d823cbf523b

Artifacts - Results

ROCmMathLibrariesBot avatar Nov 08 '25 15:11 ROCmMathLibrariesBot

Nice work, addresses my previous comments. Thanks.

venksubramd avatar Nov 08 '25 21:11 venksubramd

regression-detection run on commit 0dbe259c980b22f509f4f21056ef65af2cda3659

Artifacts - Results

ROCmMathLibrariesBot avatar Nov 11 '25 06:11 ROCmMathLibrariesBot

regression-detection run on commit cc4946fb11348290daf2256eced002e6ae7d19cd

Artifacts - Results

ROCmMathLibrariesBot avatar Nov 15 '25 05:11 ROCmMathLibrariesBot