rccl Added a Process Isolated Test Runner

Details

Work item: Internal

What were the changes?
A lightweight C++ testing framework that executes Google Test cases in separate processes, ensuring complete isolation between tests.

Why were the changes made?
RCCL code extensively uses static variables for configuration caching (e.g., rcclSetP2pNetChunkSize). In traditional testing, static variables persist across test cases, causing tests to pollute each other's state and produce unreliable results. This is particularly problematic when testing:

Environment variable behavior
Architecture-specific logic
One-time initialization patterns
Configuration changes

How was the outcome achieved?

Uses fork() for process isolation
Hooked into rccl-UnitTestsFixtures binary

Additional Documentation:
Read test/README.md for more information.

Approval Checklist

Do not approve until these items are satisfied.

[ ] Verify the CHANGELOG has been updated, if
- there are any NCCL API version changes,
- any changes impact library users, and/or
- any changes impact any other ROCm library.

Oct 21 '25 16:10 atulkulk

Nice work on this one @atulkulk

I really like the builder pattern to configure the process, for example where a call to registerTest(...) is followed with .withEnvironment(), .withTimeout(), and so on.

Also, really nice documentation in ProcessIsolatedTestRunner.md. It was very useful and easy to follow. Thanks for this.

A few of questions:

In ProcessIsolatedTestRunner.md, in the "Why use Process Isolation?" section, this may be me misunderstanding, but,

setenv("NCCL_P2P_NET_CHUNKSIZE", "12345", 1); // Static variable gets set to 12345

says the static variable is being set, but I do not see how the p2pNetChunkSize variable is being set by that call. My lack of understanding comes from the fact that p2pNetChunkSize is being set to value returned by a calculateValue() call. I may be missing the link between createValue() and the setting of the above environment variable. Some clarification for the example in this section will be helpful.

Looking at the file RcclWrapTests.cpp:

In TEST(Rcclwrap, RcclUpdateThreadThreshold_UserEnvSet), the test is being registered with a call to registerTest, but I do not see a call to executeAllTests(). Is that required or does the test run without an explicit call to it. If it is the latter, it would be useful to clarify that in the documentation.

I really like the emphasis in the "Best Practices" section to call clear before any registration. When clear is called, if there are tests that were registered but not executed or registered by not completed, it would be useful to know that. May be an error code? Or may be a log message? It would be useful to consider the situation where tests were registered by were not executed.

Alternatively, if there is an approach to check if not all registered tests were executed will be helpful.

Oct 24 '25 14:10 venksubramd

Nice work on this one @atulkulk

I really like the builder pattern to configure the process, for example where a call to registerTest(...) is followed with .withEnvironment(), .withTimeout(), and so on.

Also, really nice documentation in ProcessIsolatedTestRunner.md. It was very useful and easy to follow. Thanks for this.

A few of questions:

In ProcessIsolatedTestRunner.md, in the "Why use Process Isolation?" section, this may be me misunderstanding, but,

setenv("NCCL_P2P_NET_CHUNKSIZE", "12345", 1); // Static variable gets set to 12345

says the static variable is being set, but I do not see how the p2pNetChunkSize variable is being set by that call. My lack of understanding comes from the fact that p2pNetChunkSize is being set to value returned by a calculateValue() call. I may be missing the link between createValue() and the setting of the above environment variable. Some clarification for the example in this section will be helpful.

done (elaborated examples in README.md)

Looking at the file RcclWrapTests.cpp:

In TEST(Rcclwrap, RcclUpdateThreadThreshold_UserEnvSet), the test is being registered with a call to registerTest, but I do not see a call to executeAllTests(). Is that required or does the test run without an explicit call to it. If it is the latter, it would be useful to clarify that in the documentation.

done

I really like the emphasis in the "Best Practices" section to call clear before any registration. When clear is called, if there are tests that were registered but not executed or registered by not completed, it would be useful to know that. May be an error code? Or may be a log message? It would be useful to consider the situation where tests were registered by were not executed.

Alternatively, if there is an approach to check if not all registered tests were executed will be helpful.

done

@venksubramd I have addressed all your comments.

Nov 05 '25 19:11 atulkulk

Nice changes. I especially like the warning for unexecuted tests when clear is called.

Nov 08 '25 20:11 venksubramd