Added Functional Tests for CSV Tuner Plugin
Details
Work item: "Internal"
What were the changes?
Added a set of functional tests for the RCCL CSV tuner plugin, covering the following collectives:
- AllReduce
- Broadcast
- Reduce
- AllGather
- ReduceScatter
Why were the changes made?
The tests were added to verify the behavior of the CSV tuner plugin across different configurations, topologies, and collective operations.
How was the outcome achieved?
The tests cover the following scenarios:
- Valid configuration with wildcards
- Valid configuration without wildcards
- No matching configuration found
- Configuration file with invalid or incorrect values
- Configuration file with an unsupported algo/proto config
- Single-node configuration
- Multi-node configuration
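As a rough illustration of how these scenarios are exercised, the sketch below shows the kind of helpers such a test might use: assembling an mpirun command line and checking the run's log output for the tuner plugin's "tuning configurations" message. The function names (`build_mpirun_cmd`, `tuner_config_applied`) and the `NCCL_TUNER_PLUGIN` env-var usage shown here are illustrative assumptions, not the actual API of this test suite.

```python
# Hypothetical helpers for a CSV tuner plugin functional test.
# Names and env handling are illustrative, not the suite's real API.

def build_mpirun_cmd(binary, np=4, env=None):
    """Assemble an mpirun command line, exporting env vars via -x."""
    cmd = ["mpirun", "-np", str(np)]
    for key, value in (env or {}).items():
        cmd += ["-x", f"{key}={value}"]
    cmd.append(binary)
    return cmd

def tuner_config_applied(log_content):
    """Heuristic check that the plugin reported loading its CSV config."""
    return "tuning configurations" in log_content

# Example: a 4-process validation run with the tuner plugin enabled.
cmd = build_mpirun_cmd(
    "./all_reduce_perf",
    np=4,
    env={"NCCL_TUNER_PLUGIN": "libnccl-tuner-csv.so"},
)
```

A scenario test would then run `cmd` via `subprocess`, capture the log, and assert `tuner_config_applied(log)` (or its negation, for the "no matching configuration" case).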
Additional Documentation:
The test suite includes a detailed README with setup instructions, test execution commands, and pytest markers for selective test running.
Approval Checklist
Do not approve until these items are satisfied.
- [ ] Verify the CHANGELOG has been updated if:
  - there are any NCCL API version changes,
  - any changes impact library users, and/or
  - any changes impact any other ROCm library.
I only have a few minor comments:
test_broadcast.py: Some tests use mpirun with -np 4 whereas other tests use it with -np 8. Should we be running each of the tests with 4 processes and then with 8 processes? If there is a reason to choose 4 vs 8, it may be helpful to mention that in the tests.
Is there something more we should be looking for in the log_content than "tuning configurations", or is the presence of that text a strong indication that everything worked as expected?
test_allreduce.py: My apologies for the nitpick: should the skip message in pytest.skip(f"Multinode test requires at least 2 nodes, but...") include the word "allreduce", like so: pytest.skip(f"Multinode allreduce test requires at least 2 nodes, but...")?
Thank you
@venksubramd These are validation tests that can be run with a smaller number of processes (-np 4) to avoid generating large log files. We’ll be adding performance-related tests in a separate PR. I’ve also updated the multinode skip message as suggested.
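The multinode guard discussed above might look something like the following sketch; the `node_count` helper and the `SLURM_NNODES` fallback are assumptions for illustration, not the suite's actual implementation.

```python
# Hypothetical sketch of the per-collective multinode skip guard.
# node_count() is illustrative; the real suite may detect nodes differently.
import os

def node_count():
    # Assumption: derive the node count from a scheduler environment
    # variable when available, defaulting to a single node.
    return int(os.environ.get("SLURM_NNODES", "1"))

def multinode_skip_message(collective, nodes):
    """Build the skip message, naming the collective as suggested."""
    return (f"Multinode {collective} test requires at least 2 nodes, "
            f"but only {nodes} available")

# In a test: skip when fewer than 2 nodes are present.
# if node_count() < 2:
#     pytest.skip(multinode_skip_message("allreduce", node_count()))
```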
Thank you for the clarification, @Kapil-Shyam-Pawar. Looks good.
LGTM. Please get Nilesh's signoff on his question regarding the README.