Added Functional Tests for CSV Tuner Plugin
Details
Work item: "Internal"
What were the changes?
Added a set of functional tests for the RCCL CSV tuner plugin, covering the following collectives:
- AllReduce
- Broadcast
- Reduce
- AllGather
- ReduceScatter
Why were the changes made?
The tests were added to verify the behavior of the CSV tuner plugin across different configurations, topologies, and collective operations.
How was the outcome achieved?
The tests cover the following scenarios:
- Valid configuration with wildcards
- Valid configuration without wildcards
- No matching configuration found
- Configuration file with invalid or incorrect values
- Configuration file with an unsupported algo/proto config
- Single-node configuration
- Multi-node configuration
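As a rough illustration of how these scenarios are exercised, the sketch below shows the kind of helpers such a test might use: assembling an mpirun command line and checking the run's log output for the tuner plugin's "tuning configurations" message. The function names (`build_mpirun_cmd`, `tuner_config_applied`) and the `NCCL_TUNER_PLUGIN` env-var usage shown here are illustrative assumptions, not the actual API of this test suite.

```python
# Hypothetical helpers for a CSV tuner plugin functional test.
# Names and env handling are illustrative, not the suite's real API.

def build_mpirun_cmd(binary, np=4, env=None):
    """Assemble an mpirun command line, exporting env vars via -x."""
    cmd = ["mpirun", "-np", str(np)]
    for key, value in (env or {}).items():
        cmd += ["-x", f"{key}={value}"]
    cmd.append(binary)
    return cmd

def tuner_config_applied(log_content):
    """Heuristic check that the plugin reported loading its CSV config."""
    return "tuning configurations" in log_content

# Example: a 4-process validation run with the tuner plugin enabled.
cmd = build_mpirun_cmd(
    "./all_reduce_perf",
    np=4,
    env={"NCCL_TUNER_PLUGIN": "libnccl-tuner-csv.so"},
)
```

A scenario test would then run `cmd` via `subprocess`, capture the log, and assert `tuner_config_applied(log)` (or its negation, for the "no matching configuration" case).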
Additional Documentation:
The test suite includes a detailed README with setup instructions, test execution commands, and pytest markers for selective test running.
Approval Checklist
Do not approve until these items are satisfied.
- [ ] Verify the CHANGELOG has been updated if:
  - there are any NCCL API version changes,
  - any changes impact library users, and/or
  - any changes impact any other ROCm library.
I only have a few minor comments:
test_broadcast.py: Some tests use mpirun with -np 4 whereas other tests use it with -np 8. Should we be running each of the tests with 4 processes and then with 8 processes? If there is a reason to choose 4 vs 8, it may be helpful to mention that in the tests.
Is there something more we should be looking for in the log_content than "tuning configurations", or is the presence of that text a strong indication that everything worked as expected?
test_allreduce.py: My apologies for the nitpick: should the skip message in pytest.skip(f"Multinode test requires at least 2 nodes, but...") include the word "allreduce", like so: pytest.skip(f"Multinode allreduce test requires at least 2 nodes, but...")?
Thank you
@venksubramd These are validation tests that can be run with a smaller number of processes (-np 4) to avoid generating large log files. We’ll be adding performance-related tests in a separate PR. I’ve also updated the multinode skip message as suggested.
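The multinode guard discussed above might look something like the following sketch; the `node_count` helper and the `SLURM_NNODES` fallback are assumptions for illustration, not the suite's actual implementation.

```python
# Hypothetical sketch of the per-collective multinode skip guard.
# node_count() is illustrative; the real suite may detect nodes differently.
import os

def node_count():
    # Assumption: derive the node count from a scheduler environment
    # variable when available, defaulting to a single node.
    return int(os.environ.get("SLURM_NNODES", "1"))

def multinode_skip_message(collective, nodes):
    """Build the skip message, naming the collective as suggested."""
    return (f"Multinode {collective} test requires at least 2 nodes, "
            f"but only {nodes} available")

# In a test: skip when fewer than 2 nodes are present.
# if node_count() < 2:
#     pytest.skip(multinode_skip_message("allreduce", node_count()))
```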
Thank you for the clarification, @Kapil-Shyam-Pawar. Looks good.
LGTM. Please get Nilesh's signoff on his question regarding the README.