rccl icon indicating copy to clipboard operation
rccl copied to clipboard

Added Functional Tests for CSV Tuner Plugin

Open Kapil-Shyam-Pawar opened this issue 2 months ago • 4 comments

Details

Do not mention proprietary info or link to internal work items in this PR.

Work item: "Internal"

What were the changes?
Added a set of functional tests for the RCCL CSV tuner plugin with support for below collectives:

  • AllReduce
  • Broadcast
  • Reduce
  • AllGather
  • ReduceScatter

Why were the changes made?
The tests are added to verify the behavior of CSV Tuner Plugin across different configurations, topologies, and collective operations.

How was the outcome achieved?
The tests cover below scenarios:

  • Valid configuration with wildcards
  • Valid configuration without wildcards
  • No matching configuration found
  • Configuration file with invalid / incorrect values
  • Configuration file unsupported algo / proto config
  • Single-node configuration
  • Multi-node configuration

Additional Documentation:
The test suite includes a detailed README with setup instructions, test execution commands, and pytest markers for selective test running

Approval Checklist

Do not approve until these items are satisfied.

  • [ ] Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

Kapil-Shyam-Pawar avatar Oct 08 '25 14:10 Kapil-Shyam-Pawar

I only have a few minor comments:

test_broadcast.py: Some tests use mpirun with -np 4 whereas other tests use it with -np 8. Should we be running each of the tests with 4 processors and then with 8 processors? If there is a reason to choose 4 vs 8, it may be helpful to mention that in the tests.

Is there something more we should be looking for in the log_content than "tuning configurations" or is the presence of that text is a strong indication that all worked as expected?

test_allreduce.py: My apologies for nitpick: Should "pytest.skip(f"Multinode test requires at least 2 nodes, but..." include the word allreduce, like so: "pytest.skip(f"Multinode allreduce test requires at least 2 nodes, but" ?

Thank you

venksubramd avatar Oct 15 '25 21:10 venksubramd

@venksubramd These are validation tests that can be run with a smaller number of processes (-np 4) to avoid generating large log files. We’ll be adding performance-related tests in a separate PR. I’ve also updated the multinode skip message as suggested.

I only have a few minor comments:

test_broadcast.py: Some tests use mpirun with -np 4 whereas other tests use it with -np 8. Should we be running each of the tests with 4 processors and then with 8 processors? If there is a reason to choose 4 vs 8, it may be helpful to mention that in the tests.

Is there something more we should be looking for in the log_content than "tuning configurations" or is the presence of that text is a strong indication that all worked as expected?

test_allreduce.py: My apologies for nitpick: Should "pytest.skip(f"Multinode test requires at least 2 nodes, but..." include the word allreduce, like so: "pytest.skip(f"Multinode allreduce test requires at least 2 nodes, but" ?

Thank you

Kapil-Shyam-Pawar avatar Oct 20 '25 15:10 Kapil-Shyam-Pawar

@venksubramd These are validation tests that can be run with a smaller number of processes (-np 4) to avoid generating large log files. We’ll be adding performance-related tests in a separate PR. I’ve also updated the multinode skip message as suggested.

I only have a few minor comments: test_broadcast.py: Some tests use mpirun with -np 4 whereas other tests use it with -np 8. Should we be running each of the tests with 4 processors and then with 8 processors? If there is a reason to choose 4 vs 8, it may be helpful to mention that in the tests. Is there something more we should be looking for in the log_content than "tuning configurations" or is the presence of that text is a strong indication that all worked as expected? test_allreduce.py: My apologies for nitpick: Should "pytest.skip(f"Multinode test requires at least 2 nodes, but..." include the word allreduce, like so: "pytest.skip(f"Multinode allreduce test requires at least 2 nodes, but" ? Thank you

Thank you for the clarification, @Kapil-Shyam-Pawar. Looks good.

venksubramd avatar Oct 20 '25 15:10 venksubramd

LGTM please get Nilesh's signoff on his question regarding README.

thananon avatar Oct 29 '25 14:10 thananon