[GRAPH] Adding support for rail-optimized trees for MI3XX with 4 NICs

Open gilbertlee-amd opened this issue 1 month ago • 0 comments

Details

Adding rail-optimized tree support for MI3XX configurations with only 4 NICs per node

Work item: "Internal", or link to GitHub issue (if applicable).

What were the changes?
Added a rail-optimized tree config to model_87, which is the model detected for MI3XX with 4 NICs per node.

Why were the changes made?
This can potentially reduce extra traffic that goes beyond the first layer of NIC switches.

How was the outcome achieved?
The change is a slight adjustment of the MI3XX 8-NIC rail-optimized tree configuration.

Additional Documentation:
Validation was done with the topology explorer and the RCCL_OUTPUT_TREES output. Attached are images of the default trees (built with RCCL_DISABLE_RAIL_TREES=1) and of the trees built after the change, for a 4-node configuration:

MI3XX_4NIC_DefaultTrees_4Nodes
MI3XX_4NIC_RailOptimizedTrees_4Nodes

In the second image, NIC transfers no longer jump rails (a jump is shown as a change of color).
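
For readers without the images, the "no rail jumps" property can be stated as a small check. The following is a minimal Python sketch, not RCCL code: the `(node, nic)` endpoint representation, the `rail_jumps` helper, and both edge lists are hypothetical toy data invented for illustration, and they are not the trees produced by this change. An inter-node edge counts as a rail jump when the NIC index differs between its two endpoints.

```python
from typing import List, Tuple

# Hypothetical representation (not RCCL's internal format):
# an endpoint is (node_index, nic_index); an edge links two endpoints.
Endpoint = Tuple[int, int]
Edge = Tuple[Endpoint, Endpoint]


def rail_jumps(edges: List[Edge]) -> List[Edge]:
    """Return the inter-node edges whose endpoints use different NIC indices."""
    jumps = []
    for (src_node, src_nic), (dst_node, dst_nic) in edges:
        crosses_nodes = src_node != dst_node
        if crosses_nodes and src_nic != dst_nic:
            jumps.append(((src_node, src_nic), (dst_node, dst_nic)))
    return jumps


# Toy 4-node chains, made up for illustration only.
# Every hop stays on NIC 0, so no edge leaves its rail.
rail_optimized_example: List[Edge] = [
    ((0, 0), (1, 0)),
    ((1, 0), (2, 0)),
    ((2, 0), (3, 0)),
]

# The first and last hops switch NIC index, so they jump rails.
default_example: List[Edge] = [
    ((0, 0), (1, 1)),
    ((1, 1), (2, 1)),
    ((2, 1), (3, 2)),
]

if __name__ == "__main__":
    print(len(rail_jumps(rail_optimized_example)), "rail jumps (rail-optimized toy)")
    print(len(rail_jumps(default_example)), "rail jumps (default toy)")
```

In the attached images, a rail jump corresponds to an edge that changes color; the check above expresses the same idea over an explicit edge list.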

Approval Checklist

Do not approve until these items are satisfied.

  • [ ] Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

gilbertlee-amd, Nov 03 '25 22:11