Enable MSCCL++ UBR tests for AllReduce and AllGather with TestBed::RunSimpleSweep

Open isaki001 opened this issue 9 months ago • 1 comment

Details

Do not mention proprietary info or link to internal work items in this PR.

Work item: "Internal", or link to GitHub issue (if applicable).

What were the changes?
Added unit tests for MSCCL++ AllGather and AllReduce in UBR mode.

Why were the changes made?
Previously, these unit tests used standalone routines rather than the TestBed infrastructure. Attempts to use TestBed::RunSimpleSweep caused a hang during ncclCommRegister when MSCCL++ was enabled.

How was the outcome achieved?
Added input/output buffer registration and made AllocateMem non-blocking.

Additional Documentation:
MSCCL++ does not support single-process mode, so these unit tests fail unless UT_PROCESS_MASK is set to 2. This is why each added test calls setenv/unsetenv within its own scope.
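
For illustration only, the per-test setenv/unsetenv scoping mentioned above could be wrapped in a small RAII guard. This is a sketch, not the code in this PR: `ScopedEnvVar` and `MaskSeenInsideTest` are hypothetical names, restoring a previously set value is an added assumption, and the snippet assumes a POSIX platform and C++17.

```cpp
#include <cassert>
#include <cstdlib>   // getenv; setenv/unsetenv are POSIX extensions
#include <optional>
#include <string>

// Hypothetical helper (not part of RCCL): sets an environment variable for
// the lifetime of the object and restores the prior state on destruction.
class ScopedEnvVar {
public:
  ScopedEnvVar(const std::string& name, const std::string& value) : name_(name) {
    assert(!name.empty());
    // Remember the old value, if any, so it can be restored later.
    if (const char* old = std::getenv(name_.c_str())) previous_ = old;
    setenv(name_.c_str(), value.c_str(), /*overwrite=*/1);
  }

  ~ScopedEnvVar() {
    if (previous_) {
      setenv(name_.c_str(), previous_->c_str(), /*overwrite=*/1);
    } else {
      unsetenv(name_.c_str());
    }
  }

  ScopedEnvVar(const ScopedEnvVar&) = delete;
  ScopedEnvVar& operator=(const ScopedEnvVar&) = delete;

private:
  std::string name_;
  std::optional<std::string> previous_;
};

// Stand-in for a test body: returns the mask value visible inside the scope.
std::string MaskSeenInsideTest() {
  ScopedEnvVar guard("UT_PROCESS_MASK", "2");
  const char* v = std::getenv("UT_PROCESS_MASK");
  return v ? v : "";
}
```

A guard like this keeps the override from leaking into other tests even when a test returns early, which is harder to guarantee with a manual setenv/unsetenv pair.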

Approval Checklist

Do not approve until these items are satisfied.

  • [ ] Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

isaki001, Mar 20 '25 20:03

Archiving this PR. Please remove noCI label when ready, or close this PR if not needed.

nileshnegi, Jul 24 '25 19:07