
enable 224 channels

Open isaki001 opened this issue 10 months ago • 4 comments

Details

Do not mention proprietary info or link to internal work items in this PR.

Work item: "Internal", or link to GitHub issue (if applicable).

What were the changes?
Enabled 256 channels

Why were the changes made?
RCCL cannot go past 128 channels

How was the outcome achieved?
Because of the NCCL 2.22 sync, changing MAXCHANNELS alone was not enough: division-by-zero errors occurred during enqueue, so I had to widen nMaxChannels in struct ncclTaskColl from 8 bits to 16. I also had to raise the maximum allowed kernel argument size, since the budget restrictions introduced in NCCL 2.22 were causing a hang when utilizing 256 channels.
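
The division by zero described above is consistent with an 8-bit field wrapping around: 256 does not fit in 8 bits and truncates to 0. The following is a minimal sketch of that effect, not the actual RCCL code. The structs TaskColl8 and TaskColl16 and the helpers storedChannels and bytesPerChannel are hypothetical stand-ins; only the field name nMaxChannels comes from the description.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the nMaxChannels field of struct ncclTaskColl.
// The real struct has many more fields; only the field width matters here.
struct TaskColl8  { std::uint8_t  nMaxChannels; };  // before: 8-bit field
struct TaskColl16 { std::uint16_t nMaxChannels; };  // after: 16-bit field

// Store a requested channel count in the field and read back what survived.
template <typename Task>
int storedChannels(int requested) {
  assert(requested >= 0);
  Task t{};
  // Narrowing to an unsigned type keeps only the low bits (modulo 2^N).
  t.nMaxChannels = static_cast<decltype(t.nMaxChannels)>(requested);
  return t.nMaxChannels;
}

// Illustrative per-channel split (an assumption, not the enqueue logic).
// Returns -1 where the real code would have divided by zero.
template <typename Task>
long bytesPerChannel(long totalBytes, int requested) {
  int n = storedChannels<Task>(requested);
  return n == 0 ? -1 : totalBytes / n;
}
```

With the 8-bit field, a request for 256 channels is stored as 0, so any later division by the channel count fails; the 16-bit field holds 256 unchanged.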

Additional Documentation:
What else should the reviewer know?

Approval Checklist

Do not approve until these items are satisfied.

  • [ ] Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

isaki001 avatar Feb 10 '25 15:02 isaki001

Can you find out what caused the kern arg size to go over 4K? I would rather keep 4K which is exactly one page. "5K" may have unintended consequences. Please check with @BertanDogancay if there is an alternative method to reduce kern arg size.

wenkaidu avatar Feb 10 '25 16:02 wenkaidu

> Can you find out what caused the kern arg size to go over 4K? I would rather keep 4K which is exactly one page. "5K" may have unintended consequences. Please check with @BertanDogancay if there is an alternative method to reduce kern arg size.

It happens in the testBudget routine in enqueue.cc https://github.com/ROCm/rccl/blob/develop/src/enqueue.cc#L436

 ssize_t batchBytes = nWorkBatches*sizeof(struct ncclDevWorkBatch);

nWorkBatches is 256 when using 256 channels, which makes batchBytes 4096. Because budget->inArgsBytes is 4032, the budget check fails, and scheduleCollTasksToPlan returns ncclSuccess without ever reaching the point where planner->nTasksColl is decremented. This causes an infinite loop in ncclLaunchPrepare, because the do-while loop condition never becomes false: while (planner->nTasksColl + planner->nTasksP2p != 0);
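
The failure mode above can be sketched in isolation. This is a simplified model under stated assumptions, not the code in enqueue.cc: the 16-byte batch size is inferred from 4096 / 256, the comparison operator in the budget test is assumed, and fitsBudget, scheduleStep and drainTasks are hypothetical names. The loop carries an iteration cap so the sketch terminates where the real loop would spin forever.

```cpp
#include <cassert>
#include <cstddef>

// Figures from the comment above. The batch size is an inference
// (4096 bytes / 256 batches), not a value read from the RCCL headers.
constexpr std::size_t kBatchSize   = 16;    // assumed sizeof(struct ncclDevWorkBatch)
constexpr std::size_t kInArgsBytes = 4032;  // budget->inArgsBytes

// Simplified budget test: do the work batches fit in the kernel-arg budget?
inline bool fitsBudget(std::size_t nWorkBatches) {
  return nWorkBatches * kBatchSize <= kInArgsBytes;
}

// Hypothetical scheduler step: returns the number of tasks it consumed.
// On a failed budget test it reports success but consumes nothing.
inline int scheduleStep(int nTasks, std::size_t nWorkBatches) {
  if (!fitsBudget(nWorkBatches)) return 0;
  return nTasks;
}

// Models the do-while loop that drains the task count.
// Returns the iterations used, or -1 if the cap was hit (the real loop hangs).
inline int drainTasks(int nTasks, std::size_t nWorkBatches, int maxIters) {
  assert(maxIters > 0);
  int iters = 0;
  do {
    if (iters == maxIters) return -1;
    nTasks -= scheduleStep(nTasks, nWorkBatches);
    ++iters;
  } while (nTasks != 0);
  return iters;
}
```

In this model, 128 batches drain in one pass, while 256 batches never make progress because every pass returns early with the task count unchanged.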

I will check if I can fix without increasing the budget threshold.

isaki001 avatar Feb 10 '25 16:02 isaki001

@wenkaidu There doesn't seem to be a way to decrease the kernel args further and the batchBytes will be more than 4K with 256 channels. Do we want to strictly stay at 1 page? @BertanDogancay suggested two pages if it doesn't have any negative impact.

isaki001 avatar Feb 10 '25 20:02 isaki001

Do we really need 256 channels? @gilbertlee-amd, do we only need a multiple of 56, i.e. 56*4=224?
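
A quick check of how 224 channels compares with the budget figures quoted earlier in the thread. This is a sketch under two assumptions: one work batch per channel (as in the reported 256-channel case), and 16 bytes per batch (inferred from 4096 / 256). The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Both constants are taken or inferred from the thread, not from RCCL headers.
constexpr std::size_t kBytesPerBatch  = 16;    // inferred: 4096 bytes / 256 batches
constexpr std::size_t kArgBudgetBytes = 4032;  // budget->inArgsBytes as quoted

// Batch bytes needed, assuming one work batch per channel.
inline std::size_t batchBytesFor(std::size_t nChannels) {
  return nChannels * kBytesPerBatch;
}

// Would this channel count pass the (assumed) budget comparison?
inline bool channelsFit(std::size_t nChannels) {
  return batchBytesFor(nChannels) <= kArgBudgetBytes;
}

// Largest channel count whose batches fit in the given budget.
inline std::size_t maxChannelsWithin(std::size_t budgetBytes,
                                     std::size_t bytesPerBatch) {
  assert(bytesPerBatch > 0);
  return budgetBytes / bytesPerBatch;
}
```

Under these assumptions, 224 channels need 3584 bytes and stay within the 4032-byte budget, whereas 256 channels need 4096 bytes and do not.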

wenkaidu avatar Feb 10 '25 22:02 wenkaidu

Do we still need this? If not, can we close?

thananon avatar Jul 11 '25 20:07 thananon