enable 224 channels
Details
Do not mention proprietary info or link to internal work items in this PR.
Work item: "Internal", or link to GitHub issue (if applicable).
What were the changes?
Enabled 256 channels
Why were the changes made?
RCCL cannot go past 128 channels
How was the outcome achieved?
Due to the NCCL 2.22 sync, it was not enough to change MAXCHANNELS, as division-by-zero errors occurred during enqueue; I had to set nMaxChannels in struct ncclTaskColl to use 16 bits instead of 8. Then, I also had to adjust the maximum allowed kernel argument size, since budget restrictions introduced in NCCL 2.22 were causing a hang when utilizing 256 channels.
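For reviewers, a minimal sketch of why an 8-bit field is a plausible source of the division by zero. This is not RCCL code: `storeIn8Bits` and `storeIn16Bits` are hypothetical helpers, and the sketch assumes the failure came from the channel count wrapping to 0 when stored in 8 bits.

```cpp
#include <cstdint>

// Hypothetical helper (not from RCCL): stores a channel count in an 8-bit
// field, the width nMaxChannels had before this change.
constexpr std::uint8_t storeIn8Bits(int nChannels) {
  return static_cast<std::uint8_t>(nChannels);  // keeps only the low 8 bits
}

// Hypothetical helper: the same store with the widened 16-bit field.
constexpr std::uint16_t storeIn16Bits(int nChannels) {
  return static_cast<std::uint16_t>(nChannels);
}

// 128 still fits in 8 bits, 256 wraps to 0, and 16 bits holds 256 intact.
static_assert(storeIn8Bits(128) == 128, "128 fits in 8 bits");
static_assert(storeIn8Bits(256) == 0, "256 wraps to 0 in 8 bits");
static_assert(storeIn16Bits(256) == 256, "256 fits in 16 bits");
```

Any later division by a value stored this way would divide by 0 once the count reaches 256, which matches the symptom described above.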
Additional Documentation:
What else should the reviewer know?
Approval Checklist
Do not approve until these items are satisfied.
- [ ] Verify the CHANGELOG has been updated, if
- there are any NCCL API version changes,
- any changes impact library users, and/or
- any changes impact any other ROCm library.
Can you find out what caused the kern arg size to go over 4K? I would rather keep 4K which is exactly one page. "5K" may have unintended consequences. Please check with @BertanDogancay if there is an alternative method to reduce kern arg size.
It happens in the testBudget routine in enqueue.cc https://github.com/ROCm/rccl/blob/develop/src/enqueue.cc#L436
ssize_t batchBytes = nWorkBatches*sizeof(struct ncclDevWorkBatch);
nWorkBatches is 256 when using 256 channels, which leads to batchBytes being 4096.
Because budget->inArgsBytes is 4032, the budget check failed, causing scheduleCollTasksToPlan to return ncclSuccess without ever reaching the point where planner->nTasksColl is decremented.
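The arithmetic can be sketched as follows. This is a simplified model of the batch term only, not the actual testBudget code, which may check more than this. `kBatchSize` is an assumption derived from the numbers above (4096 bytes / 256 batches = 16 bytes), and `fitsInArgs` is a hypothetical name.

```cpp
#include <cstddef>

// Assumed stand-in for sizeof(struct ncclDevWorkBatch): 4096 / 256 = 16 bytes.
constexpr std::ptrdiff_t kBatchSize = 16;
// budget->inArgsBytes as reported in this thread.
constexpr std::ptrdiff_t kInArgsBytes = 4032;

// Bytes needed for the batch descriptors alone.
constexpr std::ptrdiff_t batchBytes(int nWorkBatches) {
  return nWorkBatches * kBatchSize;
}

// Hypothetical, simplified check: do the batches alone fit in the in-args budget?
constexpr bool fitsInArgs(int nWorkBatches) {
  return batchBytes(nWorkBatches) <= kInArgsBytes;
}

static_assert(batchBytes(256) == 4096, "256 batches need 4096 bytes");
static_assert(!fitsInArgs(256), "4096 > 4032, so 256 batches fail the check");
static_assert(fitsInArgs(128), "2048 <= 4032, so 128 batches pass");
```

Under the same 16-byte assumption, the largest batch count that fits in 4032 bytes is 252.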
This caused an infinite loop in ncclLaunchPrepare: the do-while loop there only exits once its condition, while (planner->nTasksColl + planner->nTasksP2p != 0);, becomes false, and the task count never reached zero.
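A minimal sketch of that failure mode, using hypothetical stand-ins (`Planner`, `scheduleStep`, `drain`) rather than the real RCCL types. The iteration cap exists only so the demo terminates; the real loop has no such cap.

```cpp
// Hypothetical stand-in for the planner's task counters.
struct Planner {
  int nTasksColl;
  int nTasksP2p;
};

// Mimics a scheduling step that returns early, as if successful, when the
// budget check fails, leaving nTasksColl untouched.
constexpr void scheduleStep(Planner& p, bool budgetOk) {
  if (!budgetOk) return;
  if (p.nTasksColl > 0) --p.nTasksColl;
}

// Runs the do-while loop and returns the number of iterations used,
// or -1 if the demo-only cap is hit before the counters reach zero.
constexpr int drain(Planner p, bool budgetOk, int maxIters) {
  int iters = 0;
  do {
    if (iters == maxIters) return -1;  // demo-only guard against spinning forever
    scheduleStep(p, budgetOk);
    ++iters;
  } while (p.nTasksColl + p.nTasksP2p != 0);
  return iters;
}

static_assert(drain(Planner{3, 0}, true, 100) == 3, "tasks drain in 3 steps");
static_assert(drain(Planner{3, 0}, false, 100) == -1, "never drains when budget fails");
```

One possible hardening, independent of the budget size, would be to return an error instead of success when no task can be scheduled, so the caller cannot spin.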
I will check if I can fix without increasing the budget threshold.
@wenkaidu There doesn't seem to be a way to decrease the kernel args further and the batchBytes will be more than 4K with 256 channels. Do we want to strictly stay at 1 page? @BertanDogancay suggested two pages if it doesn't have any negative impact.
Do we really need 256 channels? @gilbertlee-amd, do we only need a multiple of 56, i.e. 56*4=224?
Do we still need this? If not, can we close?