rccl
Fix synchronization in allreduce8Read kernel
Running the allreduce8Read kernel across 64 vGPUs (in CPX mode) revealed synchronization bugs. This PR addresses them as follows:
- Synchronize threads before signaling that the outputs (outChannels) are valid. This guarantees that all data writes in the block are ordered before the corresponding signals.
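The ordering requirement in the fix above can be sketched on the host with plain C++ threads. This is only an analogy and not the RCCL kernel code: the function name `signal_after_barrier`, the `out` buffer, and the `valid` flag are hypothetical stand-ins for the block's output writes and the outChannels signal, and the atomic counter plays the role that a block-wide barrier such as `__syncthreads()` would play in a HIP kernel.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical host-side model of one thread block: `num_workers` threads each
// write one output element, then a single leader publishes a "data valid" flag.
// Returns the sum the consumer reads after observing the flag.
long signal_after_barrier(int num_workers) {
    assert(num_workers > 0);
    std::vector<long> out(num_workers, 0);  // stands in for the output buffer
    std::atomic<int> arrived{0};            // hand-rolled barrier counter
    std::atomic<bool> valid{false};         // stands in for the output signal

    std::vector<std::thread> workers;
    for (int t = 0; t < num_workers; ++t) {
        workers.emplace_back([&, t] {
            out[t] = t + 1;                                   // data write
            arrived.fetch_add(1, std::memory_order_release);  // arrive at barrier
            if (t == 0) {
                // The leader waits until every worker has arrived, so that
                // all data writes are visible to it...
                while (arrived.load(std::memory_order_acquire) != num_workers) {
                    std::this_thread::yield();
                }
                // ...and only then signals the consumer.
                valid.store(true, std::memory_order_release);
            }
        });
    }

    long sum = 0;
    std::thread consumer([&] {
        // The consumer (the peer in the analogy) waits for the signal.
        while (!valid.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        for (long v : out) sum += v;  // every write happened before the flag
    });

    for (auto& w : workers) w.join();
    consumer.join();
    return sum;
}
```

If the leader stored `valid` without first waiting on `arrived`, the consumer could observe the flag while other workers were still writing, which is the kind of race the missing thread synchronization would allow.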
The following changes do not affect correctness:
- Don't synchronize outChannels every iteration.
- Synchronize input channels only once at the beginning of the kernel execution.
@nusislam does the fix look good to you? If so, I will resolve the merge conflict.
I could not build your branch. Applying the patch failed with a build error: "error: corrupt patch at line 353"
@nusislam This branch should now build. Do you have any comments on it?
I could not reproduce any issue running the existing allreduce8Read kernel on 64 GPUs.