cub
Explicitly document synchronization requirements in Warp-level APIs
For all warp-based CUB APIs, such as WarpScan, the examples in the documentation do not use __syncwarp to synchronize the threads within a warp.
However, it seems that on Volta, threads within a warp are no longer implicitly synchronized, so __syncwarp appears to be required before and after warp-level communication.
So, do I need to use __syncwarp with WarpScan, just as __syncthreads is used with BlockScan?
Thanks!
Just to close the loop: a final sync is not needed, and the shfl-based implementation uses the shfl.sync
instructions when the target architecture requires them. Leaving this open as a reminder to address this in our next documentation push.
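
For anyone landing here later, a minimal sketch of what the answer above implies for user code: the call to cub::WarpScan is not wrapped in __syncwarp. The kernel name WarpPrefixSumKernel and the host helper RunWarpScanCheck are hypothetical names made up for this illustration, and the sketch assumes a single block of 32 threads (one full warp) with the default 32-thread logical warp.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: launched with exactly one warp (32 threads) per block.
__global__ void WarpPrefixSumKernel(const int* in, int* out)
{
    // Default logical warp size is 32 threads.
    using WarpScan = cub::WarpScan<int>;
    __shared__ typename WarpScan::TempStorage temp_storage;

    int value = in[threadIdx.x];

    // No __syncwarp() before or after the collective: per the reply above,
    // a final sync is not needed by the caller.
    WarpScan(temp_storage).InclusiveSum(value, value);

    out[threadIdx.x] = value;
}

// Hypothetical host helper: scans 32 ones and checks that lane i holds i + 1.
bool RunWarpScanCheck()
{
    const int kWarpThreads = 32;
    const size_t bytes = kWarpThreads * sizeof(int);
    std::vector<int> h_in(kWarpThreads, 1);
    std::vector<int> h_out(kWarpThreads, 0);

    int* d_in = nullptr;
    int* d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }

    bool ok = cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        WarpPrefixSumKernel<<<1, kWarpThreads>>>(d_in, d_out);
        ok = cudaDeviceSynchronize() == cudaSuccess;
    }
    if (ok) {
        ok = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    cudaFree(d_out);
    if (!ok) return false;

    // Inclusive prefix sum of all ones: lane i should hold i + 1.
    for (int i = 0; i < kWarpThreads; ++i) {
        if (h_out[i] != i + 1) return false;
    }
    return true;
}
```

Calling RunWarpScanCheck() from main and checking that it returns true is one way to try this on a Volta or newer GPU. This only covers the call pattern asked about in this issue; it is not a statement about every warp-level API in CUB.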