cub
cub copied to clipboard
Incorrect semantics in grid_barrier.cuh
In "/cub/grid/grid_barrier.cuh" I noticed some incorrect semantics being used. A flag "d_sync" is used to communicate between threadblocks and enforce syncing. In lines 98 and 120, LOAD_CG operations are used to read from these variables, illustrated below.
while (ThreadLoad<LOAD_CG>(d_sync + peer_block) == 0)
{
__threadfence_block();
}
From NVIDIA documentation:
Cache operators on load or store instructions are treated as performance hints only.
The CG operation doesn't guarantee consistency. Instead, these should be changed to volatile or atomic operations which provide consistency guarantees. For example, an AtomicCAS based implementation could also be used, as follows:
while (atomicCAS(d_sync + peer_block, 1, 0) == 0);
__threadfence();
This has the added benefit of resetting the flag, making further code resetting the flag unnecessary.
Hello @AKKamath! Thank you for taking the time to report this.
It's not the only place with misused cache operators semantic in CUB. We have plans to clean up this by integrating the relaxed/release/acquire qualifiers. In some cases, it might be beneficial from the performance point of view (removing fences from spin-loops).
I'm afraid that atomicCAS
can't be used in this context. The purpose of the barrier primitive is to stop a thread until all other threads reach it. So by resetting the d_sync
, we'll release peer_block
while other blocks might not have reached the barrier, which contradicts the definition of the barrier. So we have to separate these stages: wait for all blocks to become ready and release these blocks. You might use atomicAdd(d_sync + peer_block, 0)
thought.
Yes, replacing these with relaxed/release/acquire seems like a good choice.
You're completely correct about atomicCAS, it completely erodes the point of the barrier.
Thank you for the atomicAdd recommendation.
@senior-zero could you create a new issue to describe migrating to libcu++ atomics and then close this issue in favor of that one?
Created a separate issue to track migration to libcu++ atomics. Closing this one. @AKKamath if there are any action items that I missed, please, feel free to reopen.