cub icon indicating copy to clipboard operation
cub copied to clipboard

Incorrect semantics in grid_barrier.cuh

Open AKKamath opened this issue 3 years ago • 2 comments

In "/cub/grid/grid_barrier.cuh" I noticed some incorrect semantics being used. A flag "d_sync" is used to communicate between threadblocks and enforce syncing. In lines 98 and 120, LOAD_CG operations are used to read from these variables, illustrated below.

while (ThreadLoad<LOAD_CG>(d_sync + peer_block) == 0)    
{    
    __threadfence_block();    
}    

From NVIDIA documentation:

Cache operators on load or store instructions are treated as performance hints only.

The CG operation doesn't guarantee consistency. Instead, these should be changed to volatile or atomic operations which provide consistency guarantees. For example, an AtomicCAS based implementation could also be used, as follows:

while (atomicCAS(d_sync + peer_block, 1, 0) == 0);
__threadfence();

This has the added benefit of resetting the flag, making further code resetting the flag unnecessary.

AKKamath avatar Aug 11 '21 16:08 AKKamath

Hello @AKKamath! Thank you for taking the time to report this.

It's not the only place with misused cache operators semantic in CUB. We have plans to clean up this by integrating the relaxed/release/acquire qualifiers. In some cases, it might be beneficial from the performance point of view (removing fences from spin-loops).

I'm afraid that atomicCAS can't be used in this context. The purpose of the barrier primitive is to stop a thread until all other threads reach it. So by resetting the d_sync, we'll release peer_block while other blocks might not have reached the barrier, which contradicts the definition of the barrier. So we have to separate these stages: wait for all blocks to become ready and release these blocks. You might use atomicAdd(d_sync + peer_block, 0) thought.

gevtushenko avatar Aug 12 '21 22:08 gevtushenko

Yes, replacing these with relaxed/release/acquire seems like a good choice.

You're completely correct about atomicCAS, it completely erodes the point of the barrier.

Thank you for the atomicAdd recommendation.

AKKamath avatar Aug 13 '21 12:08 AKKamath

@senior-zero could you create a new issue to describe migrating to libcu++ atomics and then close this issue in favor of that one?

jrhemstad avatar Aug 22 '22 16:08 jrhemstad

Created a separate issue to track migration to libcu++ atomics. Closing this one. @AKKamath if there are any action items that I missed, please, feel free to reopen.

gevtushenko avatar Aug 25 '22 08:08 gevtushenko