libcudacxx
<cuda::barrier> should add try_wait
It can be beneficial to test the state of a barrier before its completion is actually required. This lets one hide the latency of the test behind computation, and that latency can be quite significant if the barrier is in device or system memory.
For example, instead of:
arrive(x)
compute1
compute2
wait(x) (usually issues a load which we wait on shortly after, exposing latency)
One could do:
arrive(x)
compute1
x = try_wait(x)
compute2
wait(x)
Here the final wait can test the value of x that was loaded earlier, with that load overlapped with compute2.
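To make the proposed ordering concrete, here is a minimal host-side sketch in standard C++. It is not the `cuda::barrier` API: `ToyBarrier`, its `try_wait` member, and `overlapped_wait_demo` are hypothetical names invented for illustration, and the barrier handles a single phase only. On a CPU this shows only the control flow; whether the probe's latency is actually hidden depends on the hardware being able to issue the load asynchronously.

```cpp
#include <atomic>
#include <thread>

// Hypothetical toy barrier for illustration only (single phase, fixed count).
// This is NOT cuda::barrier; try_wait here is an assumed, non-blocking probe.
struct ToyBarrier {
    explicit ToyBarrier(int expected) : pending(expected) {}

    // Signal that one participant has arrived.
    void arrive() { pending.fetch_sub(1, std::memory_order_acq_rel); }

    // Non-blocking test: true once every participant has arrived.
    bool try_wait() const {
        return pending.load(std::memory_order_acquire) == 0;
    }

    // Blocking wait: spins until the phase completes.
    void wait() const {
        while (!try_wait()) std::this_thread::yield();
    }

    std::atomic<int> pending;
};

// Mirrors the second sequence: arrive, compute1, try_wait, compute2, wait.
int overlapped_wait_demo() {
    ToyBarrier bar(2);
    int peer_result = 0;

    // The peer publishes its result, then arrives at the barrier.
    std::thread peer([&] {
        peer_result = 40;
        bar.arrive();
    });

    bar.arrive();                // arrive(x)
    int local = 1;               // compute1
    bool done = bar.try_wait();  // early probe, issued before the result is needed
    local += 1;                  // compute2, placed after the probe is issued
    if (!done) bar.wait();       // wait(x): block only if the probe had not succeeded

    int total = local + peer_result;  // safe: the barrier has completed
    peer.join();
    return total;
}
```

Calling `overlapped_wait_demo()` from `main` returns the combined result of both threads. The point of the design is that the blocking `wait` is skipped entirely whenever the early probe already observed completion.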