libcudacxx
Soundness bugfix for barrier<thread_scope_block> on sm_70
For sm_70, barrier arrive has an optimization to "coalesce" all arrives with the same update to the same barrier into a single update performed by a "leader" thread.
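This optimization is missing a release fence to establish cumulativity between all coalesced threads and the leader before the leader performs the update. Without it, writes made by the coalesced threads before they arrive are not guaranteed to be ordered before the leader's update.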
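For illustration only, here is a minimal sketch of the coalescing pattern described above. It is not the libcudacxx implementation and not necessarily the change made in this PR. `coalesced_arrive`, `demo`, and the plain shared-memory counter standing in for the barrier are hypothetical, and using `__syncwarp` plus `__threadfence_block` to order the coalesced threads' writes before the leader's update is one assumed way to get that ordering. It needs `-arch=sm_70` or newer because of `__match_any_sync`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: NOT the libcudacxx code, only a sketch of the idea.
// Lanes that arrive with the same update on the same counter are coalesced
// into a single atomic update performed by one leader lane.
__device__ void coalesced_arrive(unsigned int* counter, unsigned int update) {
    unsigned int active = __activemask();
    // Lanes passing the same update value.
    unsigned int same_update = __match_any_sync(active, update);
    // Lanes targeting the same counter address.
    unsigned int same_target =
        __match_any_sync(active, reinterpret_cast<unsigned long long>(counter));
    unsigned int group = same_update & same_target;

    int leader = __ffs(group) - 1;                   // lowest lane in the group
    int lane = static_cast<int>(threadIdx.x % warpSize);  // assumes a 1-D block

    // Assumed ordering step: synchronize the group so that every member's
    // earlier writes are ordered before whatever the leader does next.
    __syncwarp(group);

    if (lane == leader) {
        // Fence followed by the atomic, so the single update acts as a
        // release on behalf of the whole group.
        __threadfence_block();
        atomicAdd(counter, __popc(group) * update);
    }
}

__global__ void demo(unsigned int* out) {
    __shared__ unsigned int counter;
    if (threadIdx.x == 0) counter = 0;
    __syncthreads();
    coalesced_arrive(&counter, 1u);
    __syncthreads();
    if (threadIdx.x == 0) *out = counter;
}

int main() {
    unsigned int* d_out = nullptr;
    unsigned int h_out = 0;
    cudaMalloc(&d_out, sizeof(unsigned int));
    demo<<<1, 64>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    std::printf("arrivals recorded: %u\n", h_out);
    return h_out == 64u ? 0 : 1;  // one arrival per thread is expected
}
```

If the ordering step is dropped, the count still comes out right, but nothing orders the non-leader lanes' earlier writes before the leader's update. That is the soundness gap this PR addresses.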
@daniellustig could maybe review?
The cumulativity fix seems reasonable to me
@wmaxey ?
Thanks for the review, David. Merging.