[FEA] BFloat16x2 Atomics
Currently, CUTLASS only implements a specialization of `atomic_add` for `half2`, but not for `nv_bfloat162`. This in turn limits `BlockStripedReduce` to a specialization for `half2` but not for `nv_bfloat162`.
Is there any reason not to provide a specialization for `nv_bfloat162`? It looks like a very simple change, but maybe I'm missing something. Thanks in advance for the help!
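
For concreteness, here is a rough sketch of the kind of specialization I have in mind. It is not CUTLASS's actual code: `atomic_add_sketch` and `accumulate_sketch` are hypothetical names made up for this example, and it assumes the `atomicAdd` overload for `__nv_bfloat162` from `cuda_bf16.h`, which as far as I know requires compute capability 8.0 or higher.

```cuda
#include <cuda_bf16.h>

// Hypothetical stand-in for a generic atomic-add functor.
// This is NOT CUTLASS's definition, only an illustration of the pattern.
template <typename T>
struct atomic_add_sketch {
  __device__ void operator()(T *ptr, const T &data) const {
    atomicAdd(ptr, data);
  }
};

// Sketch of a specialization for packed bfloat16 pairs.
// __nv_bfloat162 is the type that cuda_bf16.h aliases as nv_bfloat162.
template <>
struct atomic_add_sketch<__nv_bfloat162> {
  __device__ void operator()(__nv_bfloat162 *ptr,
                             const __nv_bfloat162 &data) const {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
    // Assumption: the packed bfloat16x2 atomicAdd overload is available
    // on sm_80 and newer.
    atomicAdd(ptr, data);
#else
    // Older architectures: deliberately a no-op in this sketch.
    (void)ptr;
    (void)data;
#endif
  }
};

// Hypothetical usage: every thread adds its own pair into one accumulator.
__global__ void accumulate_sketch(__nv_bfloat162 *acc,
                                  const __nv_bfloat162 *in, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    atomic_add_sketch<__nv_bfloat162>{}(acc, in[i]);
  }
}
```

The no-op fallback on older architectures is just a placeholder here; a real implementation would need to decide whether to fail at compile time or fall back to something like a CAS loop.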