cuda-kat
Specialize functions with many reads/writes for sub-4-byte element types
We have many templated functions which perform a (potentially) large number of memory reads or writes, and which therefore benefit from coalesced memory access. However, most of them, if not all, are not specialized for element types smaller than 4 bytes, and so are slower for such types than they could be. Examples include copying, filling, and appending to global memory.
We should add specializations for these cases.
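To illustrate the kind of specialization meant here, below is a minimal host-side C++17 sketch of a fill which, for 1- and 2-byte element types, packs several copies of the value into a 32-bit word and writes whole words, with element-wise handling of the unaligned head and the leftover tail. The names `replicate_to_word` and `fill_wordwise` are hypothetical and not part of cuda-kat; the sketch also assumes `std::uint32_t` is 4 bytes wide. A device-side version would presumably distribute the words among the threads of a warp or block rather than loop over them serially.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper (not part of cuda-kat): build a 32-bit word holding
// back-to-back copies of a 1- or 2-byte value.
template <typename T>
std::uint32_t replicate_to_word(T value)
{
    static_assert(sizeof(T) == 1 || sizeof(T) == 2, "only 1- and 2-byte types");
    static_assert(std::is_trivially_copyable<T>::value, "T must be trivially copyable");
    unsigned char bytes[sizeof(std::uint32_t)];
    for (std::size_t offset = 0; offset < sizeof(bytes); offset += sizeof(T)) {
        std::memcpy(bytes + offset, &value, sizeof(T));
    }
    std::uint32_t word;
    std::memcpy(&word, bytes, sizeof(word));
    return word;
}

// Hypothetical fill (not part of cuda-kat): writes full 32-bit words for
// 1- and 2-byte element types, and falls back to element-wise writes otherwise.
template <typename T>
void fill_wordwise(T* data, std::size_t length, T value)
{
    if constexpr (sizeof(T) == 1 || sizeof(T) == 2) {
        constexpr std::size_t elements_per_word = sizeof(std::uint32_t) / sizeof(T);
        std::size_t i = 0;

        // Head: write single elements until the address is word-aligned.
        while (i < length &&
               reinterpret_cast<std::uintptr_t>(data + i) % sizeof(std::uint32_t) != 0) {
            data[i++] = value;
        }

        // Body: one 4-byte write covers elements_per_word elements.
        const std::uint32_t word = replicate_to_word(value);
        for (; i + elements_per_word <= length; i += elements_per_word) {
            std::memcpy(data + i, &word, sizeof(word));
        }

        // Tail: elements which do not fill a whole word.
        for (; i < length; ++i) {
            data[i] = value;
        }
    } else {
        // Generic path for all other element sizes: one write per element.
        for (std::size_t i = 0; i < length; ++i) {
            data[i] = value;
        }
    }
}
```

The same head/body/tail split should carry over to copying and appending, where the source and destination may additionally be misaligned relative to each other; that case is not covered by this sketch.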