libflatarray
add short_vec implementation for CUDA
...to utilize float, float2, float4 (arity: WARP_SIZE * 4), double (arity: WARP_SIZE), double2 (arity: WARP_SIZE * 2) and the corresponding load/store operations. needs benchmarks, obviously.
bonus points for using warp shuffle operations.
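
A rough sketch of what the float4 variant could look like, to make the arity idea concrete. Everything here is an assumption for illustration, not existing libflatarray API: the class name `warp_short_vec_float4`, the explicit `lane` parameter, the `SKETCH_HD` macro and the `warp_sum` helper are all hypothetical. The idea is that each lane of a warp holds one float4, so the warp as a whole covers WARP_SIZE * 4 floats. The code builds as plain C++ (with a stand-in float4) or with nvcc; the shuffle-based reduction is only compiled under nvcc and assumes CUDA 9 or later.

```cpp
#ifdef __CUDACC__
#define SKETCH_HD __host__ __device__
#else
#define SKETCH_HD
// Host-only stand-in for CUDA's built-in float4, so the sketch builds without nvcc.
struct float4 { float x, y, z, w; };
#endif

// Assumption: 32 lanes per warp, as on current NVIDIA GPUs.
constexpr int WARP_SIZE = 32;

// Hypothetical class, not part of libflatarray. Each lane (thread) of a warp
// holds one float4, so the whole warp represents WARP_SIZE * 4 floats.
class warp_short_vec_float4
{
public:
    static constexpr int ARITY = WARP_SIZE * 4;

    SKETCH_HD explicit warp_short_vec_float4(float v = 0.0f) : val{v, v, v, v} {}

    // Lane `lane` reads elements [4 * lane, 4 * lane + 4) starting at `data`.
    // `data` must be 16-byte aligned so the access can be a single float4 load.
    SKETCH_HD void load_aligned(const float *data, int lane)
    {
        val = reinterpret_cast<const float4*>(data)[lane];
    }

    // Counterpart to load_aligned: lane `lane` writes its 4 elements back.
    SKETCH_HD void store_aligned(float *data, int lane) const
    {
        reinterpret_cast<float4*>(data)[lane] = val;
    }

    // Element-wise addition on this lane's 4 values.
    SKETCH_HD warp_short_vec_float4 operator+(const warp_short_vec_float4& other) const
    {
        warp_short_vec_float4 ret;
        ret.val = float4{val.x + other.val.x, val.y + other.val.y,
                         val.z + other.val.z, val.w + other.val.w};
        return ret;
    }

private:
    float4 val;
};

#ifdef __CUDACC__
// Warp shuffle example: sums one float per lane across the warp.
// Assumes all 32 lanes are active; lane 0 ends up holding the total.
__device__ inline float warp_sum(float x)
{
    for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2) {
        x += __shfl_down_sync(0xffffffffu, x, offset);
    }
    return x;
}
#endif
```

In a kernel, `lane` would typically be `threadIdx.x % WARP_SIZE`. Whether the lane should be passed explicitly or derived inside the class, and whether float4 loads actually beat plain float loads here, is exactly what the benchmarks would need to show.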
moved from https://bitbucket.org/gentryx/libflatarray/issues/12/add-short_vec-implementation-for-cuda