cupla
add CUDA intrinsics
I have code that makes use of warpSize and __shfl_down. The latter may be impossible to implement with alpaka, but warpSize could be mapped to a meaningful value, e.g. elemDim.
Which operation do you perform with warpSize? Warps are not a layer of alpaka, so you can guard warp intrinsics with an #ifdef on NVIDIA hardware and, on all other hardware, write your algorithm as if the warp size were one.
I'm doing a __shfl_down reduction using warpSize. I guess it's very CUDA-specific anyway:
```cpp
/* reduce per warp (warpSize == 32 assumed) */
int constexpr cWarpSize = 32;
assert( cWarpSize == warpSize );
#pragma unroll
for ( int32_t warpDelta = cWarpSize / 2; warpDelta > 0; warpDelta /= 2 )
    localReduced = f( localReduced, __shfl_down( localReduced, warpDelta ) );
if ( threadIdx.x % cWarpSize == 0 )
    atomicFunc( rdpResult, localReduced, f );
```
@psychocoderHPC should this one be closed?