cupla
add CUDA intrinsics
I have code that makes use of warpSize and __shfl_down. The latter may be impossible to implement with alpaka, but warpSize could be mapped to a meaningful value, e.g. elemDim.
Which operation do you perform with warpSize? Warps are not a layer of alpaka, so you can guard warp intrinsics with an #ifdef on NVIDIA hardware and, on all other hardware, write your algorithm as if the warp size were one.
I'm doing a __shfl_down reduction using warpSize. I guess it's very CUDA-specific anyway:
```cpp
/* reduce per warp (warpSize == 32 assumed) */
int constexpr cWarpSize = 32;
assert( cWarpSize == warpSize );
#pragma unroll
for ( int32_t warpDelta = cWarpSize / 2; warpDelta > 0; warpDelta /= 2 )
    localReduced = f( localReduced, __shfl_down( localReduced, warpDelta ) );
if ( threadIdx.x % cWarpSize == 0 )
    atomicFunc( rdpResult, localReduced, f );
```
@psychocoderHPC should this one be closed?