amrex icon indicating copy to clipboard operation
amrex copied to clipboard

`__AMDGCN_WAVEFRONT_SIZE` only has expected value during device pass

Open mrowan137 opened this issue 1 year ago • 5 comments

This is in reference to a previous pull request #3777. The __AMDGCN_WAVEFRONT_SIZE only has the expected (correct) value when referenced in the device pass of compilation, and is left at a default value 64 when referenced in the host pass. Unfortunately, it is apparently nontrivial to expose the device-pass value during the host pass through this macro, and I don't have a suggestion from compiler folks about exposing this value at compile time another way. They aim to add a warning when the host-pass detects this macro is used, but as for compile-time value I'm not sure I have any good idea. Manual selection throughifdefing on the architecture could be inconvenient to maintain.

mrowan137 avatar Mar 07 '24 04:03 mrowan137

@mrowan137 Thanks for looking into this! I guess we will set the wavefront size at compile time with our own macro. For now it's 32 for gfx10?? and 64 for all others.

https://github.com/llvm/llvm-project/blob/efc063b621ea0c4d1e452bcade62f7fc7e1cc937/clang/test/Driver/amdgpu-macros.cl#L70-L115

@psychocoderhpc

WeiqunZhang avatar Mar 07 '24 05:03 WeiqunZhang

We might have a way to detect at compile time if the macro we set is correct.

__host__ __device__ void foo ()
{
#ifdef __HIP_DEVICE_COMPILE__
    static_assert(__AMDGCN_WAVEFRONT_SIZE == AMREX_AMDGCN_WAVEFRONT_SIZE);
#endif
}

$ hipcc -DAMREX_AMDGCN_WAVEFRONT_SIZE=64 --offload-arch=gfx90a -c foo.cpp

$ hipcc -DAMREX_AMDGCN_WAVEFRONT_SIZE=32 --offload-arch=gfx90a -c foo.cpp
foo.cpp:5:5: error: static assertion failed due to requirement '64 == 32'
    static_assert(__AMDGCN_WAVEFRONT_SIZE == AMREX_AMDGCN_WAVEFRONT_SIZE);
    ^             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated when compiling for gfx90a.

$ hipcc -DAMREX_AMDGCN_WAVEFRONT_SIZE=64 --offload-arch=gfx1010 -c foo.cpp
foo.cpp:5:5: error: static assertion failed due to requirement '32 == 64'
    static_assert(__AMDGCN_WAVEFRONT_SIZE == AMREX_AMDGCN_WAVEFRONT_SIZE);
    ^             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated when compiling for gfx1010.

$ hipcc -DAMREX_AMDGCN_WAVEFRONT_SIZE=32 --offload-arch=gfx1010 -c foo.cpp

WeiqunZhang avatar Mar 07 '24 06:03 WeiqunZhang

@WeiqunZhang Thanks for pinging me in this issue:

In our projects, we use this macro if caste for the detection

https://github.com/alpaka-group/mallocMC/blob/bffe2aa2da5e83d356ff8d32f392935b8f7a59fa/src/include/mallocMC/mallocMC_utils.hpp#L74C7-L87

where MALLOCMC_DEVICE_COMPILE is the detection if we are in the host or device compile pass. The only clean device pass detection I found was

https://github.com/alpaka-group/mallocMC/blob/bffe2aa2da5e83d356ff8d32f392935b8f7a59fa/src/include/mallocMC/mallocMC_utils.hpp#L53

__HIP_DEVICE_COMPILE__ is not enough to address the device pass only, therefore, I always check additionally for __HIP_ARCH_HAS_WARP_BALLOT__, which is available in all modern AMD GPUs.

psychocoderHPC avatar Mar 08 '24 08:03 psychocoderHPC

The following example works fine if you do not use static_assert because due to a strange parsing stage in the compiler, the device functions are parsed but without setting __HIP_DEVICE_COMPILE__.

#include <hip/hip_runtime.h>
#include <iostream>


#if(defined(__HIP_ARCH_HAS_WARP_BALLOT__) || defined(__CUDA_ARCH__) || __HIP_DEVICE_COMPILE__ == 1)
#    define MALLOCMC_DEVICE_COMPILE 1
#endif

#if defined(__CUDA_ARCH__)
    static constexpr auto myWarpSize = 32; // TODO
#elif(MALLOCMC_DEVICE_COMPILE)
// defined:
// https://github.com/llvm/llvm-project/blob/62ec4ac90738a5f2d209ed28c822223e58aaaeb7/clang/lib/Basic/Targets/AMDGPU.cpp#L400
// overview wave front size:
// https://github.com/llvm/llvm-project/blob/efc063b621ea0c4d1e452bcade62f7fc7e1cc937/clang/test/Driver/amdgpu-macros.cl#L70-L115
// gfx10XX has 32 threads per wavefront else 64
#    if(HIP_VERSION_MAJOR >= 4)
    static constexpr auto myWarpSize = __AMDGCN_WAVEFRONT_SIZE;
#    else
    static constexpr auto myWarpSize = 64;
#    endif
#else
    static constexpr auto myWarpSize = 1;
#endif

__host__ __device__ void foo()
{
    //static_assert(myWarpSize == EXTERN_DEFINED_WAVE_SIZE);
    constexpr int detectedWarpSize = myWarpSize;
    printf("device warp size: %i\n",(int)(detectedWarpSize));
}

__global__ void fooKernel()
{
        foo();
}

int main()
{

        fooKernel<<<1,1>>>();
        hipDeviceSynchronize();
        std::cout<<"host warp size:   "<<__AMDGCN_WAVEFRONT_SIZE<<std::endl;

        return 0;
}
hipcc main.cpp -std=c++17 -DEXTERN_DEFINED_WAVE_SIZE=32

result on a AMD GPU with wave size 32.

./a.out 
device warp size: 32
host warp size:   64

psychocoderHPC avatar Mar 08 '24 08:03 psychocoderHPC

Sorry I have not carefully enough read the initial comment. Getting the warp size at compile time looks currently not possible.

psychocoderHPC avatar Mar 08 '24 09:03 psychocoderHPC

@WeiqunZhang @ax3l Issuance of Clang warning is implemented here: https://github.com/llvm/llvm-project/pull/91478

mrowan137 avatar May 16 '24 16:05 mrowan137