rocPRIM `rocprim::block_load` fails to instantiate

Describe the bug Code using block_load fails to compile with 2 issues

/opt/rocm-5.7.0/include/rocprim/block/block_load.hpp:776:5: error: static assertion failed due to requirement 'BlockSize % ::rocprim::device_warp_size() == 0': BlockSize must be a multiple of hardware warpsize
    static_assert(BlockSize % ::rocprim::device_warp_size() == 0,

/opt/rocm-5.7.0/include/hipcub/block/../backend/rocprim/block/block_load.hpp:105:25: error: no type named 'storage_type' in 'rocprim::block_load<float, 32, 4, rocprim::block_load_method::block_load_warp_transpose>'
    typename base_type::storage_type& temp_storage_;

To Reproduce Steps to reproduce the behavior: I used Docker images (ROCm 5.7 and 6.0) with the same errors:

rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1
rocm/pytorch:latest

Compile the following code: hipcc $(hipconfig -C) -c bug.hip

#include <hipcub/block/block_load.hpp>

__global__ void kernel(float *arr, int n)
{
    int tid = blockDim.x*blockIdx.x + threadIdx.x;
    if (tid < n)
        arr[tid] = arr[tid] * arr[tid];

    hipcub::BlockLoad<float, 32, 4, hipcub::BLOCK_LOAD_WARP_TRANSPOSE> a;
}

int main() { return 0; }

Expected behavior I expect code to compile without errors/warnings

Environment Make sure that ROCm is correctly installed and run the following command:

printf '=== environment\n' > environment.txt && 
printf '\n\n=== date\n' >> environment.txt && date >> environment.txt && 
printf '\n\n=== Linux Kernel\n' >> environment.txt && uname -a  >> environment.txt && 
printf '\n\n=== rocm-smi' >> environment.txt && rocm-smi  >> environment.txt && 
printf '\n\n' >> environment.txt && hipconfig  >> environment.txt && 
printf '\n\n=== rocminfo\n' >> environment.txt && rocminfo  >> environment.txt && 
printf '\n\n=== lspci VGA\n' >> environment.txt && lspci | grep -i vga >> environment.txt

Attach environment.txt

environment.txt

Errors from compilation*

In file included from /src/mamba/rocm/bug1.hip:1:
In file included from /opt/rocm/include/hipcub/block/block_load.hpp:34:
In file included from /opt/rocm/include/hipcub/block/../backend/rocprim/block/block_load.hpp:37:
/opt/rocm/include/rocprim/block/block_load.hpp:776:5: error: static assertion failed due to requirement 'BlockSize % ::rocprim::device_warp_size() == 0': BlockSize must be a multiple of hardware warpsize
    static_assert(BlockSize % ::rocprim::device_warp_size() == 0,
    ^             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/rocm/include/hipcub/block/../backend/rocprim/block/block_load.hpp:80:15: note: in instantiation of template class 'rocprim::block_load<float, 32, 4, rocprim::block_load_method::block_load_warp_transpose>' requested here
    : private ::rocprim::block_load<
              ^
/src/mamba/rocm/bug1.hip:9:72: note: in instantiation of template class 'hipcub::BlockLoad<float, 32, 4, hipcub::BLOCK_LOAD_WARP_TRANSPOSE>' requested here
    hipcub::BlockLoad<float, 32, 4, hipcub::BLOCK_LOAD_WARP_TRANSPOSE> a;
                                                                       ^
In file included from /src/mamba/rocm/bug1.hip:1:
In file included from /opt/rocm/include/hipcub/block/block_load.hpp:34:
/opt/rocm/include/hipcub/block/../backend/rocprim/block/block_load.hpp:105:25: error: no type named 'storage_type' in 'rocprim::block_load<float, 32, 4, rocprim::block_load_method::block_load_warp_transpose>'
    typename base_type::storage_type& temp_storage_;
    ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~
/src/mamba/rocm/bug1.hip:9:72: note: in instantiation of template class 'hipcub::BlockLoad<float, 32, 4, hipcub::BLOCK_LOAD_WARP_TRANSPOSE>' requested here
    hipcub::BlockLoad<float, 32, 4, hipcub::BLOCK_LOAD_WARP_TRANSPOSE> a;
                                                                       ^
In file included from /src/mamba/rocm/bug1.hip:1:
In file included from /opt/rocm/include/hipcub/block/block_load.hpp:34:
/opt/rocm/include/hipcub/block/../backend/rocprim/block/block_load.hpp:108:45: error: no type named 'storage_type' in 'rocprim::block_load<float, 32, 4, rocprim::block_load_method::block_load_warp_transpose>'
    using TempStorage = typename base_type::storage_type;
                        ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~
3 errors generated when compiling for host.

Mar 23 '24 20:03 r2d3

So I found a workaround (using __HIP_DEVICE_COMPILE__ as this seems to be a hipcc compiler bug.

#include <hipcub/block/block_load.hpp>

__global__ void kernel(float *arr, int n)
{
#if __HIP_DEVICE_COMPILE__
    int tid = blockDim.x*blockIdx.x + threadIdx.x;
    if (tid < n)
        arr[tid] = arr[tid] * arr[tid];

    hipcub::BlockLoad<float, 32, 4, hipcub::BLOCK_LOAD_WARP_TRANSPOSE> a;
#endif
}

int main() { return 0; }

Mar 25 '24 07:03 r2d3

Hi @r2d3 Thanks for creating the ticket. I took a look at your issue, and I think the main problem likely lies in that you're using an unsupported GPU architecture (gfx1011). Please refer to the GPU support matrix at https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html.

From what I can tell, aside from the unsupported GPU, the hipCUB call looks correct - you specified a warp/wavefront size of 32 in the hipcub::BlockLoad template parameter, which does match what rocminfo says is the wavefront size in your environment.txt output. Out of curiosity, if you just do something like printf("WarpSize: %d\n", warpSize); inside your kernel, what is actually reported at runtime on your gfx1011? I don't have a gfx1011 to check, but I suspect it might be 64. If I change BlockSize parameter from 32 to 64 in your code, it succesfully compiles for me.

The fact that you also had to put in #if HIP_DEVICE_COMPILE in your kernel also is suspicious - HIP_DEVICE_COMPILE should always be on by default when in device code. I suspect hipcc doesn't properly compile for gfx1011, as again it is not technically supported.

Apr 01 '24 16:04 stanleytsang-amd

Hi @stanleytsang-amd

OK gfx1011 are the AMD machines (g4ad) provided by AWS.

But even when compiling for gfx1030 and gfx1100, I got the same compilation error.

warpSize on gfx1011 returns 64. But as you say, device code should only be compiled on the device pass.

So it is more a compiler issue than a rocPRIM / hipCUB issue. Where I should file this compiler issue ?

At least, HIP_DEVICE_COMPILE allows me to compile in the Docker image on any machine.

Thanks for looking at my report.

Apr 01 '24 17:04 r2d3

Ah, yes, I think you've nailed it - the compiler behavior is unexpected to me too, even with a supported GPU as you tried out. Let me talk to the compiler team internally and I'll get back to you.

Apr 01 '24 20:04 stanleytsang-amd

Hi @r2d3 I spoke with the compiler team, and bottom line is, this is probably something rocPRIM will have to fix. As noted in the error log static_assert(BlockSize % ::rocprim::device_warp_size() == 0 fails when the host compiler looks at the kernel code during compilation - this is the compiler working as intended.
The host compiler does not "skip" device and global functions. ::rocprim::device_warp_size() merely returns the warpSize intrinsic (which in turn is defined as __AMDGCN_WAVEFRONT_SIZE), which is problematic as when the host compiler sees it, there's no way for the host compiler to get the correct value for the GPU architecture we're targeting. It's probably hardcoded to 64 for the host, which is why the static_assert fails in your example.

Using __HIP_DEVICE_COMPILE__ as you did in your example to guard all the static_assert checks in rocPRIM might be the easiest way to fix this. @mfep @Naraenda @nolmoonen do you have any thoughts on this? I also can't recall why the BlockSize check was written to be a compile-time check in the first place.

Apr 01 '24 22:04 stanleytsang-amd

Hi, that seems like a reasonable fix. We're tracking this internally and will create a seperate PR for this.

Apr 02 '24 13:04 Naraenda

Thanks @Naraenda! A more detailed explanation from the compiler team about why this is needed:

The compiler does separate compilation for host and each GPU arch. During each compilation the compiler parses and does semantic check for both host code and device code in each compilation, this is because they affect each other, e.g. template instantiation in device code affecting what kernels are available in host code. HIP program has to be parsed as a whole. The difference of these compilations is that they emit LLVM IR for different target.

In host compilation, since the compiler needs to parse device code, and there may be multiple GPU archs, the compiler assumes the default GPU arch gfx906 for device code. It makes an assumption, that is, the emitted LLVM IR for host does not depend on what GPU arch is assumed for host compilation. In general this is true, since in most cases the emitted LLVM IR for host is only affected by the signature of kernels, which does not depend on what GPU arch is. However, if you have a static_assert that is true only for device compilation, it should be guarded by HIP_DEVICE_COMPILE.

Apr 02 '24 16:04 stanleytsang-amd

Thanks @Naraenda! A more detailed explanation from the compiler team about why this is needed:

The compiler does separate compilation for host and each GPU arch. During each compilation the compiler parses and does semantic check for both host code and device code in each compilation, this is because they affect each other, e.g. template instantiation in device code affecting what kernels are available in host code. HIP program has to be parsed as a whole. The difference of these compilations is that they emit LLVM IR for different target.

In host compilation, since the compiler needs to parse device code,

For me, compiler does not need to parse device code, just the __device__/__global__ function "prototype" not their body.

NVIDIA compiler (nvcc) works this way.

Apr 04 '24 15:04 r2d3