
int overflow in ReduceOps.eval(Box, ...)

Open AlexanderSinn opened this issue 2 years ago • 4 comments

If the number of cells in a Box exceeds the maximum value of `int`, eval won't work correctly; in my case it did not loop over all elements and gave no error. A 64-bit index type should be used instead of `int`. See https://github.com/AMReX-Codes/amrex/blob/34c0ae3949688215a02c4ee36c427524c7863f5b/Src/Base/AMReX_Reduce.H#L504 and https://github.com/AMReX-Codes/amrex/blob/34c0ae3949688215a02c4ee36c427524c7863f5b/Src/Base/AMReX_Reduce.H#L564

AlexanderSinn avatar Jul 27 '23 16:07 AlexanderSinn

We have the same issue in ParallelFor. It wasn't really an issue before because GPUs used to not have so much memory, but now it has become one. Presumably switching to `long` will use more registers.

WeiqunZhang avatar Jul 27 '23 16:07 WeiqunZhang

In my case there isn't an allocation of the size of the box; I need to count how many particles need to be initialized. If the additional registers are a problem, one could split the box inside `.eval()` so that every chunk has fewer than `INT_MAX` cells. However, in my testing the number of registers usually has only a small impact compared to optimizing the memory access pattern.

For now, I can use

```cpp
reduce_op.eval(
    domain_box.numPts(), reduce_data,
    [=] AMREX_GPU_DEVICE (amrex::Long idx) -> ReduceTuple
    {
        auto [i, j, k] = domain_box.atOffset3d(idx).arr;
        ...
```

AlexanderSinn avatar Jul 27 '23 17:07 AlexanderSinn

We could add some assertions first before we decide what to do.

WeiqunZhang avatar Jul 27 '23 17:07 WeiqunZhang

I looked at a few things in Compiler Explorer, and it seems the main reason a 64-bit `icell` would be slower and use more registers is the 64-bit integer division by `lenxy` and `lenx`. A100 does not have division assembly instructions, so division has to be emulated. For 64 bit this is so expensive that the generated assembly even checks whether a 32-bit division (also emulated) can be used instead. Since the divisor is the same for all threads, the CPU could help by precomputing some values to replace the division on the GPU with a multiplication. I found a library that does this: https://github.com/NVIDIA/cutlass/blob/2a9fa23e06b1fc0b7fab7a3e29ff1b17e325da7f/include/cutlass/fast_math.h#L404-L488 — maybe we could do something similar. I haven't tested the performance, but I could see this being faster than the current version using 32-bit division. It needs uint128_t for the precomputation, so it's not super easy to copy.

AlexanderSinn avatar Aug 21 '23 12:08 AlexanderSinn