chapel GPU-based reductions with dynamic block sizes don't compile

Open e-kayrakli opened this issue 6 months ago • 0 comments

var Arr = [1,2,3];

var sum = 0;

proc foo(x) do return x;

@gpu.blockSize(foo(128))
forall a in Arr with (+ reduce sum) {
  sum += a;
}

writeln(sum);

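This results in $CHPL_HOME/reduceTest.chpl:8: internal error: unable to find function chpl_gpu_dev_sum_breduce_int64_t. The cause is that our device-side reduction functions are specialized for block size. In other words, they are named chpl_gpu_dev_sum_breduce_int64_t_128, for example, so when the block size is not known at compile time there is no suffix to append and the lookup fails.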

I realized this while looking at the compiler code for https://github.com/chapel-lang/chapel/pull/25738. I believe I can put a more meaningful error message in that PR as a short-term solution. I am not sure whether we want to support reductions with dynamically sized kernels. One possible implementation, though it would likely perform poorly, could use a 1024-thread specialization with only a set number of threads active per block. I believe CUB supports this, but I haven't used it to be certain.
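For contrast, here is a minimal sketch of the same loop with a literal block size. This variant is not part of the original report. It assumes that a compile-time constant lets the compiler pick the matching specialization (the _128 variant mentioned above); that follows from the naming scheme described here but has not been checked against a specific Chapel release.

```chapel
var Arr = [1, 2, 3];

var sum = 0;

// Assumption: because 128 is a literal, the block size is known at compile
// time, so the compiler can resolve the block-size-specialized device
// function instead of looking for an unsuffixed one.
@gpu.blockSize(128)
forall a in Arr with (+ reduce sum) {
  sum += a;
}

writeln(sum);
```

The only difference from the failing reproducer is that the attribute argument is a literal rather than the result of calling foo.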

Aug 10 '24 00:08 e-kayrakli
