GPU-based reductions with dynamic block sizes don't compile
var Arr = [1,2,3];
var sum = 0;
proc foo(x) do return x;
@gpu.blockSize(foo(128))
forall a in Arr with (+ reduce sum) {
  sum += a;
}
writeln(sum);
This results in:
$CHPL_HOME/reduceTest.chpl:8: internal error: unable to find function chpl_gpu_dev_sum_breduce_int64_t
This happens because our device-side reduction functions are specialized for block size. In other words, they have names such as chpl_gpu_dev_sum_breduce_int64_t_128, so the compiler cannot find a matching function when the block size is not known at compile time.
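For comparison, here is a minimal sketch of the variant I would expect to compile, since a literal block size lets the compiler pick a specialization like the _128 one named above. It assumes a GPU-enabled build (CHPL_LOCALE_MODEL=gpu) with at least one GPU on the locale; the `on here.gpus[0]` block is my addition to make GPU execution explicit and is not part of the reproducer.

```chapel
// Sketch only: same reduction as the reproducer, but with a
// compile-time-known block size instead of foo(128).
// Assumes a GPU-enabled Chapel build with at least one GPU.
on here.gpus[0] {
  var Arr = [1, 2, 3];
  var sum = 0;

  // Literal block size, so a size-specialized device function can be named.
  @gpu.blockSize(128)
  forall a in Arr with (+ reduce sum) {
    sum += a;
  }

  writeln(sum);
}
```

Until dynamically sized kernels are supported for reductions, using a literal here is the only workaround I am aware of.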
I noticed this while looking at the compiler code for https://github.com/chapel-lang/chapel/pull/25738. As a short-term solution, I believe I can add a more meaningful error message in that PR. I am not sure whether we want to support reductions with dynamically sized kernels. One implementation we could consider, though it would likely perform poorly, is to use a 1024-thread specialization with only a set number of threads active per block. I believe CUB supports this, but I haven't used it that way, so I can't be certain.