
Performance WGPU: Improve Reduce kernels

Open · louisfd opened this issue 1 year ago · 4 comments

The reduction over a dimension is not fully parallelized: a single thread currently reduces an entire dimension on its own with a for loop. To improve the performance of these shaders, we should use a tree-based approach similar to the one described in the NVIDIA CUDA reduction slides (https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf).
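For illustration, here is a minimal WGSL sketch of that tree-based approach for a sum within a single workgroup; the binding layout, names, and sizes are illustrative assumptions, not the current Burn kernel:

```wgsl
// Sketch: shared-memory tree reduction (sum). Each workgroup produces one
// partial sum; a second pass (or a dimension-aware variant) combines them.

@group(0) @binding(0) var<storage, read> input: array<f32>;
@group(0) @binding(1) var<storage, read_write> partial_sums: array<f32>;

var<workgroup> shared_sum: array<f32, 256>;

@compute @workgroup_size(256)
fn sum_reduce(
    @builtin(local_invocation_id) local_id: vec3<u32>,
    @builtin(workgroup_id) wg_id: vec3<u32>,
) {
    let lid = local_id.x;
    let gid = wg_id.x * 256u + lid;

    // Load one element per invocation into workgroup (shared) memory.
    if (gid < arrayLength(&input)) {
        shared_sum[lid] = input[gid];
    } else {
        shared_sum[lid] = 0.0;
    }
    workgroupBarrier();

    // Tree reduction: halve the number of active invocations each step,
    // so 256 elements are combined in 8 steps instead of a serial loop.
    for (var stride = 128u; stride > 0u; stride = stride / 2u) {
        if (lid < stride) {
            shared_sum[lid] = shared_sum[lid] + shared_sum[lid + stride];
        }
        workgroupBarrier();
    }

    if (lid == 0u) {
        partial_sums[wg_id.x] = shared_sum[0u];
    }
}
```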

Reduce one dimension

  • [ ] Sum dim
  • [ ] Mean dim
  • [ ] Argmax
  • [ ] Argmin

Reduce full tensor to a scalar

  • [ ] Sum

louisfd avatar Jul 25 '23 17:07 louisfd

I can take it.

I don't think we have any benchmark for dimension reduction yet, so it probably makes sense to add a separate task for that.

mmalczak avatar Aug 03 '23 16:08 mmalczak

Hey there! I'm interested in this issue as well. I've crudely implemented Mark Harris' algorithm for parallel reduction with WGSL and wgpu, and I'm doing some experiments/benchmarks to improve performance. Here's my github repo in which I perform the experiments.

I'm still at a very early experimental stage, and I've only implemented sum-reduce with f32 in 1D. For now, one result I got on my hardware (a laptop with an NVIDIA RTX 2060) is that my implementation is about 1.66x faster than Burn's current wgpu kernel at an input size of 2^26 (i.e. 67,108,864 elements). Here's the plot generated with Criterion for three input sizes (2^24, 2^25, 2^26), comparing CPU, Burn, and my own wgpu implementation:

[Plot: Criterion benchmark comparing CPU, Burn, and my wgpu implementation at input sizes 2^24, 2^25, 2^26]

The above plot is admittedly over-optimistic and cherry-picked. Another benchmark, run on input sizes ranging from 2^20 to 2^28 (apparently the maximum size for my GPU), shows a less optimistic picture:

[Plot: benchmark over input sizes from 2^20 to 2^28]

For reproducibility, the benchmarks above were run against the implementation in this commit in my repo.

I believe this improvement over only a narrow range of sizes can be explained by the fact that my implementation chooses a "work per thread" size of $k \cdot 2^{\lceil \log_2 {\lceil \log_2(n) \rceil} \rceil}$, i.e. $k$ times the smallest power of two greater than or equal to $\lceil \log_2(n) \rceil$, where $n$ is the input size in number of elements. In the benchmark above, I use $k=8$.
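As a concrete check of that formula against the table below: with $n = 2^{26}$ and $k = 8$,

$$
\lceil \log_2(2^{26}) \rceil = 26, \qquad 2^{\lceil \log_2 26 \rceil} = 2^5 = 32, \qquad \text{work per thread} = 8 \cdot 32 = 256.
$$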

My rationale was based on Mark Harris' slides that @louisfd linked above, in which he says: "Brent’s theorem says each thread should sum $O(\log n)$ elements [but] In my experience, [it is] beneficial to push it even further". He cites three reasons for this:

  1. Possibly better latency hiding with more work per thread
  2. More threads per block reduces levels in tree of recursive kernel invocations
  3. High kernel launch overhead in last levels with few blocks
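Concretely, the loading phase I'm describing looks roughly like this; the constants here are only illustrative (my actual runs below use workgroup_size = 1024 and work_per_thread = 256):

```wgsl
// Sketch of the "more work per thread" loading phase: each invocation sums
// WORK_PER_THREAD strided elements serially, and only then does the workgroup
// run the usual shared-memory tree reduction on the per-invocation partials.

const WORKGROUP_SIZE: u32 = 256u;
const WORK_PER_THREAD: u32 = 8u;

@group(0) @binding(0) var<storage, read> input: array<f32>;

fn accumulate(lid: u32, wg_id: u32) -> f32 {
    // Consecutive invocations read consecutive addresses, so each pass over
    // the workgroup's slice of the input stays coalesced.
    let base = wg_id * WORKGROUP_SIZE * WORK_PER_THREAD + lid;
    var acc = 0.0;
    for (var i = 0u; i < WORK_PER_THREAD; i = i + 1u) {
        let idx = base + i * WORKGROUP_SIZE;
        if (idx < arrayLength(&input)) {
            acc = acc + input[idx];
        }
    }
    return acc;
}
```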

These are the parameters for n = 2^20 to 2^28 elements with my implementation:

| n | workgroup_size | work_per_thread | workgroups |
|---|---|---|---|
| 2^20 = 1048576 | 1024 | 256 | 4 |
| 2^21 = 2097152 | 1024 | 256 | 8 |
| 2^22 = 4194304 | 1024 | 256 | 16 |
| 2^23 = 8388608 | 1024 | 256 | 32 |
| 2^24 = 16777216 | 1024 | 256 | 64 |
| 2^25 = 33554432 | 1024 | 256 | 128 |
| 2^26 = 67108864 | 1024 | 256 | 256 |
| 2^27 = 134217728 | 1024 | 256 | 512 |
| 2^28 = 268435456 | 1024 | 256 | 1024 |

Help on this is greatly appreciated! I'm still a beginner with GPGPU, and I've definitely missed many of the optimizations mentioned in the CUDA webinar.

vini-fda avatar Jan 10 '24 19:01 vini-fda

@louisfd has worked on this issue over the past month. I think we now have two different algorithms for sum-dim and mean-dim. I'm not sure we have anything new for sum and mean.

We also have benchmarks in burn-wgpu/benches/reduction.rs, so you could compare the new algorithms.

nathanielsimard avatar Jan 11 '24 14:01 nathanielsimard

Oh great! I hadn't seen it. I'll look into it ASAP

vini-fda avatar Jan 12 '24 01:01 vini-fda