
Performance WGPU: Improve Reduce kernels

Open · louisfd opened this issue 1 year ago · 4 comments

The reduction over a dimension is not fully parallelized: a single thread currently reduces an entire dimension on its own with a for loop. To improve the performance of these shaders, we should use a tree-based approach similar to the one described in the NVIDIA CUDA reduction slides (https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf).
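For illustration, here is a minimal WGSL sketch of that tree-based approach for a sum within a single workgroup; the binding layout, names, and sizes are illustrative assumptions, not the current Burn kernel:

```wgsl
// Sketch: shared-memory tree reduction (sum). Each workgroup produces one
// partial sum; a second pass (or a dimension-aware variant) combines them.

@group(0) @binding(0) var<storage, read> input: array<f32>;
@group(0) @binding(1) var<storage, read_write> partial_sums: array<f32>;

var<workgroup> shared_sum: array<f32, 256>;

@compute @workgroup_size(256)
fn sum_reduce(
    @builtin(local_invocation_id) local_id: vec3<u32>,
    @builtin(workgroup_id) wg_id: vec3<u32>,
) {
    let lid = local_id.x;
    let gid = wg_id.x * 256u + lid;

    // Load one element per invocation into workgroup (shared) memory.
    if (gid < arrayLength(&input)) {
        shared_sum[lid] = input[gid];
    } else {
        shared_sum[lid] = 0.0;
    }
    workgroupBarrier();

    // Tree reduction: halve the number of active invocations each step,
    // so 256 elements are combined in 8 steps instead of a serial loop.
    for (var stride = 128u; stride > 0u; stride = stride / 2u) {
        if (lid < stride) {
            shared_sum[lid] = shared_sum[lid] + shared_sum[lid + stride];
        }
        workgroupBarrier();
    }

    if (lid == 0u) {
        partial_sums[wg_id.x] = shared_sum[0u];
    }
}
```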

Reduce one dimension

  • [ ] Sum dim
  • [ ] Mean dim
  • [ ] Argmax
  • [ ] Argmin

Reduce full tensor to a scalar

  • [ ] Sum

louisfd avatar Jul 25 '23 17:07 louisfd

I can take it.

I don't think we have any benchmark for dimension reduction yet, so it probably makes sense to add a separate task for that.

mmalczak avatar Aug 03 '23 16:08 mmalczak

Hey there! I'm interested in this issue as well. I've crudely implemented Mark Harris' algorithm for parallel reduction with WGSL and wgpu, and I'm doing some experiments/benchmarks to improve performance. Here's my github repo in which I perform the experiments.

I'm still at a very early experimental stage, and I've only implemented sum-reduce with f32 in 1D. For now, one result I got on my hardware (a laptop with an NVIDIA RTX 2060) is that my implementation is about 1.66x faster than Burn's current wgpu kernel at an input size of 2^26 (i.e. 67,108,864 elements). Here's the plot generated with Criterion for three input sizes (2^24, 2^25, 2^26), comparing CPU, Burn, and my own wgpu implementation:

[Plot: Criterion benchmark comparing CPU, Burn, and my wgpu implementation at input sizes 2^24, 2^25, 2^26]

The above plot is admittedly over-optimistic and cherry-picked. Another benchmark, run on input sizes ranging from 2^20 to 2^28 (apparently the maximum size for my GPU), shows a less optimistic picture:

[Plot: benchmark over input sizes from 2^20 to 2^28]

For reproducibility, the benchmarks above were run against the implementation in this commit in my repo.

I believe this improvement over only a narrow range of sizes can be explained by the fact that my implementation chooses a "work per thread" size of $k \cdot 2^{\lceil \log_2 {\lceil \log_2(n) \rceil} \rceil}$, i.e. $k$ times the smallest power of two greater than or equal to $\lceil \log_2(n) \rceil$, where $n$ is the input size in number of elements. In the benchmark above, I use $k=8$.
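As a concrete check of that formula against the table below: with $n = 2^{26}$ and $k = 8$,

$$
\lceil \log_2(2^{26}) \rceil = 26, \qquad 2^{\lceil \log_2 26 \rceil} = 2^5 = 32, \qquad \text{work per thread} = 8 \cdot 32 = 256.
$$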

My rationale was based on Mark Harris' slides that @louisfd linked above, in which he says: "Brent’s theorem says each thread should sum $O(\log n)$ elements [but] In my experience, [it is] beneficial to push it even further". He cites three reasons for this:

  1. Possibly better latency hiding with more work per thread
  2. More threads per block reduces levels in tree of recursive kernel invocations
  3. High kernel launch overhead in last levels with few blocks
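Concretely, the loading phase I'm describing looks roughly like this; the constants here are only illustrative (my actual runs below use workgroup_size = 1024 and work_per_thread = 256):

```wgsl
// Sketch of the "more work per thread" loading phase: each invocation sums
// WORK_PER_THREAD strided elements serially, and only then does the workgroup
// run the usual shared-memory tree reduction on the per-invocation partials.

const WORKGROUP_SIZE: u32 = 256u;
const WORK_PER_THREAD: u32 = 8u;

@group(0) @binding(0) var<storage, read> input: array<f32>;

fn accumulate(lid: u32, wg_id: u32) -> f32 {
    // Consecutive invocations read consecutive addresses, so each pass over
    // the workgroup's slice of the input stays coalesced.
    let base = wg_id * WORKGROUP_SIZE * WORK_PER_THREAD + lid;
    var acc = 0.0;
    for (var i = 0u; i < WORK_PER_THREAD; i = i + 1u) {
        let idx = base + i * WORKGROUP_SIZE;
        if (idx < arrayLength(&input)) {
            acc = acc + input[idx];
        }
    }
    return acc;
}
```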

These are the parameters for n = 2^20 to 2^28 elements with my implementation:

| n | workgroup_size | work_per_thread | workgroups |
|---|---|---|---|
| 2^20 = 1048576 | 1024 | 256 | 4 |
| 2^21 = 2097152 | 1024 | 256 | 8 |
| 2^22 = 4194304 | 1024 | 256 | 16 |
| 2^23 = 8388608 | 1024 | 256 | 32 |
| 2^24 = 16777216 | 1024 | 256 | 64 |
| 2^25 = 33554432 | 1024 | 256 | 128 |
| 2^26 = 67108864 | 1024 | 256 | 256 |
| 2^27 = 134217728 | 1024 | 256 | 512 |
| 2^28 = 268435456 | 1024 | 256 | 1024 |

Help on this is greatly appreciated! I'm still a beginner with GPGPU, and I've definitely missed many of the optimizations mentioned in the CUDA webinar.

vini-fda avatar Jan 10 '24 19:01 vini-fda

@louisfd has worked on this issue over the past month. I think we now have two different algorithms for sum-dim and mean-dim. I'm not sure we have anything new for sum and mean.

We also have benchmarks in burn-wgpu/benches/reduction.rs, so you could compare the new algorithms.

nathanielsimard avatar Jan 11 '24 14:01 nathanielsimard

Oh great! I hadn't seen it. I'll look into it ASAP

vini-fda avatar Jan 12 '24 01:01 vini-fda