Performance WGPU: Improve Reduce kernels
The reduction over a dimension is not fully parallelized: a single thread currently reduces an entire dimension on its own with a for loop. To improve the performance of these shaders, we should use a tree-based approach similar to the one described in the Nvidia CUDA documentation (https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf).
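Roughly, the kernel structure that document describes looks like the following hedged WGSL sketch (illustrative names and sizes, not a finished implementation): each workgroup tree-reduces a 256-element slice of the input in workgroup memory and writes one partial sum, and a second dispatch of the same kernel over the partial sums finishes the reduction.

```wgsl
@group(0) @binding(0) var<storage, read> input: array<f32>;
@group(0) @binding(1) var<storage, read_write> partial_sums: array<f32>;

// Shared (workgroup) memory holding one value per thread.
var<workgroup> scratch: array<f32, 256>;

@compute @workgroup_size(256)
fn reduce_sum(
    @builtin(local_invocation_id) local_id: vec3<u32>,
    @builtin(workgroup_id) group_id: vec3<u32>,
    @builtin(global_invocation_id) global_id: vec3<u32>
) {
    let lid = local_id.x;

    // Each thread loads one element (zero when out of bounds).
    var value = 0.0;
    if (global_id.x < arrayLength(&input)) {
        value = input[global_id.x];
    }
    scratch[lid] = value;
    workgroupBarrier();

    // Tree reduction: halve the number of active threads at each step, so the
    // 256 values are summed in log2(256) = 8 synchronized steps instead of a
    // 256-iteration serial loop running in a single thread.
    for (var stride = 128u; stride > 0u; stride = stride / 2u) {
        if (lid < stride) {
            scratch[lid] = scratch[lid] + scratch[lid + stride];
        }
        workgroupBarrier();
    }

    // One partial sum per workgroup remains.
    if (lid == 0u) {
        partial_sums[group_id.x] = scratch[0];
    }
}
```

The same structure should carry over to mean-dim (divide by the element count at the end) and to argmax/argmin (track an index alongside each value).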
Reduce one dimension
- [ ] Sum dim
- [ ] Mean dim
- [ ] Argmax
- [ ] Argmin
Reduce full tensor to a scalar
- [ ] Sum
I can take it.
I don't think we have any benchmark for the reduction over a dimension, so I guess it makes sense to add a separate task for it.
Hey there! I'm interested in this issue as well. I've crudely implemented Mark Harris' algorithm for parallel reduction with WGSL and wgpu, and I'm doing some experiments/benchmarks to improve performance. Here's my GitHub repo in which I'm running the experiments.
I'm still in very early experimentation and have only implemented sum-reduce for f32 in 1D. For now, one result I got on my hardware (a laptop with an NVIDIA RTX 2060) is that my implementation is about 1.66x faster at an input size of 2^26 (i.e. 67,108,864 elements). Here's the plot generated with Criterion for three input sizes (2^24, 2^25, 2^26), comparing CPU, Burn, and my own wgpu implementation:
The above plot is evidently overly optimistic and cherry-picked. Another benchmark, run over input sizes ranging from 2^20 to 2^28 (apparently the maximum for my GPU), shows a less optimistic picture:
For maximum reproducibility, the benchmarks above were run against the implementation at this commit in my repo.
I believe this improvement over a narrow range can be explained by the fact that my implementation chooses a "work per thread" size of $k \cdot 2^{\lceil \log_2 \lceil \log_2 n \rceil \rceil}$, i.e. $k$ times the smallest power of two greater than or equal to $\lceil \log_2 n \rceil$, where $n$ is the input size in elements. In the benchmark above, I use $k = 8$.
My rationale was based on Mark Harris' slides that @louisfd linked above, in which he says: "Brent’s theorem says each thread should sum $O(\log n)$ elements [but] In my experience, [it is] beneficial to push it even further". He cites three reasons for this:
- Possibly better latency hiding with more work per thread
- More threads per block reduces levels in tree of recursive kernel invocations
- High kernel launch overhead in last levels with few blocks
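To make "more work per thread" concrete, here is a hedged WGSL sketch of the coarsened variant, with illustrative constants rather than the exact code in my repo (my benchmark runs use workgroup_size = 1024 as in the table below; the sketch keeps 256 threads to stay short): each thread first accumulates `work_per_thread` strided elements serially, and only then does the workgroup tree-reduce the 256 per-thread partial sums as before.

```wgsl
@group(0) @binding(0) var<storage, read> input: array<f32>;
@group(0) @binding(1) var<storage, read_write> partial_sums: array<f32>;

var<workgroup> scratch: array<f32, 256>;

@compute @workgroup_size(256)
fn reduce_sum_coarsened(
    @builtin(local_invocation_id) local_id: vec3<u32>,
    @builtin(workgroup_id) group_id: vec3<u32>
) {
    let workgroup_size = 256u;
    let work_per_thread = 256u; // would be chosen on the host, e.g. with the formula above
    let lid = local_id.x;
    let base = group_id.x * workgroup_size * work_per_thread + lid;

    // Serial phase: strided so that neighbouring threads read neighbouring
    // elements (coalesced memory access).
    var acc = 0.0;
    for (var j = 0u; j < work_per_thread; j = j + 1u) {
        let i = base + j * workgroup_size;
        if (i < arrayLength(&input)) {
            acc = acc + input[i];
        }
    }
    scratch[lid] = acc;
    workgroupBarrier();

    // Tree phase: unchanged, but each workgroup now covers
    // 256 * work_per_thread input elements, so far fewer workgroups and
    // follow-up passes are needed.
    for (var stride = 128u; stride > 0u; stride = stride / 2u) {
        if (lid < stride) {
            scratch[lid] = scratch[lid] + scratch[lid + stride];
        }
        workgroupBarrier();
    }

    if (lid == 0u) {
        partial_sums[group_id.x] = scratch[0];
    }
}
```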
These are the parameters for n = 2^20 to 2^28 elements with my implementation:
n | workgroup_size | work_per_thread | workgroups |
---|---|---|---|
2^20 = 1048576 | 1024 | 256 | 4 |
2^21 = 2097152 | 1024 | 256 | 8 |
2^22 = 4194304 | 1024 | 256 | 16 |
2^23 = 8388608 | 1024 | 256 | 32 |
2^24 = 16777216 | 1024 | 256 | 64 |
2^25 = 33554432 | 1024 | 256 | 128 |
2^26 = 67108864 | 1024 | 256 | 256 |
2^27 = 134217728 | 1024 | 256 | 512 |
2^28 = 268435456 | 1024 | 256 | 1024 |
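As a quick sanity check of the formula against the table (my own arithmetic, worth double-checking): taking $n = 2^{26}$ and $k = 8$,

$$
\lceil \log_2 n \rceil = 26, \qquad 2^{\lceil \log_2 26 \rceil} = 2^5 = 32, \qquad \text{work per thread} = 8 \cdot 32 = 256,
$$

and with workgroup_size = 1024 that gives $2^{26} / (1024 \cdot 256) = 2^{26} / 2^{18} = 256$ workgroups, which matches the corresponding row above.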
Help on this is greatly appreciated! I'm still a beginner with GPGPU, and I've definitely missed many of the optimizations mentioned in the CUDA webinar.
@louisfd has worked on this issue in the past month. I think we now have two different algorithms for `sum-dim` and `mean-dim`. Not sure we have something new for `sum` and `mean`.
We also have benchmarks in `burn-wgpu/benches/reduction.rs`, so you could compare the new algorithms.
Oh great! I hadn't seen it. I'll look into it ASAP