Benchmarks WGPU: Add benchmarks for reduce operations
Feature description
Add benchmarks for reduce operations (a sketch of the operations to cover follows the two lists below).

Reduce one dimension:
- Sum dim
- Mean dim
- Argmax
- Argmin
Reduce the full tensor to a scalar:
- Sum
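
A minimal sketch of the operations to cover, assuming burn's `Tensor` API (the `random` constructor signature and `Distribution::Default` have shifted between burn releases, so adjust to the version in use; the shape is arbitrary):

```rust
use burn::tensor::{backend::Backend, Distribution, Tensor};

/// Runs each reduction once; a benchmark harness would time these calls.
fn reduce_ops<B: Backend>(device: &B::Device) {
    let tensor = Tensor::<B, 2>::random([512, 1024], Distribution::Default, device);

    // Reduce one dimension.
    let _sum_dim = tensor.clone().sum_dim(1);
    let _mean_dim = tensor.clone().mean_dim(1);
    let _argmax = tensor.clone().argmax(1);
    let _argmin = tensor.clone().argmin(1);

    // Reduce the full tensor to a scalar.
    let _sum = tensor.sum();
}
```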
There is an open issue (https://github.com/burn-rs/burn/issues/536) to improve the performance of the reduce kernels. Before starting to work on performance, we need proper benchmarks.
I'm interested in this, and I've already done some early experimentation to measure the performance of burn externally (i.e. using it as an external crate) and compare it with my own implementation. I have a few questions about these internal benchmarks before contributing:
- How do we avoid memory allocation and device-host synchronization overhead? Or, in other words, how do we account for it in our measurements? (A warm-up-and-sync sketch follows this list.)
- Should we measure CPU time, GPU time (through timestamp writes, for example), or both? (A wgpu timestamp-query sketch also follows this list.)
- As reduction kernels have low operational intensity, should we measure them in isolation (i.e. in a hot loop), or also in a more realistic scenario (i.e. with kernels doing some work before and after the reduction)?
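
On the first question, one common pattern is to allocate inputs outside the timed region, run a few warm-up iterations (to trigger shader compilation and buffer-pool growth), and drain the device queue before reading the clock. A minimal sketch, assuming a `B::sync(&device)` hook on the backend (recent burn versions expose one, though its exact signature has changed across releases):

```rust
use std::time::{Duration, Instant};

use burn::tensor::{backend::Backend, Distribution, Tensor};

/// Times `iters` executions of `sum_dim`, excluding input allocation.
fn bench_sum_dim<B: Backend>(device: &B::Device, iters: u32) -> Duration {
    // Allocate the input once, outside the timed region.
    let tensor = Tensor::<B, 2>::random([2048, 2048], Distribution::Default, device);

    // Warm-up: triggers shader compilation and buffer-pool allocation.
    for _ in 0..10 {
        let _ = tensor.clone().sum_dim(1);
    }
    B::sync(device); // assumed hook; reading a scalar back to the host also forces completion

    let start = Instant::now();
    for _ in 0..iters {
        let _ = tensor.clone().sum_dim(1);
    }
    // Drain the queue so the elapsed time covers the GPU work,
    // not just kernel submission.
    B::sync(device);
    start.elapsed()
}
```

Without the final sync, the loop measures only submission time, which is why naive numbers can look implausibly fast.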
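On the second question, GPU time can be read with wgpu timestamp queries. The mechanism below is plain wgpu (it requires `Features::TIMESTAMP_QUERY` at device creation) and would still need to be threaded through burn-wgpu's command encoding, so treat it as a sketch of the technique rather than an integration:

```rust
use wgpu::{
    BufferDescriptor, BufferUsages, CommandEncoder, CommandEncoderDescriptor, Device, Maintain,
    MapMode, QuerySetDescriptor, QueryType, Queue,
};

/// Records two timestamps around a dispatch and returns the elapsed
/// GPU time in nanoseconds.
fn gpu_time_ns(device: &Device, queue: &Queue, dispatch: impl Fn(&mut CommandEncoder)) -> f64 {
    let queries = device.create_query_set(&QuerySetDescriptor {
        label: Some("reduce-timestamps"),
        ty: QueryType::Timestamp,
        count: 2,
    });
    // Two u64 timestamps: 16 bytes.
    let resolve = device.create_buffer(&BufferDescriptor {
        label: None,
        size: 16,
        usage: BufferUsages::QUERY_RESOLVE | BufferUsages::COPY_SRC,
        mapped_at_creation: false,
    });
    let readback = device.create_buffer(&BufferDescriptor {
        label: None,
        size: 16,
        usage: BufferUsages::MAP_READ | BufferUsages::COPY_DST,
        mapped_at_creation: false,
    });

    let mut encoder = device.create_command_encoder(&CommandEncoderDescriptor::default());
    encoder.write_timestamp(&queries, 0);
    dispatch(&mut encoder); // the reduce kernel under test
    encoder.write_timestamp(&queries, 1);
    encoder.resolve_query_set(&queries, 0..2, &resolve, 0);
    encoder.copy_buffer_to_buffer(&resolve, 0, &readback, 0, 16);
    queue.submit([encoder.finish()]);

    // Map the readback buffer and convert ticks to nanoseconds.
    let slice = readback.slice(..);
    slice.map_async(MapMode::Read, |_| {});
    device.poll(Maintain::Wait);
    let data = slice.get_mapped_range();
    let t0 = u64::from_le_bytes(data[0..8].try_into().unwrap());
    let t1 = u64::from_le_bytes(data[8..16].try_into().unwrap());
    drop(data);
    readback.unmap();
    (t1 - t0) as f64 * queue.get_timestamp_period() as f64
}
```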
There are some benchmarks in `burn-wgpu/benches/reduction.rs`, but we could put them in `backend-comparison` instead. We are missing benchmarks for global reductions such as `mean` and `sum`.
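
If these move to `backend-comparison`, an entry for the missing global reductions could look roughly like the following. This sketch uses criterion plus burn-wgpu's `WgpuBackend`/`WgpuDevice` names, all of which are assumptions: the backend type names have changed across burn versions, and `backend-comparison` may use burn's own benchmark harness instead.

```rust
use burn::tensor::{Distribution, Tensor};
use burn_wgpu::{AutoGraphicsApi, WgpuBackend, WgpuDevice};
use criterion::{criterion_group, criterion_main, Criterion};

// Assumed backend alias; adjust to the burn-wgpu version in use.
type B = WgpuBackend<AutoGraphicsApi, f32, i32>;

fn global_reductions(c: &mut Criterion) {
    let device = WgpuDevice::default();
    let tensor = Tensor::<B, 2>::random([2048, 2048], Distribution::Default, &device);

    // Scalar reductions missing from the current reduction benchmarks.
    c.bench_function("sum", |b| b.iter(|| tensor.clone().sum()));
    c.bench_function("mean", |b| b.iter(|| tensor.clone().mean()));
}

criterion_group!(benches, global_reductions);
criterion_main!(benches);
```

As written, `b.iter` times kernel submission rather than completion; a device sync (per the first question above) belongs inside the closure for the numbers to be meaningful.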