Benchmarks WGPU: Add benchmarks for reduce operations
Feature description
Add benchmarks for reduce operations (a sketch of the operations to cover follows the two lists below).

Reduce one dimension:
- Sum dim
- Mean dim
- Argmax
- Argmin
Reduce the full tensor to a scalar:
- Sum
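
A minimal sketch of the operations to cover, assuming burn's `Tensor` API (the `random` constructor signature and `Distribution::Default` have shifted between burn releases, so adjust to the version in use; the shape is arbitrary):

```rust
use burn::tensor::{backend::Backend, Distribution, Tensor};

/// Runs each reduction once; a benchmark harness would time these calls.
fn reduce_ops<B: Backend>(device: &B::Device) {
    let tensor = Tensor::<B, 2>::random([512, 1024], Distribution::Default, device);

    // Reduce one dimension.
    let _sum_dim = tensor.clone().sum_dim(1);
    let _mean_dim = tensor.clone().mean_dim(1);
    let _argmax = tensor.clone().argmax(1);
    let _argmin = tensor.clone().argmin(1);

    // Reduce the full tensor to a scalar.
    let _sum = tensor.sum();
}
```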
There is an open issue (https://github.com/burn-rs/burn/issues/536) to improve the performance of the reduce kernels. Before starting to work on performance, we need proper benchmarks.
I'm interested in this, and I've already done some early experimentation to measure the performance of burn externally (i.e. using it as an external crate) and compare it with my own implementation. I have a few questions about these internal benchmarks before contributing:
- How do we avoid memory allocation and device-host synchronization overhead? Or, in other words, how do we account for it in our measurements? (A warm-up-and-sync sketch follows this list.)
- Should we measure CPU time, GPU time (through timestamp writes, for example), or both? (A wgpu timestamp-query sketch also follows this list.)
- As reduction kernels have low operational intensity, should we measure them in isolation (i.e. in a hot loop), or also in a more realistic scenario (i.e. with kernels doing some work before and after the reduction)?
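
On the first question, one common pattern is to allocate inputs outside the timed region, run a few warm-up iterations (to trigger shader compilation and buffer-pool growth), and drain the device queue before reading the clock. A minimal sketch, assuming a `B::sync(&device)` hook on the backend (recent burn versions expose one, though its exact signature has changed across releases):

```rust
use std::time::{Duration, Instant};

use burn::tensor::{backend::Backend, Distribution, Tensor};

/// Times `iters` executions of `sum_dim`, excluding input allocation.
fn bench_sum_dim<B: Backend>(device: &B::Device, iters: u32) -> Duration {
    // Allocate the input once, outside the timed region.
    let tensor = Tensor::<B, 2>::random([2048, 2048], Distribution::Default, device);

    // Warm-up: triggers shader compilation and buffer-pool allocation.
    for _ in 0..10 {
        let _ = tensor.clone().sum_dim(1);
    }
    B::sync(device); // assumed hook; reading a scalar back to the host also forces completion

    let start = Instant::now();
    for _ in 0..iters {
        let _ = tensor.clone().sum_dim(1);
    }
    // Drain the queue so the elapsed time covers the GPU work,
    // not just kernel submission.
    B::sync(device);
    start.elapsed()
}
```

Without the final sync, the loop measures only submission time, which is why naive numbers can look implausibly fast.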
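On the second question, GPU time can be read with wgpu timestamp queries. The mechanism below is plain wgpu (it requires `Features::TIMESTAMP_QUERY` at device creation) and would still need to be threaded through burn-wgpu's command encoding, so treat it as a sketch of the technique rather than an integration:

```rust
use wgpu::{
    BufferDescriptor, BufferUsages, CommandEncoder, CommandEncoderDescriptor, Device, Maintain,
    MapMode, QuerySetDescriptor, QueryType, Queue,
};

/// Records two timestamps around a dispatch and returns the elapsed
/// GPU time in nanoseconds.
fn gpu_time_ns(device: &Device, queue: &Queue, dispatch: impl Fn(&mut CommandEncoder)) -> f64 {
    let queries = device.create_query_set(&QuerySetDescriptor {
        label: Some("reduce-timestamps"),
        ty: QueryType::Timestamp,
        count: 2,
    });
    // Two u64 timestamps: 16 bytes.
    let resolve = device.create_buffer(&BufferDescriptor {
        label: None,
        size: 16,
        usage: BufferUsages::QUERY_RESOLVE | BufferUsages::COPY_SRC,
        mapped_at_creation: false,
    });
    let readback = device.create_buffer(&BufferDescriptor {
        label: None,
        size: 16,
        usage: BufferUsages::MAP_READ | BufferUsages::COPY_DST,
        mapped_at_creation: false,
    });

    let mut encoder = device.create_command_encoder(&CommandEncoderDescriptor::default());
    encoder.write_timestamp(&queries, 0);
    dispatch(&mut encoder); // the reduce kernel under test
    encoder.write_timestamp(&queries, 1);
    encoder.resolve_query_set(&queries, 0..2, &resolve, 0);
    encoder.copy_buffer_to_buffer(&resolve, 0, &readback, 0, 16);
    queue.submit([encoder.finish()]);

    // Map the readback buffer and convert ticks to nanoseconds.
    let slice = readback.slice(..);
    slice.map_async(MapMode::Read, |_| {});
    device.poll(Maintain::Wait);
    let data = slice.get_mapped_range();
    let t0 = u64::from_le_bytes(data[0..8].try_into().unwrap());
    let t1 = u64::from_le_bytes(data[8..16].try_into().unwrap());
    drop(data);
    readback.unmap();
    (t1 - t0) as f64 * queue.get_timestamp_period() as f64
}
```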
There are some benchmarks in `burn-wgpu/benches/reduction.rs`, but we could put them in `backend-comparison` instead. We are missing benchmarks for global reductions such as `mean` and `sum`.
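
If these move to `backend-comparison`, an entry for the missing global reductions could look roughly like the following. This sketch uses criterion plus burn-wgpu's `WgpuBackend`/`WgpuDevice` names, all of which are assumptions: the backend type names have changed across burn versions, and `backend-comparison` may use burn's own benchmark harness instead.

```rust
use burn::tensor::{Distribution, Tensor};
use burn_wgpu::{AutoGraphicsApi, WgpuBackend, WgpuDevice};
use criterion::{criterion_group, criterion_main, Criterion};

// Assumed backend alias; adjust to the burn-wgpu version in use.
type B = WgpuBackend<AutoGraphicsApi, f32, i32>;

fn global_reductions(c: &mut Criterion) {
    let device = WgpuDevice::default();
    let tensor = Tensor::<B, 2>::random([2048, 2048], Distribution::Default, &device);

    // Scalar reductions missing from the current reduction benchmarks.
    c.bench_function("sum", |b| b.iter(|| tensor.clone().sum()));
    c.bench_function("mean", |b| b.iter(|| tensor.clone().mean()));
}

criterion_group!(benches, global_reductions);
criterion_main!(benches);
```

As written, `b.iter` times kernel submission rather than completion; a device sync (per the first question above) belongs inside the closure for the numbers to be meaningful.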