Issues opened by Louis Fortier-Dubois (14 results)

Dimension reduction is not fully parallelized: a single thread reduces an entire dimension by itself in a for loop. To improve the performance of these shaders, we should use a...

performance
wgpu
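As context for the issue above, here is a minimal CPU sketch of the tree reduction a parallel shader could perform per workgroup. The names and the `WORKGROUP` size are illustrative, not the actual burn-wgpu kernel:

```rust
/// Illustrative tree reduction: instead of one thread summing `n` values in a
/// serial loop, `WORKGROUP` "threads" cooperatively halve the data each step,
/// which is O(log n) parallel steps in a real shader.
const WORKGROUP: usize = 256;

fn reduce_sum(values: &[f32]) -> f32 {
    // Each "thread" first accumulates a strided slice of the input.
    let mut shared = [0.0f32; WORKGROUP];
    for (i, v) in values.iter().enumerate() {
        shared[i % WORKGROUP] += *v;
    }
    // Tree reduction over the "shared memory": halve the active threads each step.
    let mut stride = WORKGROUP / 2;
    while stride > 0 {
        for i in 0..stride {
            shared[i] += shared[i + stride];
        }
        stride /= 2;
    }
    shared[0]
}

fn main() {
    let data: Vec<f32> = (1..=1000).map(|x| x as f32).collect();
    assert_eq!(reduce_sum(&data), 500_500.0);
}
```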

In autodiff, we should have a checkpointing strategy for better memory consumption (see for instance [https://www-sop.inria.fr/tropics/papers/DauvergneHascoet06.pdf](https://www-sop.inria.fr/tropics/papers/DauvergneHascoet06.pdf)). Currently, for most operations run in the forward pass, a state will be...

performance
very hard
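A toy sketch of the trade-off checkpointing makes, not Burn's actual autodiff API: instead of always storing the forward state, a checkpointed node keeps only its input and recomputes the state on demand during backward.

```rust
/// Toy illustration of checkpointing: either keep the forward output in
/// memory, or keep only the input and recompute the output when needed.
enum Saved {
    /// Classic mode: the forward result is stored (fast backward, more memory).
    Stored(Vec<f32>),
    /// Checkpointed mode: only the input is kept; the result is recomputed
    /// during backward (slower, but lower peak memory).
    Recompute { input: Vec<f32>, forward: fn(&[f32]) -> Vec<f32> },
}

impl Saved {
    fn state(&self) -> Vec<f32> {
        match self {
            Saved::Stored(out) => out.clone(),
            Saved::Recompute { input, forward } => forward(input),
        }
    }
}

fn relu(x: &[f32]) -> Vec<f32> {
    x.iter().map(|v| v.max(0.0)).collect()
}

fn main() {
    let input = vec![-1.0, 2.0, -3.0, 4.0];
    let stored = Saved::Stored(relu(&input));
    let checkpointed = Saved::Recompute { input: input.clone(), forward: relu };
    // Both strategies yield the same state for the backward pass.
    assert_eq!(stored.state(), checkpointed.state());
}
```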

Dirty PR here, not meant to be merged. @nathanielsimard @antimora I benchmarked linear without (Linear) and with (LinearT) the new weight order change. It appears to be quite a bit slower in general...

In autotune, we create random tensors as inputs. Since random generation only works with floats, it is impossible to use autotune with int and bool tensors. Lazy solution: revert to creating...

enhancement
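A hedged sketch of one direction for the issue above, with made-up helper names: fill the autotune inputs with a constant of the right element type, which works uniformly for float, int, and bool, unlike float-only random generation.

```rust
/// Hypothetical helper (the name is illustrative): build autotune inputs by
/// filling with a constant of the requested element type instead of sampling
/// random floats, so int and bool tensors can be tuned too.
fn autotune_input<E: Copy>(shape: &[usize], fill: E) -> Vec<E> {
    vec![fill; shape.iter().product()]
}

fn main() {
    let floats = autotune_input(&[2, 3], 1.0f32);
    let ints = autotune_input(&[2, 3], 1i32);
    let bools = autotune_input(&[2, 3], true);
    assert_eq!(floats.len(), 6);
    assert_eq!(ints.len(), 6);
    assert_eq!(bools.len(), 6);
}
```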

In burn-train, several metrics can be used during training. It would be great to have more! (A sketch of what a metric implementation could look like follows below.)
- [X] Accuracy
- [X] Loss (the one in use)
- [X] CUDA utilization...

good first issue
feature
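A minimal sketch of what adding a metric could look like, assuming a simplified `Metric` trait; the real burn-train trait and its types differ.

```rust
/// Simplified stand-in for a training metric; the real burn-train `Metric`
/// trait has a richer interface (state handling, formatting, etc.).
trait Metric {
    fn update(&mut self, predictions: &[usize], targets: &[usize]);
    fn value(&self) -> f64;
}

/// Running accuracy: correct predictions over total predictions.
#[derive(Default)]
struct Accuracy {
    correct: usize,
    total: usize,
}

impl Metric for Accuracy {
    fn update(&mut self, predictions: &[usize], targets: &[usize]) {
        self.correct += predictions
            .iter()
            .zip(targets)
            .filter(|(p, t)| p == t)
            .count();
        self.total += targets.len();
    }

    fn value(&self) -> f64 {
        self.correct as f64 / self.total as f64
    }
}

fn main() {
    let mut acc = Accuracy::default();
    acc.update(&[1, 2, 3], &[1, 2, 0]);
    assert!((acc.value() - 2.0 / 3.0).abs() < 1e-9);
}
```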

The operation `slice_assign` is very important for us in the backward pass, and we still don't have it with the Candle backend. There is an [issue open on the Candle GitHub](https://github.com/huggingface/candle/issues/1351). If...

enhancement
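For reference, a 1-D illustration of the `slice_assign` semantics (write `values` over `tensor[range]`, out of place); a backend lacking the op could fall back to a copy followed by an overwrite along these lines.

```rust
use std::ops::Range;

/// 1-D illustration of `slice_assign`: return a copy of `tensor` in which
/// the elements at `range` have been replaced by `values`.
fn slice_assign(tensor: &[f32], range: Range<usize>, values: &[f32]) -> Vec<f32> {
    assert_eq!(range.len(), values.len());
    let mut out = tensor.to_vec();
    out[range].copy_from_slice(values);
    out
}

fn main() {
    let t = vec![0.0, 0.0, 0.0, 0.0];
    assert_eq!(slice_assign(&t, 1..3, &[1.0, 2.0]), vec![0.0, 1.0, 2.0, 0.0]);
}
```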

Surprisingly, convolutions are already rather fast (especially transposed convolutions), but they can probably be improved further using shared/local memory and by leveraging memory coalescing.
- [ ] Convolution 2D
- [ ] ...

performance
wgpu
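A CPU sketch of the shared-memory idea for a 1-D case, with an illustrative tile size: each workgroup would stage its input tile (plus a halo) into fast local memory once, and all threads would then read from that buffer instead of global memory.

```rust
/// Illustrative 1-D tiled convolution: the input tile plus a halo of
/// `kernel.len() - 1` elements is staged in a local buffer, the CPU analogue
/// of workgroup shared memory, so each element is fetched from "global
/// memory" only once per tile.
const TILE: usize = 4;

fn conv1d_tiled(input: &[f32], kernel: &[f32]) -> Vec<f32> {
    let halo = kernel.len() - 1;
    let out_len = input.len() - halo;
    let mut output = vec![0.0f32; out_len];
    for tile_start in (0..out_len).step_by(TILE) {
        let tile_end = (tile_start + TILE).min(out_len);
        // Stage the tile + halo into the local buffer once.
        let shared: Vec<f32> = input[tile_start..tile_end + halo].to_vec();
        // Every "thread" of the tile now reads from the staged buffer only.
        for (i, out) in output[tile_start..tile_end].iter_mut().enumerate() {
            *out = kernel.iter().zip(&shared[i..]).map(|(k, x)| k * x).sum();
        }
    }
    output
}

fn main() {
    let input: Vec<f32> = (0..10).map(|x| x as f32).collect();
    let kernel = [1.0, 1.0, 1.0];
    // Moving sum of 3: the first output is 0 + 1 + 2 = 3.
    assert_eq!(conv1d_tiled(&input, &kernel)[0], 3.0);
}
```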

At the moment, the autotune mechanism cannot support no_std [because we use Instant](https://durch.github.io/rust-goauth/time/index.html). We must analyze whether we could use something else to...

no_std
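One possible direction, sketched with made-up names: abstract benchmark timing behind a trait, so that std builds keep using `Instant` while a no_std target plugs in its own cycle counter or hardware timer.

```rust
use std::time::{Duration, Instant};

/// Hypothetical abstraction (the names are illustrative): autotune benchmarks
/// would depend on this trait instead of `std::time::Instant` directly.
trait BenchTimer {
    fn measure(&self, f: &mut dyn FnMut()) -> Duration;
}

/// std implementation backed by `Instant`; a no_std target would provide its
/// own implementation based on a platform timer.
struct StdTimer;

impl BenchTimer for StdTimer {
    fn measure(&self, f: &mut dyn FnMut()) -> Duration {
        let start = Instant::now();
        f();
        start.elapsed()
    }
}

fn main() {
    let timer = StdTimer;
    let elapsed = timer.measure(&mut || {
        let _ = (0..1_000_000u64).sum::<u64>();
    });
    println!("elapsed: {elapsed:?}");
}
```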

For matmul in WGPU, we use autotune with a key that implicitly tells us the size range of the matmul inputs. For instance, if we have [3, 2,...

enhancement
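A sketch of one common keying scheme, rounding each dimension up to the next power of two so that shapes in the same range share a tuned kernel; the shapes are illustrative, and the actual burn-wgpu key may differ.

```rust
/// Round each dimension up to the next power of two so that all shapes in
/// the same range map to the same autotune key (and thus reuse the same
/// tuned kernel) instead of re-tuning for every exact shape.
fn autotune_key(shape: &[usize]) -> Vec<usize> {
    shape.iter().map(|d| d.next_power_of_two()).collect()
}

fn main() {
    // Illustrative shapes: [3, 2, 5] and [4, 2, 7] land in the same bucket.
    assert_eq!(autotune_key(&[3, 2, 5]), vec![4, 2, 8]);
    assert_eq!(autotune_key(&[4, 2, 7]), vec![4, 2, 8]);
}
```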

Use the autotune mechanism in the WGPU backend to find the fastest kernel version of binary_elemwise and its inplace counterpart by varying the WORKGROUP argument. Take inspiration from the `burn-wgpu/src/kernel/matmul/tune`...

performance
wgpu
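A CPU-side sketch of the tuning loop for the issue above, with illustrative candidate sizes and a stubbed benchmark; the real mechanism lives in burn-wgpu's autotune module.

```rust
use std::time::{Duration, Instant};

/// Candidate WORKGROUP sizes to try; the set is illustrative.
const CANDIDATES: [u32; 4] = [8, 16, 32, 64];

/// Pick the fastest candidate by benchmarking each one. `run_kernel` stands
/// in for launching the binary elementwise kernel with a given workgroup size.
fn tune(run_kernel: impl Fn(u32)) -> u32 {
    CANDIDATES
        .iter()
        .copied()
        .min_by_key(|&wg| {
            let start = Instant::now();
            run_kernel(wg);
            start.elapsed()
        })
        .expect("at least one candidate")
}

fn main() {
    // Stub "kernel": pretend larger workgroups are faster up to 32.
    let fastest = tune(|wg| {
        let work = u64::from(if wg <= 32 { 64 / wg } else { 4 });
        std::thread::sleep(Duration::from_millis(work));
    });
    println!("fastest workgroup: {fastest}");
}
```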