Add tuning to reduction kernels and improve tuning
This uses the tuning parameter from #692 to make the oversubscription parameter of the reduction kernels tunable, and adds vendor BLAS reductions to the benchmark for comparison (see the sketch after the checklist).
TODO:
- [ ] Tune CUDA
  - [ ] Pascal
  - [ ] Volta
  - [ ] Turing
  - [ ] Ampere
- [ ] Tune ROCm
  - [ ] Radeon VII
  - [ ] MI100
- [ ] Tune DPC++
  - [ ] CPU
  - [ ] DG1
  - [ ] ...
- [ ] Tune OpenMP?
- [ ] Tune multiple RHS
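
As a rough illustration of the tuning hook, here is a minimal sketch assuming a global flag/value pair along the lines of the mechanism from #692; all names are placeholders, not the actual Ginkgo interface:

```cpp
#include <cstdint>

namespace tuning {
// Set by the benchmark driver when sweeping the parameter space;
// hypothetical stand-ins for the tuning variables introduced in #692.
inline bool enabled = false;
inline std::int64_t value = 0;
}  // namespace tuning

// Default picked per device generation once the tuning results are in.
constexpr std::int64_t default_oversubscription = 4;

// The reduction kernel launcher would query this instead of hard-coding
// the oversubscription factor.
inline std::int64_t get_oversubscription()
{
    return tuning::enabled ? tuning::value : default_oversubscription;
}
```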
Some first results here from our Titan X vs. cuBLAS. The tuning parameter is the oversubscription, i.e. the number of launched warps divided by the maximum number of active warps (a small sketch of how this maps to a launch grid follows below).

For small inputs (up to roughly 10k elements), not oversubscribing at all gives the best results; then cuBLAS starts to take over. The performance for larger oversubscription counts has a bit of a dip, all at the same input size, which should be easy to eliminate with this knowledge (allocation probably also plays a role). For larger inputs, the more we oversubscribe, the more we win. So overall I think this shows that we can be on par with cuBLAS, we just need to tweak the parameters a bit. `sp_*` is cuBLAS, the rest is ours.

EDIT: Note that I am using log scale here, so small differences may not be visible.
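
For reference, a sketch of how the oversubscription factor can map to the launch grid of a reduction kernel, assuming a fixed block and warp size; this is illustrative only, not the code in `kernel_launch_reduction.hpp.inc`:

```cpp
#include <algorithm>
#include <cstdint>

// oversubscription = launched warps / max number of active warps,
// so the target grid size follows from the device's resident-warp limit.
std::int64_t reduction_grid_size(std::int64_t num_elements,
                                 std::int64_t max_active_warps,
                                 std::int64_t oversubscription,
                                 std::int64_t block_size = 512,
                                 std::int64_t warp_size = 32)
{
    const auto warps_per_block = block_size / warp_size;
    // enough blocks to reach the requested oversubscription ...
    const auto target_blocks =
        oversubscription * max_active_warps / warps_per_block;
    // ... but never more blocks than there is work for
    const auto work_blocks = (num_elements + block_size - 1) / block_size;
    return std::max<std::int64_t>(1, std::min(target_blocks, work_blocks));
}
```

With `oversubscription = 1`, only as many warps are launched as can be resident at once, which matches the observation that small inputs do best without oversubscribing.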
Error: The following files need to be formatted:
`common/cuda_hip/base/kernel_launch_reduction.hpp.inc`
You can find a formatting patch under Artifacts here or run `format!` if you have write access to Ginkgo.
This deserves some more attention beyond just playing around with tuning parameters. I'll close it for now.