Add tuning to reduction kernels and improve tuning
This uses the tuning parameter from #692 to make the oversubscription parameter of the reduction kernels tunable, and adds vendor BLAS reductions to the benchmark for comparison (see the sketch after the checklist).
TODO:
- [ ] Tune CUDA
  - [ ] Pascal
  - [ ] Volta
  - [ ] Turing
  - [ ] Ampere
- [ ] Tune ROCm
  - [ ] Radeon VII
  - [ ] MI100
- [ ] Tune DPC++
  - [ ] CPU
  - [ ] DG1
  - [ ] ...
- [ ] Tune OpenMP?
- [ ] Tune multiple RHS
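
As a rough illustration of the tuning hook, here is a minimal sketch assuming a global flag/value pair along the lines of the mechanism from #692; all names are placeholders, not the actual Ginkgo interface:

```cpp
#include <cstdint>

namespace tuning {
// Set by the benchmark driver when sweeping the parameter space;
// hypothetical stand-ins for the tuning variables introduced in #692.
inline bool enabled = false;
inline std::int64_t value = 0;
}  // namespace tuning

// Default picked per device generation once the tuning results are in.
constexpr std::int64_t default_oversubscription = 4;

// The reduction kernel launcher would query this instead of hard-coding
// the oversubscription factor.
inline std::int64_t get_oversubscription()
{
    return tuning::enabled ? tuning::value : default_oversubscription;
}
```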
Some first results here from our Titan X vs. cuBLAS. The tuning parameter is the oversubscription, i.e. the number of launched warps divided by the maximum number of active warps (a small sketch of how this maps to a launch grid follows below).

For small inputs (up to roughly 10k elements), not oversubscribing at all gives the best results; then cuBLAS starts to take over. The performance for larger oversubscription counts has a bit of a dip, all at the same input size, which should be easy to eliminate with this knowledge (allocation probably also plays a role). For larger inputs, the more we oversubscribe, the more we win. So overall I think this shows that we can be on par with cuBLAS, we just need to tweak the parameters a bit. `sp_*` is cuBLAS, the rest is ours.

EDIT: Note that I am using log scale here, so small differences may not be visible.
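
For reference, a sketch of how the oversubscription factor can map to the launch grid of a reduction kernel, assuming a fixed block and warp size; this is illustrative only, not the code in `kernel_launch_reduction.hpp.inc`:

```cpp
#include <algorithm>
#include <cstdint>

// oversubscription = launched warps / max number of active warps,
// so the target grid size follows from the device's resident-warp limit.
std::int64_t reduction_grid_size(std::int64_t num_elements,
                                 std::int64_t max_active_warps,
                                 std::int64_t oversubscription,
                                 std::int64_t block_size = 512,
                                 std::int64_t warp_size = 32)
{
    const auto warps_per_block = block_size / warp_size;
    // enough blocks to reach the requested oversubscription ...
    const auto target_blocks =
        oversubscription * max_active_warps / warps_per_block;
    // ... but never more blocks than there is work for
    const auto work_blocks = (num_elements + block_size - 1) / block_size;
    return std::max<std::int64_t>(1, std::min(target_blocks, work_blocks));
}
```

With `oversubscription = 1`, only as many warps are launched as can be resident at once, which matches the observation that small inputs do best without oversubscribing.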
Error: The following files need to be formatted:
`common/cuda_hip/base/kernel_launch_reduction.hpp.inc`
You can find a formatting patch under Artifacts here or run `format!` if you have write access to Ginkgo.
This deserves some more attention beyond just playing around with tuning parameters. I'll close it for now.