
Add tuning to reduction kernels and improve tuning

Open upsj opened this issue 2 years ago • 2 comments

This uses the tuning parameter infrastructure from #692 to enable tuning the oversubscription parameter for kernel reductions, and adds vendor BLAS reductions to the benchmark.
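For context, here is a minimal sketch of what a runtime-tunable oversubscription could look like. The `get_tuning_parameter` helper, the `GKO_TUNE_OVERSUBSCRIPTION` environment variable, and the default value are placeholders for illustration, not the actual mechanism from #692:

```cpp
#include <cstdlib>

// Hypothetical tuning hook: read an override from the environment,
// fall back to the compile-time default otherwise.
inline int get_tuning_parameter(const char* name, int default_value)
{
    if (const char* value = std::getenv(name)) {
        return std::atoi(value);
    }
    return default_value;
}

// Illustrative default; the real reduction kernels may use a different value.
constexpr int default_oversubscription = 16;

inline int reduction_oversubscription()
{
    return get_tuning_parameter("GKO_TUNE_OVERSUBSCRIPTION",
                                default_oversubscription);
}
```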

TODO:

  • [ ] Tune CUDA
    • [ ] Pascal
    • [ ] Volta
    • [ ] Turing
    • [ ] Ampere
  • [ ] Tune ROCm
    • [ ] Radeon VII
    • [ ] MI100
  • [ ] Tune DPC++
    • [ ] CPU
    • [ ] DG1
    • [ ] ...
  • [ ] Tune OpenMP?
  • [ ] Tune multiple RHS

upsj avatar Mar 21 '22 14:03 upsj

Some first results here from our Titan X vs. cuBLAS. The tuning parameter is the oversubscription, i.e. the number of launched warps divided by the maximum number of active warps.
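To make that definition concrete, a sketch (the helper name is made up, but the CUDA occupancy API calls are real) of how the launch grid would follow from such a factor:

```cpp
#include <cuda_runtime.h>

// Derive the grid size from an oversubscription factor, defined as
// launched warps / maximum concurrently active warps. Since all blocks
// have the same size, the warp ratio equals the block ratio.
template <typename Kernel>
int compute_grid_size(Kernel kernel, int block_size, int oversubscription)
{
    int device = 0;
    cudaGetDevice(&device);
    int sm_count = 0;
    cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, device);
    // maximum number of resident blocks of this kernel per SM
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, kernel,
                                                  block_size, 0);
    // oversubscription == 1 launches exactly as many warps as can be active
    return oversubscription * blocks_per_sm * sm_count;
}
```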

[Plot: tuning-blas]

For small inputs (up to roughly 10k elements), not oversubscribing at all gives the best results; beyond that, cuBLAS starts to take over. The performance for larger oversubscription counts has a dip, all at the same input size, which should be easy to eliminate with this knowledge (allocation probably also plays a role). For larger inputs, the more we oversubscribe, the more we win. So I think in general this shows that we can be on par with cuBLAS, we just need to tweak the parameters a bit. sp_* is cuBLAS, the rest is ours.

EDIT: Note that I am using a log scale here, so small differences may not be visible.
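As a rough illustration of that takeaway (the thresholds beyond the ~10k mark and the factors are assumptions, not values read off the plot), a size-dependent choice could look like this:

```cpp
// Illustrative heuristic only: small reductions gain nothing from extra
// launched warps, large reductions benefit from more parallelism.
inline int pick_oversubscription(long num_elements)
{
    if (num_elements < 10'000) {
        return 1;   // no oversubscription for small reductions
    }
    if (num_elements < 1'000'000) {
        return 4;   // moderate oversubscription for mid-sized inputs
    }
    return 16;      // aggressive oversubscription for large inputs
}
```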

upsj avatar Mar 21 '22 16:03 upsj

Error: The following files need to be formatted:

common/cuda_hip/base/kernel_launch_reduction.hpp.inc

You can find a formatting patch under Artifacts here or run format! if you have write access to Ginkgo

ginkgo-bot avatar Apr 11 '22 14:04 ginkgo-bot

This deserves some more attention beyond just playing around with tuning parameters. I'll close it for now.

upsj avatar Nov 21 '23 12:11 upsj