[release/2.5][ROCm][TunableOp] Improve identification of fastest solution (#144942)
This PR addresses some stability issues with identifying the fastest solution on AMD GPUs, particularly the MI300.
Changes include:
- An improved timer, StreamTimerNoSync
- More aggressive skipping of slow solutions
- Additional statistics for diagnostics, available with PYTORCH_TUNABLEOP_VERBOSE=3
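
As a rough illustration of the "skip slow solutions" idea in the list above, here is a minimal Python sketch. It is not the TunableOp implementation: `pick_fastest`, `skip_factor`, `warmup_iters`, `tuning_iters`, and the host-side `timer` are all hypothetical names chosen for this example, and a host clock stands in for a GPU stream timer such as StreamTimerNoSync.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


def pick_fastest(
    candidates: Dict[str, Callable[[], None]],
    warmup_iters: int = 1,
    tuning_iters: int = 10,
    skip_factor: float = 1.5,  # hypothetical threshold, not a TunableOp setting
    timer: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[str], List[str]]:
    """Return (name of fastest candidate, names skipped as too slow)."""
    best_name: Optional[str] = None
    best_time = float("inf")
    skipped: List[str] = []

    for name, run in candidates.items():
        # Warm up so one-time setup cost does not distort the probe.
        for _ in range(warmup_iters):
            run()

        # Cheap single-run probe: if it is already much slower than the
        # best average seen so far, skip the full timing loop.
        start = timer()
        run()
        probe = timer() - start
        if probe > skip_factor * best_time:
            skipped.append(name)
            continue

        # Full measurement: average over several iterations.
        start = timer()
        for _ in range(tuning_iters):
            run()
        avg = (timer() - start) / tuning_iters

        if avg < best_time:
            best_name, best_time = name, avg

    return best_name, skipped
```

The probe keeps the tuning cost of clearly slow candidates to a single run, at the price of occasionally discarding a candidate whose first measurement was noisy; a larger `skip_factor` makes the skipping less aggressive.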
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144942
Approved by: https://github.com/jeffdaily
(cherry picked from commit fd0cd6a08f706b7bb1dedb296217b6441e4fb9ff)
This is a performance improvement from upstream. So far, there have been no negative reports w.r.t. performance, so I think it's worth backporting. I will also add it to ROCm release/2.6. It cannot be trivially backported to release/2.4.
Jenkins build for acd66a22a6f79aa784015121cc22fa653ac1e9bb commit finished as FAILURE
Jenkins build for acd66a22a6f79aa784015121cc22fa653ac1e9bb commit is in progress
!cherry-pick --onto release/2.6
Created branch autogenerated/release/2.6_cherry-pick_pr-2018 and https://github.com/ROCm/pytorch/pull/2041