Make runtime assert more clear on CUDA
As it stands, when a runtime assert fires on CUDA platforms, your program just explodes with no stack trace and no mention of the error that was encountered. I just spent multiple hours debugging an issue where a CUTE_RUNTIME_ASSERT was hit because I compiled for sm90 instead of sm90a. If the error message had been printed when CUTE_RUNTIME_ASSERT was hit, this would have taken thirty seconds.
My understanding is that CUTE_RUNTIME_ASSERT should never be reached in correct code, so even though printf() takes resources, it should be fine to include.
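For illustration, here is a minimal sketch of the idea: print the file, line, and message before aborting. Every `SKETCH_*` / `sketch_*` name is hypothetical, and this is not CuTe's actual definition of CUTE_RUNTIME_ASSERT; it only assumes the assert takes a message string. It builds as plain C++ and should also build under nvcc, where the device path uses `__trap()`.

```cpp
#include <stdio.h>
#include <stdlib.h>

// All SKETCH_* / sketch_* names are hypothetical; this is not CuTe's real macro.
#if defined(__CUDACC__)
#  define SKETCH_HOST_DEVICE __host__ __device__
#else
#  define SKETCH_HOST_DEVICE
#endif

// Print where the assert fired and why. Returns whatever printf returns.
SKETCH_HOST_DEVICE inline int sketch_report(const char* file, int line,
                                            const char* msg) {
  return printf("Runtime assert at %s:%d: %s\n", file, line, msg);
}

// Device code traps the kernel; host code aborts the process.
#if defined(__CUDA_ARCH__)
#  define SKETCH_ABORT() __trap()
#else
#  define SKETCH_ABORT() abort()
#endif

// Report first, then stop, so the message is emitted before execution ends.
#define SKETCH_RUNTIME_ASSERT(msg)                 \
  do {                                             \
    sketch_report(__FILE__, __LINE__, (msg));      \
    SKETCH_ABORT();                                \
  } while (0)
```

A call site would look like `SKETCH_RUNTIME_ASSERT("built for sm90, but this path needs sm90a");`. Device-side printf output is buffered, so whether the message reaches the console after a trap is worth checking on the target setup.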
This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.
This PR has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates.