[FEA] CVT F32 -> TF32 PTX for sm80
Is your feature request related to a problem? Please describe.
Currently, converting from f32 to tf32 with round-to-nearest dispatches to a PTX cvt instruction only on sm90.
Describe the solution you'd like
If we allow rna rounding (round to nearest, ties away from zero), we can dispatch to cvt.rna.tf32.f32, which works on sm80.
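To make the rounding mode concrete, here is a minimal host-side sketch of what rna rounding does to the f32 bit pattern. The helper name `tf32_rna_emulate` is hypothetical and not part of CUTLASS or CUDA. Passing Inf/NaN through unchanged is an assumption and may not match the hardware exactly.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not a CUTLASS or CUDA API): emulates, for finite
// inputs, the effect of cvt.rna.tf32.f32 on the f32 bit pattern.
// TF32 keeps the f32 sign and 8-bit exponent but only the top 10 mantissa
// bits, so the low 13 mantissa bits have to be rounded away.
inline float tf32_rna_emulate(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));

    // Assumption: Inf/NaN (exponent all ones) are passed through untouched.
    if ((bits & 0x7F800000u) != 0x7F800000u) {
        // Adding half of the TF32 ULP (bit 12) to the sign-magnitude pattern
        // rounds to nearest and sends exact ties away from zero; the mask
        // then clears the 13 discarded bits.
        bits = (bits + 0x1000u) & 0xFFFFE000u;
    }

    float out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}
```

With this sketch, an exact tie such as 1 + 2^-11 rounds up to 1 + 2^-10, whereas round-to-nearest-even would give 1.0. Inputs that are not exact ties round the same way under both modes.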
Describe alternatives you've considered
N/A
Additional context
A simple code sample is given below:
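#include <cstdint>
#include <cstdio>
#include <cute/tensor.hpp>
#include <cutlass/numeric_conversion.h>

__global__ void f2tfK() {
    constexpr float x = -0.45466f;
    uint32_t d = 0;
    // Existing CUTLASS path: float -> tfloat32_t with the default rounding mode.
    constexpr auto f2tf = cutlass::NumericConverter<cutlass::tfloat32_t, float>{};
    // Proposed path: a single PTX instruction that is available on sm80.
    asm volatile("cvt.rna.tf32.f32 %0, %1;" : "=r"(d) : "f"(x));
    // Reinterpret the raw 32-bit result as a tfloat32_t for comparison.
    const auto res = cutlass::tfloat32_t::bitcast(d);
    const auto cRes = f2tf(x);
    printf("Intrinsic: "); cute::print(res); printf("\n");
    printf("Other: "); cute::print(cRes); printf("\n");
    printf("isEqual? %s\n", cRes == res ? "yes" : "no");
}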
// Output:
// Intrinsic: -0.454590
// Other: -0.454590
// isEqual? yes
@yzhaiustc To motivate this feature, see the attached SASS source analysis for the current conversion on sm80. Note that a conversion that could be a single cvt instruction currently takes more than 16 instructions (lines 3512 - 3529) in the worst case, where both if branches are taken.
Okay, could you please open a PR for this?