[QST] How to avoid register spill in example 48
Commit id: 757275f2796bb901575c633e2a32bc76ca84ffec; device arch: Hopper.

Steps to reproduce (a sketch of the changed aliases follows this list):
- Change LayoutA to cutlass::layout::ColumnMajor.
- Change LayoutB to cutlass::layout::RowMajor.
- With these layouts the kernel runs as an RS kernel (operand A sourced from registers).
- Profiling result: register spill.
- Changing the tile to Shape<_128, _64, _32> makes no difference; the kernel still spills.
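For context, here is a minimal sketch of the modified configuration, assuming the type-alias spelling of the CUTLASS 3.x CollectiveBuilder (the aliases in example 48 itself may differ):

```cpp
#include "cutlass/gemm/collective/collective_builder.hpp"

using ElementA = float;
using LayoutA  = cutlass::layout::ColumnMajor;   // changed from RowMajor
using ElementB = float;
using LayoutB  = cutlass::layout::RowMajor;      // changed from ColumnMajor
using TileShape    = cute::Shape<cute::_128, cute::_64, cute::_32>;
using ClusterShape = cute::Shape<cute::_1, cute::_1, cute::_1>;

using CollectiveMainloop = cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    ElementA, LayoutA, 4,   // 4 fp32 elements = 16B TMA alignment
    ElementB, LayoutB, 4,
    float,                  // accumulator type
    TileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAuto,
    cutlass::gemm::collective::KernelScheduleAuto  // per the report above, this
                                                   // resolves to an RS mainloop
  >::CollectiveOp;
```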
How should this be configured to avoid the register spill?
My estimate of the total register footprint: 128 * (64 + 16) * 2 * 4 / 1024.0 = 80 KB, which is far less than the 256 KB SM register file.
Is this total register-size calculation right?
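For reference, the arithmetic itself checks out, assuming it counts a double-buffered 128 × (64 + 16) tile of 4-byte fp32 values per CTA:

$$128 \times (64 + 16) \times 2 \times 4~\mathrm{B} = 81920~\mathrm{B} = 80~\mathrm{KB} \ll 256~\mathrm{KB}$$

Note, however, that spilling is judged against the per-thread register budget (at most 255 registers per thread), not the aggregate SM register file, so a whole-CTA estimate of 80 KB does not by itself rule out spills.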
Thanks.
@thakkarV
tagging @IonThruster and @ANIKET-SHIVAM as well
@zhang662817 Have you been able to use the NT layout TF32 kernel from the CUTLASS profiler? You can copy its configuration, since we know that one does not spill and has great performance.
@thakkarV @ANIKET-SHIVAM What's the difference between float and tf32? In CUTLASS, float uses the tf32 tensor cores, and tf32 also uses 32 bits of storage in smem and the register file. Is that right?
Importantly, our case is ElementA=float, LayoutA=ColumnMajor, ElementB=float, LayoutB=RowMajor. How can we avoid register spill in this case, and why does the kernel spill?
Thanks
> What's the difference between float and tf32? In CUTLASS, float uses the tf32 tensor cores, and tf32 also uses 32 bits of storage in smem and the register file. Is that right?
If you set the op class to tensor op and the dtype to fp32, the kernel will perform a numeric conversion after loading the data from gmem. If you set it to tf32, no conversion will be performed.
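To illustrate, a minimal host-side sketch of that conversion using CUTLASS's tfloat32_t (the function name here is mine):

```cpp
#include "cutlass/tfloat32.h"
#include "cutlass/numeric_conversion.h"

// dtype = fp32 with OpClassTensorOp: the mainloop converts each element
// float -> tf32 after the gmem load, before it reaches the tensor cores.
// This is the explicit form of that conversion:
cutlass::tfloat32_t to_tf32(float x) {
  cutlass::NumericConverter<cutlass::tfloat32_t, float> convert;
  return convert(x);
}

// dtype = tf32: inputs are already tfloat32_t, so no conversion is emitted.
// Storage is 32-bit either way (gmem, smem, and registers):
static_assert(sizeof(cutlass::tfloat32_t) == sizeof(float),
              "tf32 occupies 32 bits of storage");
```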
What CUDA toolkit are you using? There were some fixes in the compiler that affected the performance of tf32 kernels in 12.1 and 12.2. They should not spill with a 128x64x32 tile shape for ping-pong and 128x128x32 tile shape for cooperative.
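A sketch of wiring those two recommendations into the builder, assuming the CUTLASS 3.x kernel-schedule tags:

```cpp
#include "cutlass/gemm/dispatch_policy.hpp"

// Ping-pong schedule with a 128x64x32 tile:
using TileShapePingpong = cute::Shape<cute::_128, cute::_64, cute::_32>;
using SchedulePingpong  = cutlass::gemm::KernelTmaWarpSpecializedPingpong;

// Cooperative schedule with a 128x128x32 tile:
using TileShapeCooperative = cute::Shape<cute::_128, cute::_128, cute::_32>;
using ScheduleCooperative  = cutlass::gemm::KernelTmaWarpSpecializedCooperative;

// Either pair replaces TileShape / KernelScheduleAuto in the
// CollectiveBuilder sketch earlier in this thread.
```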
@thakkarV
Env: CUDA 12.2; PyTorch docker: 23.10-py3.
From gmem to smem, does TMA do the data conversion? And from the accumulators to the output gmem/smem in the epilogue, the data is converted in a loop, so only a few registers are required, right?
Thanks.
With CUDA 12.3 (PyTorch docker 23.12-py3), the kernel still spills registers.
Changing the C/D dtype to tf32 to avoid the conversion makes no difference.
Have you tried the kernel from the profiler corresponding to this layout? I forget the optimized configuration that does not spill for TF32, but it is present in our profiler and I would recommend using that.
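For example, something along these lines should surface the NT tf32 kernels and their tile configurations (the exact flag spellings here are my assumption; see `cutlass_profiler --help`):

```sh
cutlass_profiler --operation=Gemm \
  --A=tf32:column --B=tf32:row \
  --m=4096 --n=4096 --k=4096
```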
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
@zhang662817 has your issue been resolved?
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.