[QST] How to avoid register spill in example 48
Commit id: 757275f2796bb901575c633e2a32bc76ca84ffec; device arch: Hopper.

Steps to reproduce (a sketch of the changed aliases follows this list):
- Change LayoutA to cutlass::layout::ColumnMajor.
- Change LayoutB to cutlass::layout::RowMajor.
- With these layouts the kernel runs as an RS kernel (operand A sourced from registers).
- Profiling result: register spill.
- Changing the tile to Shape<_128, _64, _32> makes no difference; the kernel still spills.
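For context, here is a minimal sketch of the modified configuration, assuming the type-alias spelling of the CUTLASS 3.x CollectiveBuilder (the aliases in example 48 itself may differ):

```cpp
#include "cutlass/gemm/collective/collective_builder.hpp"

using ElementA = float;
using LayoutA  = cutlass::layout::ColumnMajor;   // changed from RowMajor
using ElementB = float;
using LayoutB  = cutlass::layout::RowMajor;      // changed from ColumnMajor
using TileShape    = cute::Shape<cute::_128, cute::_64, cute::_32>;
using ClusterShape = cute::Shape<cute::_1, cute::_1, cute::_1>;

using CollectiveMainloop = cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    ElementA, LayoutA, 4,   // 4 fp32 elements = 16B TMA alignment
    ElementB, LayoutB, 4,
    float,                  // accumulator type
    TileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAuto,
    cutlass::gemm::collective::KernelScheduleAuto  // per the report above, this
                                                   // resolves to an RS mainloop
  >::CollectiveOp;
```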
How should this be configured to avoid the register spill?
My estimate of the total register footprint: 128 * (64 + 16) * 2 * 4 / 1024.0 = 80 KB, which is far less than the 256 KB SM register file.
Is this total register-size calculation right?
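For reference, the arithmetic itself checks out, assuming it counts a double-buffered 128 × (64 + 16) tile of 4-byte fp32 values per CTA:

$$128 \times (64 + 16) \times 2 \times 4~\mathrm{B} = 81920~\mathrm{B} = 80~\mathrm{KB} \ll 256~\mathrm{KB}$$

Note, however, that spilling is judged against the per-thread register budget (at most 255 registers per thread), not the aggregate SM register file, so a whole-CTA estimate of 80 KB does not by itself rule out spills.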
Thanks.
@thakkarV
tagging @IonThruster and @ANIKET-SHIVAM as well
@zhang662817 Have you been able to use the NT layout TF32 kernel from the CUTLASS profiler? You can copy its configuration, since we know that one does not spill and has great performance.
@thakkarV @ANIKET-SHIVAM What's the difference between float and tf32? In CUTLASS, float uses the tf32 tensor cores, and tf32 also uses 32 bits of storage in smem and the register file. Is that right?
Importantly, our case is ElementA=float, LayoutA=ColumnMajor, ElementB=float, LayoutB=RowMajor. How can we avoid register spill in this case, and why does the kernel spill?
Thanks
> What's the difference between float and tf32? In CUTLASS, float uses the tf32 tensor cores, and tf32 also uses 32 bits of storage in smem and the register file. Is that right?
If you set the op class to tensor op and the dtype to fp32, the kernel will perform a numeric conversion after loading the data from gmem. If you set it to tf32, no conversion will be performed.
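To illustrate, a minimal host-side sketch of that conversion using CUTLASS's tfloat32_t (the function name here is mine):

```cpp
#include "cutlass/tfloat32.h"
#include "cutlass/numeric_conversion.h"

// dtype = fp32 with OpClassTensorOp: the mainloop converts each element
// float -> tf32 after the gmem load, before it reaches the tensor cores.
// This is the explicit form of that conversion:
cutlass::tfloat32_t to_tf32(float x) {
  cutlass::NumericConverter<cutlass::tfloat32_t, float> convert;
  return convert(x);
}

// dtype = tf32: inputs are already tfloat32_t, so no conversion is emitted.
// Storage is 32-bit either way (gmem, smem, and registers):
static_assert(sizeof(cutlass::tfloat32_t) == sizeof(float),
              "tf32 occupies 32 bits of storage");
```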
What CUDA toolkit are you using? There were some fixes in the compiler that affected the performance of tf32 kernels in 12.1 and 12.2. They should not spill with a 128x64x32 tile shape for ping-pong and 128x128x32 tile shape for cooperative.
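A sketch of wiring those two recommendations into the builder, assuming the CUTLASS 3.x kernel-schedule tags:

```cpp
#include "cutlass/gemm/dispatch_policy.hpp"

// Ping-pong schedule with a 128x64x32 tile:
using TileShapePingpong = cute::Shape<cute::_128, cute::_64, cute::_32>;
using SchedulePingpong  = cutlass::gemm::KernelTmaWarpSpecializedPingpong;

// Cooperative schedule with a 128x128x32 tile:
using TileShapeCooperative = cute::Shape<cute::_128, cute::_128, cute::_32>;
using ScheduleCooperative  = cutlass::gemm::KernelTmaWarpSpecializedCooperative;

// Either pair replaces TileShape / KernelScheduleAuto in the
// CollectiveBuilder sketch earlier in this thread.
```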
@thakkarV
Env: CUDA 12.2; PyTorch docker: 23.10-py3.
From gmem to smem, does TMA do the data conversion? And from the accumulators to the output gmem/smem in the epilogue, the data is converted in a loop, so only a few registers are required, right?
Thanks.
With CUDA 12.3 (PyTorch docker 23.12-py3), the kernel still spills registers.
Changing the C/D dtype to tf32 to avoid the conversion makes no difference.
Have you tried the kernel from the profiler corresponding to this layout? I forget the optimized configuration that does not spill for TF32, but it is present in our profiler and I would recommend using that.
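For example, something along these lines should surface the NT tf32 kernels and their tile configurations (the exact flag spellings here are my assumption; see `cutlass_profiler --help`):

```sh
cutlass_profiler --operation=Gemm \
  --A=tf32:column --B=tf32:row \
  --m=4096 --n=4096 --k=4096
```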
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
@zhang662817 has your issue been resolved?
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.