zhang662817 comments

Results: 6 comments of zhang662817

[QST] how to avoid register spill for example 48

![image](https://github.com/NVIDIA/cutlass/assets/20987824/7229bc68-e352-4a63-a060-1e63d3ed0a34) Explicit total register size: 128 * (64 + 16) * 2 * 4 / 1024.0 = 80K, which is far less than the SM register file of 256K. Is explicit total register size...
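
The arithmetic in the comment above can be reproduced with a short sketch. The comment does not say what each factor stands for, so the names below (`THREADS`, `REGS_PER_THREAD`, `COPIES`, `BYTES_PER_REG`) are hypothetical labels chosen for illustration. Only the numbers themselves and the 256K register file figure come from the comment.

```python
# Reproduces the estimate quoted in the comment:
#   128 * (64 + 16) * 2 * 4 / 1024.0 = 80K
# The meaning attached to each factor is an assumption, not stated in the source.
THREADS = 128              # assumed: number of threads being counted
REGS_PER_THREAD = 64 + 16  # assumed: two explicitly counted register groups
COPIES = 2                 # assumed: the factor of 2 in the comment
BYTES_PER_REG = 4          # assumed: 32-bit registers, 4 bytes each

SM_REGFILE_KIB = 256       # register file size quoted in the comment


def explicit_reg_kib(threads, regs_per_thread, copies, bytes_per_reg):
    """Return the explicitly counted register footprint in KiB."""
    return threads * regs_per_thread * copies * bytes_per_reg / 1024.0


explicit_kib = explicit_reg_kib(THREADS, REGS_PER_THREAD, COPIES, BYTES_PER_REG)
fraction = explicit_kib / SM_REGFILE_KIB

print(f"explicit registers: {explicit_kib} KiB")
print(f"fraction of register file: {fraction}")
```

This only counts the registers the comment lists explicitly. A compiler may allocate more than that for addresses, loop state, and temporaries, so a small explicit total does not by itself rule out spilling.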

[QST] how to avoid register spill for example 48

@thakkarV @ANIKET-SHIVAM What's the difference between float and tf32? In CUTLASS, float uses the tf32 tensor cores, and tf32 also uses 32 bits of storage in shared memory and the register file, is...

[QST] how to avoid register spill for example 48

@thakkarV Env: CUDA 12.2; PyTorch docker: 23.10-py3. From gmem to smem, TMA does the data conversion, right? From the accumulator to output gmem/smem in the epilogue, data is converted in the loop, only...

[QST] how to avoid register spill for example 48

With CUDA 12.3 and PyTorch docker 23.12-py3, registers still spill. Changing the C/D dtype to tf32 to avoid the conversion makes no difference.

[BUG] Crash when enable --tp-comm-overlap

And how do I configure --tp-comm-overlap-cfg?

[QST]Question about Utccpop

OK, thanks. The ScaleFactor in the leader CTA's smem is copied to the leader's tmem, and the ScaleFactor in the peer CTA's smem is copied to the peer's tmem, both triggered by the leader CTA. When...