zhang662817

Results 6 comments of zhang662817

![image](https://github.com/NVIDIA/cutlass/assets/20987824/7229bc68-e352-4a63-a060-1e63d3ed0a34) Explicit total register size: 128 * (64 + 16) * 2 * 4 / 1024.0 = 80K, which is far less than the SM register file (256K). Is explicit total register size...
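For reference, the estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic as written in the comment: the excerpt does not say what each factor stands for, so the code does not name them. The 256K figure is assumed here to mean a register file of 64K 32-bit registers per SM.

```python
# Reproduce the estimate from the comment. The meaning of the individual
# factors is not stated in the excerpt, so they are left unnamed.
explicit_reg_kib = 128 * (64 + 16) * 2 * 4 / 1024.0

# Assumption: the "256K" register file is 64K registers of 4 bytes each.
sm_regfile_kib = 65536 * 4 / 1024.0

# Share of the register file that the explicit estimate accounts for.
fraction = explicit_reg_kib / sm_regfile_kib

print(f"explicit: {explicit_reg_kib} KiB, regfile: {sm_regfile_kib} KiB, "
      f"fraction: {fraction:.4f}")
```

Under these assumptions the explicit estimate covers a bit under a third of the register file, which matches the "far less" in the comment.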

@thakkarV @ANIKET-SHIVAM What's the difference between float and tf32? In CUTLASS, float uses the tf32 tensor cores, and tf32 also uses 32 bits of storage in shared memory and the register file, is...
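As background for the question, TF32 keeps the sign bit and 8 exponent bits of an IEEE float32 but only the top 10 mantissa bits, while still occupying a 32-bit container. The sketch below emulates that precision loss in plain Python. The helper `truncate_to_tf32` is hypothetical and is not a CUTLASS or CUDA API, and it truncates rather than rounds, which is a simplification of what hardware conversion may do.

```python
import struct


def truncate_to_tf32(x: float) -> float:
    """Hypothetical helper: drop a float32 value to TF32 precision.

    Keeps the sign, the 8 exponent bits and the top 10 mantissa bits,
    and clears the low 13 mantissa bits (truncation, not rounding).
    """
    # Reinterpret the float32 encoding of x as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Clear the 13 least significant mantissa bits.
    bits &= 0xFFFFE000
    # Reinterpret the masked bits as a float32 again.
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# 2**-10 is the last mantissa bit TF32 keeps; 2**-11 falls below it.
kept = truncate_to_tf32(1.0 + 2.0 ** -10)
dropped = truncate_to_tf32(1.0 + 2.0 ** -11)
print(kept, dropped)
```

The storage footprint is the same in both cases; only the number of mantissa bits that take part in the computation differs.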

@thakkarV Env: CUDA 12.2; PyTorch docker: 23.10-py3. From gmem to smem, TMA does the data conversion, right? From the accumulator to the output gmem/smem in the epilogue, data was converted in the loop, only...

With CUDA 12.3 and PyTorch docker 23.12-py3, registers still spill. Changing the C/D dtype to tf32 to avoid the conversion made no difference.

And how do I configure --tp-comm-overlap-cfg?

Ok, thanks. The ScaleFactor in the leader CTA's smem is copied to the leader's tmem, and the ScaleFactor in the peer CTA's smem is copied to the peer's tmem; both copies are triggered by the leader CTA. When...