zhang662817
Explicit total register size: 128 * (64 + 16) * 2 * 4 / 1024.0 = 80K, which is far less than the SM register file's 256K. Is explicit total register size...
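The arithmetic in that comment can be checked with a short sketch. The variable names are hypothetical labels, since the comment gives only the numbers, and reading the final factor of 4 as bytes per 32-bit register is an assumption.

```python
# Reproduce the register-size arithmetic quoted in the comment.
# Names are hypothetical labels; the source gives only the raw factors.
factor_a = 128         # first factor in the quoted expression
factor_b = 64 + 16     # second factor, written as a sum in the comment
factor_c = 2           # third factor
bytes_per_reg = 4      # assumption: 32-bit registers, 4 bytes each

explicit_kib = factor_a * factor_b * factor_c * bytes_per_reg / 1024.0
regfile_kib = 256.0    # SM register file size quoted in the comment

print(explicit_kib)                # 80.0
print(explicit_kib / regfile_kib)  # 0.3125
```

By this count the explicitly allocated registers are under a third of the register file, which is the basis for the question.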
@thakkarV @ANIKET-SHIVAM What's the difference between float and tf32? In CUTLASS, float uses the tf32 tensor core, and tf32 also uses 32 bits of storage in shared memory and the register file, is...
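For context on this question: TF32 keeps float's sign bit and 8-bit exponent but only the top 10 mantissa bits, while still occupying a 32-bit container. The sketch below emulates that on the host by clearing the low 13 mantissa bits of a float32 encoding. The function name `to_tf32_truncated` is hypothetical, and plain truncation is an assumption made for simplicity; hardware may round differently.

```python
import struct

def to_tf32_truncated(x: float) -> float:
    """Keep the sign, 8 exponent bits and top 10 mantissa bits of x's float32 encoding."""
    # Reinterpret the float32 encoding of x as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Assumption: truncate (not round) by clearing the low 13 mantissa bits.
    bits &= 0xFFFFE000
    # Reinterpret the masked bits as a float32 again.
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# 2**-10 fits in a 10-bit mantissa, 2**-11 does not.
print(to_tf32_truncated(1.0 + 2**-10) == 1.0 + 2**-10)  # True
print(to_tf32_truncated(1.0 + 2**-11) == 1.0)           # True
```

The storage footprint is the same as float in this emulation; only the precision that survives differs.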
@thakkarV Env: cuda 12.2; pytorch docker: 23.10-py3. From gmem to smem, TMA does data conversion, right? From Acc to output gmem/smem in epilogue, data was converted in the loop, only...
In CUDA 12.3 (pytorch docker: 23.12-py3) there is still register spill. Changing the C/D dtype to tf32 to avoid the conversion made no difference.
And how do I configure --tp-comm-overlap-cfg?
Ok, thanks. ScaleFactor in leader cta smem is copied to leader tmem and ScaleFactor in peer cta smem is copied to peer tmem, and both copies are triggered by the leader cta. When...