Tri Dao
It's probably because of a compiler version mismatch: `UserWarning: The detected CUDA version (11.5) has a minor version mismatch with the version that was used to compile PyTorch (11.7).` Can you use...
Can you try the latest version (1.0.6)?
@davyeu FlashAttention doesn't work on V100.
I think this is an issue with ptxas and likely not an issue with the DSL compiler. I've hand-written different versions of the UMMA ptx code and yet ptxas still...
If there's a fix, either with newer ptxas versions, or ptxas options, or a different way to write the ptx that generates good SASS, please lmk. I'm eager to try...
I've tried CTK 12.8.0, 12.8.1, 12.9.0, 12.9.1, all with the same issue. I'll try to write a short self-contained ptx file and post it there.
Btw, I've just tried with nvcc release 13.0, V13.0.48, and the issue is still there. Here's a snippet of the SASS after compiling example `77_blackwell_fmha_fp16`: ``` UMOV UR63, 0x8200010 ;...
> FWIW, at Modular, we found that we could avoid this issue and trigger reuse of the idesc and tmem uniform registers by using inline PTX for the series of...
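(For readers following along: the inline-PTX approach described in the quote above looks roughly like the sketch below. This is my reconstruction in the style of the CUTLASS SM100 UMMA atoms, not Modular's actual code; the function name `umma_f16_inline`, the operand names, and the exact `tcgen05.mma` modifiers are assumptions. It assumes an sm_100a target and a CTK recent enough to know `tcgen05`.)

```cuda
#include <cstdint>

// Hedged sketch, not Modular's code: issue one UMMA through a single inline-PTX
// block so the smem matrix descriptors and the instruction descriptor (idesc)
// are bound straight to registers. Binding them this way is the kind of thing
// that can encourage ptxas to keep and reuse the uniform registers across the
// mainloop instead of re-materializing the constants on every iteration.
__device__ inline void umma_f16_inline(uint32_t tmem_d,  // tmem address of accumulator D
                                       uint64_t desc_a,  // smem matrix descriptor for A
                                       uint64_t desc_b,  // smem matrix descriptor for B
                                       uint32_t idesc,   // UMMA instruction descriptor
                                       uint32_t accum)   // 0: D = A*B, nonzero: D += A*B
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 1000)
  asm volatile(
      "{\n\t"
      ".reg .pred p;\n\t"
      "setp.ne.b32 p, %4, 0;\n\t"
      "tcgen05.mma.cta_group::1.kind::f16 [%0], %1, %2, %3, p;\n\t"
      "}\n"
      :
      : "r"(tmem_d), "l"(desc_a), "l"(desc_b), "r"(idesc), "r"(accum));
#endif
}
```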
Great, hopefully there's a version of cute-dsl that bundles ptxas 13.1 soon.