Tri Dao
It's probably because of a compiler version mismatch: `UserWarning: The detected CUDA version (11.5) has a minor version mismatch with the version that was used to compile PyTorch (11.7).` Can you use...
Can you try the latest version (1.0.6)?
@davyeu FlashAttention doesn't work on V100.
I think this is an issue with ptxas and likely not an issue with the DSL compiler. I've hand-written different versions of the UMMA ptx code and yet ptxas still...
If there's a fix, either with newer ptxas versions, or ptxas options, or a different way to write the ptx that generates good SASS, please lmk. I'm eager to try...
I've tried CTK 12.8.0, 12.8.1, 12.9.0, 12.9.1, all with the same issue. I'll try to write a short self-contained ptx file and post it there.
Btw, I've just tried with nvcc release 13.0, V13.0.48, and the issue is still there. Here's a snippet of the SASS after compiling example `77_blackwell_fmha_fp16`: ``` UMOV UR63, 0x8200010 ;...
> FWIW, at Modular, we found that we could avoid this issue and trigger reuse of the idesc and tmem uniform registers by using inline PTX for the series of...
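(For readers following along: the inline-PTX approach described in the quote above looks roughly like the sketch below. This is my reconstruction in the style of the CUTLASS SM100 UMMA atoms, not Modular's actual code; the function name `umma_f16_inline`, the operand names, and the exact `tcgen05.mma` modifiers are assumptions. It assumes an sm_100a target and a CTK recent enough to know `tcgen05`.)

```cuda
#include <cstdint>

// Hedged sketch, not Modular's code: issue one UMMA through a single inline-PTX
// block so the smem matrix descriptors and the instruction descriptor (idesc)
// are bound straight to registers. Binding them this way is the kind of thing
// that can encourage ptxas to keep and reuse the uniform registers across the
// mainloop instead of re-materializing the constants on every iteration.
__device__ inline void umma_f16_inline(uint32_t tmem_d,  // tmem address of accumulator D
                                       uint64_t desc_a,  // smem matrix descriptor for A
                                       uint64_t desc_b,  // smem matrix descriptor for B
                                       uint32_t idesc,   // UMMA instruction descriptor
                                       uint32_t accum)   // 0: D = A*B, nonzero: D += A*B
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 1000)
  asm volatile(
      "{\n\t"
      ".reg .pred p;\n\t"
      "setp.ne.b32 p, %4, 0;\n\t"
      "tcgen05.mma.cta_group::1.kind::f16 [%0], %1, %2, %3, p;\n\t"
      "}\n"
      :
      : "r"(tmem_d), "l"(desc_a), "l"(desc_b), "r"(idesc), "r"(accum));
#endif
}
```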
Great, hopefully there's a version of cute-dsl that bundles ptxas 13.1 soon.