Tri Dao

639 comments by Tri Dao

Do you have any insight into "template specialization gone wrong"?

The Triton implementation is experimental; I did see some race conditions from the Triton compiler on the backward pass (see comments in the source code) that I tried to fix....

The latest version supports A100 now

Make sure you remove the previously installed package before reinstalling. E.g. for me it's `rm -rf /usr/local/lib/python3.12/dist-packages/flash_attn-3.0*` but that depends on your machine
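One rough way to find what to remove (just a sketch, assuming the package metadata is registered under `flash-attn`; adjust for your environment):

```python
# Sketch: locate the installed flash_attn distribution so you know which
# directory to remove before reinstalling (the path varies by machine).
import importlib.metadata

dist = importlib.metadata.distribution("flash_attn")  # name normalized to flash-attn
print(dist.version)                     # currently installed version
print(dist.locate_file("flash_attn"))   # package directory under site-/dist-packages
```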

`main` branch. Your issue seems to be that it's running an old version. Latest on `main` doesn't have `TORCH_CHECK(is_sm9x, "FlashAttentionHopper only supports Hopper GPUs or newer.")` anymore.
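A quick sanity check of what's actually being picked up at runtime (rough sketch; look at `__version__` and `__file__`):

```python
# Sketch: confirm which flash_attn build Python actually imports.
import flash_attn

print(flash_attn.__version__)  # should match the version you just built/installed
print(flash_attn.__file__)     # if this points at a stale install, remove that directory
```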

FA3 Ampere isn't much faster than FA2 on A100, since FA2 already gets close to peak performance. FA3 Ampere is a bit faster, with more features (packGQA for decoding, arbitrary...

No, what's the TFLOPS that the attn kernel is getting, out of a theoretical max of 312 TFLOPS (bf16)? If it's getting 60-70% of the theoretical max, there's not much to...
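Roughly what that calculation looks like (the shapes and timing below are made up; it assumes the usual 4 * batch * heads * seqlen^2 * headdim FLOP count for non-causal forward attention, about half that for causal):

```python
# Sketch: back out achieved TFLOPS from a measured kernel time and compare
# against the A100 bf16 peak of 312 TFLOPS. All numbers are placeholders.
batch, nheads, seqlen, headdim = 8, 16, 4096, 128   # hypothetical problem size
time_s = 5.4e-3                                     # measured forward kernel time (placeholder)

flops = 4 * batch * nheads * seqlen**2 * headdim    # non-causal forward; ~halve for causal
tflops = flops / time_s / 1e12
print(f"{tflops:.0f} TFLOPS = {100 * tflops / 312:.0f}% of A100 bf16 peak")
```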

Yes, that's right. But note that the kernel won't touch the output memory of the padding tokens, so the output for the padding tokens will be uninitialized (it could contain...

If you need to, you can zero out parts that are not initialized in the output and grad (i.e. padding tokens) yourself. This API isn't really designed for padding tokens...
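Something along these lines, assuming a padded (batch, seqlen, nheads, headdim) layout and a boolean mask that is True at padding positions (both names are just for illustration):

```python
# Sketch: zero out the rows the kernel left uninitialized (padding tokens).
import torch

def zero_padding(t: torch.Tensor, padding_mask: torch.Tensor) -> torch.Tensor:
    # t: (batch, seqlen, nheads, headdim); padding_mask: (batch, seqlen), True at padding
    return t.masked_fill(padding_mask[:, :, None, None], 0.0)

batch, seqlen, nheads, headdim = 2, 8, 4, 64
out = torch.randn(batch, seqlen, nheads, headdim)            # stand-in for the attn output
padding_mask = torch.zeros(batch, seqlen, dtype=torch.bool)
padding_mask[:, 6:] = True                                   # last two positions are padding
out = zero_padding(out, padding_mask)                        # do the same for grads if needed
```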