Tri Dao

639 comments by Tri Dao

Do you have any insight into "template specialization gone wrong"?

The Triton implementation is experimental; I did see some race conditions from the Triton compiler on the backward pass (see comments in the source code) that I tried to fix....

The latest version supports A100 now

Make sure you remove the previously installed package before reinstalling. E.g. for me it's `rm -rf /usr/local/lib/python3.12/dist-packages/flash_attn-3.0*` but that depends on your machine
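One rough way to find what to remove (just a sketch, assuming the package metadata is registered under `flash-attn`; adjust for your environment):

```python
# Sketch: locate the installed flash_attn distribution so you know which
# directory to remove before reinstalling (the path varies by machine).
import importlib.metadata

dist = importlib.metadata.distribution("flash_attn")  # name normalized to flash-attn
print(dist.version)                     # currently installed version
print(dist.locate_file("flash_attn"))   # package directory under site-/dist-packages
```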

`main` branch. Your issue seems to be that it's running an old version. Latest on `main` doesn't have `TORCH_CHECK(is_sm9x, "FlashAttentionHopper only supports Hopper GPUs or newer.")` anymore.
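A quick sanity check of what's actually being picked up at runtime (rough sketch; look at `__version__` and `__file__`):

```python
# Sketch: confirm which flash_attn build Python actually imports.
import flash_attn

print(flash_attn.__version__)  # should match the version you just built/installed
print(flash_attn.__file__)     # if this points at a stale install, remove that directory
```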

FA3 Ampere isn't much faster than FA2 on A100, since FA2 already gets close to peak performance. FA3 Ampere is a bit faster, with more features (packGQA for decoding, arbitrary...

No, what's the TFLOPS that the attn kernel is getting, out of a theoretical max of 312 TFLOPS (bf16)? If it's getting 60-70% of the theoretical max, there's not much to...
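Roughly what that calculation looks like (the shapes and timing below are made up; it assumes the usual 4 * batch * heads * seqlen^2 * headdim FLOP count for non-causal forward attention, about half that for causal):

```python
# Sketch: back out achieved TFLOPS from a measured kernel time and compare
# against the A100 bf16 peak of 312 TFLOPS. All numbers are placeholders.
batch, nheads, seqlen, headdim = 8, 16, 4096, 128   # hypothetical problem size
time_s = 5.4e-3                                     # measured forward kernel time (placeholder)

flops = 4 * batch * nheads * seqlen**2 * headdim    # non-causal forward; ~halve for causal
tflops = flops / time_s / 1e12
print(f"{tflops:.0f} TFLOPS = {100 * tflops / 312:.0f}% of A100 bf16 peak")
```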

Yes, that's right. But note that the kernel won't touch the output memory of the padding tokens, so the output for the padding tokens will be uninitialized (it could contain...

If you need to, you can zero out parts that are not initialized in the output and grad (i.e. padding tokens) yourself. This API isn't really designed for padding tokens...
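Something along these lines, assuming a padded (batch, seqlen, nheads, headdim) layout and a boolean mask that is True at padding positions (both names are just for illustration):

```python
# Sketch: zero out the rows the kernel left uninitialized (padding tokens).
import torch

def zero_padding(t: torch.Tensor, padding_mask: torch.Tensor) -> torch.Tensor:
    # t: (batch, seqlen, nheads, headdim); padding_mask: (batch, seqlen), True at padding
    return t.masked_fill(padding_mask[:, :, None, None], 0.0)

batch, seqlen, nheads, headdim = 2, 8, 4, 64
out = torch.randn(batch, seqlen, nheads, headdim)            # stand-in for the attn output
padding_mask = torch.zeros(batch, seqlen, dtype=torch.bool)
padding_mask[:, 6:] = True                                   # last two positions are padding
out = zero_padding(out, padding_mask)                        # do the same for grads if needed
```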