Tri Dao
For FP8 we only support the fwd pass for now.
Yes, it's supported.
For gmem copy you just want to issue the instructions and then do other work. Currently I don't think gmem copy is slowing things down? It's possible to pipeline but...
Yeah, thinking more about it, on the 4090 we should be able to get 70%+ tensor core util. The current version (e.g. FA2) might not get there, maybe because our existing...
Do you know what TFLOPS you get with FA2 on the 4090? The 4090's theoretical max is 165 TFLOPS if using an fp32 accumulator and 330 TFLOPS if using an fp16 accumulator. We...
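If you want to check where you land relative to that 165/330 TFLOPS ceiling, here is a minimal timing sketch (the shapes and iteration counts are illustrative assumptions, not from this thread):

```python
# Hedged sketch: measure FA2 forward TFLOPS on a single GPU.
# Assumes flash-attn is installed; shapes are illustrative.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 4, 4096, 16, 128
q, k, v = (torch.randn(batch, seqlen, nheads, headdim,
                       dtype=torch.float16, device="cuda") for _ in range(3))

# Warmup, then time with CUDA events.
for _ in range(5):
    flash_attn_func(q, k, v, causal=True)
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
n_iters = 20
start.record()
for _ in range(n_iters):
    flash_attn_func(q, k, v, causal=True)
end.record()
torch.cuda.synchronize()
ms = start.elapsed_time(end) / n_iters

# Forward attention is two matmuls (QK^T and PV):
# 4 * batch * seqlen^2 * nheads * headdim FLOPs, halved for causal masking.
flops = 4 * batch * seqlen**2 * nheads * headdim / 2
print(f"{flops / (ms * 1e-3) / 1e12:.1f} TFLOPS")
```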
FA3 FP8 code is already public in this repo. Accuracy is an open problem; I don't think the community has a consensus on the best way to quantize. One...
You can try out the scaling you suggest (input in fp16 but cast to fp8 for the matmul) and measure accuracy. This can be done independently of FA3. I don't think...
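A hedged sketch of that experiment (the per-tensor scaling and the fp32 reference attention here are my assumptions, not a prescription; `torch.float8_e4m3fn` needs PyTorch >= 2.1):

```python
# Measure the accuracy impact of casting fp16 inputs to fp8 for the matmuls,
# independent of FA3: round-trip q/k/v through fp8, then compare a reference
# attention against the unquantized baseline.
import torch

def fp8_roundtrip(x: torch.Tensor) -> torch.Tensor:
    # Per-tensor scale into the e4m3 range (max ~448), quantize, dequantize.
    scale = 448.0 / x.float().abs().amax().clamp(min=1e-12)
    return ((x * scale).to(torch.float8_e4m3fn).to(x.dtype)) / scale

def attn_ref(q, k, v):
    # Reference attention in fp32 so only the input quantization differs.
    q, k, v = q.float(), k.float(), v.float()
    s = (q @ k.transpose(-2, -1)) * q.shape[-1] ** -0.5
    return torch.softmax(s, dim=-1) @ v

torch.manual_seed(0)
q, k, v = (torch.randn(1, 8, 2048, 64, dtype=torch.float16, device="cuda")
           for _ in range(3))
out_fp16 = attn_ref(q, k, v)
out_fp8 = attn_ref(fp8_roundtrip(q), fp8_roundtrip(k), fp8_roundtrip(v))
print("max abs err:", (out_fp16 - out_fp8).abs().max().item())
print("rel err:", ((out_fp16 - out_fp8).norm() / out_fp16.norm()).item())
```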
It's just temporary.
The rotary implementation is 1-2 files, written in PyTorch and Triton. You can copy those files.
This file and the one it imports: https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/layers/rotary.py
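For reference, a minimal usage sketch of the copied files; `apply_rotary_emb` and its `(batch, seqlen, nheads, headdim)` layout are taken from that file as I read it, so verify against the version you copy:

```python
import torch
from flash_attn.layers.rotary import apply_rotary_emb

batch, seqlen, nheads, headdim = 2, 1024, 8, 64
q = torch.randn(batch, seqlen, nheads, headdim,
                dtype=torch.float16, device="cuda")

# Precompute cos/sin tables of shape (seqlen, rotary_dim / 2),
# mirroring the standard RoPE frequency schedule with base 10000.
rotary_dim = headdim
inv_freq = 1.0 / (10000.0 ** (
    torch.arange(0, rotary_dim, 2, device="cuda").float() / rotary_dim))
t = torch.arange(seqlen, device="cuda").float()
freqs = torch.outer(t, inv_freq)
cos, sin = freqs.cos().to(q.dtype), freqs.sin().to(q.dtype)

q_rot = apply_rotary_emb(q, cos, sin, interleaved=False)
print(q_rot.shape)  # (2, 1024, 8, 64)
```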