Tri Dao
For FP8 we only support the fwd pass for now.
Yes, it's supported.
For gmem copy you just want to issue the instructions and then do other work. Currently I don't think gmem copy is slowing things down? It's possible to pipeline but...
Yeah, thinking more about it, on the 4090 we should be able to get 70%+ tensor core util. The current version (e.g. FA2) might not get there, maybe because our existing...
Do you know what TFLOPS you get with FA2 on the 4090? The 4090's theoretical max is 165 TFLOPS if using an fp32 accumulator and 330 TFLOPS if using an fp16 accumulator. We...
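If you want to check where you land relative to that 165/330 TFLOPS ceiling, here is a minimal timing sketch (the shapes and iteration counts are illustrative assumptions, not from this thread):

```python
# Hedged sketch: measure FA2 forward TFLOPS on a single GPU.
# Assumes flash-attn is installed; shapes are illustrative.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 4, 4096, 16, 128
q, k, v = (torch.randn(batch, seqlen, nheads, headdim,
                       dtype=torch.float16, device="cuda") for _ in range(3))

# Warmup, then time with CUDA events.
for _ in range(5):
    flash_attn_func(q, k, v, causal=True)
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
n_iters = 20
start.record()
for _ in range(n_iters):
    flash_attn_func(q, k, v, causal=True)
end.record()
torch.cuda.synchronize()
ms = start.elapsed_time(end) / n_iters

# Forward attention is two matmuls (QK^T and PV):
# 4 * batch * seqlen^2 * nheads * headdim FLOPs, halved for causal masking.
flops = 4 * batch * seqlen**2 * nheads * headdim / 2
print(f"{flops / (ms * 1e-3) / 1e12:.1f} TFLOPS")
```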
FA3 FP8 code is already public in this repo. Accuracy is an open problem; I don't think the community has a consensus on the best way to quantize. One...
You can try out the scaling you suggest (input in fp16 but cast to fp8 for the matmul) and measure accuracy. This can be done independently of FA3. I don't think...
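A hedged sketch of that experiment (the per-tensor scaling and the fp32 reference attention here are my assumptions, not a prescription; `torch.float8_e4m3fn` needs PyTorch >= 2.1):

```python
# Measure the accuracy impact of casting fp16 inputs to fp8 for the matmuls,
# independent of FA3: round-trip q/k/v through fp8, then compare a reference
# attention against the unquantized baseline.
import torch

def fp8_roundtrip(x: torch.Tensor) -> torch.Tensor:
    # Per-tensor scale into the e4m3 range (max ~448), quantize, dequantize.
    scale = 448.0 / x.float().abs().amax().clamp(min=1e-12)
    return ((x * scale).to(torch.float8_e4m3fn).to(x.dtype)) / scale

def attn_ref(q, k, v):
    # Reference attention in fp32 so only the input quantization differs.
    q, k, v = q.float(), k.float(), v.float()
    s = (q @ k.transpose(-2, -1)) * q.shape[-1] ** -0.5
    return torch.softmax(s, dim=-1) @ v

torch.manual_seed(0)
q, k, v = (torch.randn(1, 8, 2048, 64, dtype=torch.float16, device="cuda")
           for _ in range(3))
out_fp16 = attn_ref(q, k, v)
out_fp8 = attn_ref(fp8_roundtrip(q), fp8_roundtrip(k), fp8_roundtrip(v))
print("max abs err:", (out_fp16 - out_fp8).abs().max().item())
print("rel err:", ((out_fp16 - out_fp8).norm() / out_fp16.norm()).item())
```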
It's just temporary.
The rotary implementation is 1-2 files, written in PyTorch and Triton. You can copy those files.
This file and the one it imports: https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/layers/rotary.py
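For reference, a minimal usage sketch of the copied files; `apply_rotary_emb` and its `(batch, seqlen, nheads, headdim)` layout are taken from that file as I read it, so verify against the version you copy:

```python
import torch
from flash_attn.layers.rotary import apply_rotary_emb

batch, seqlen, nheads, headdim = 2, 1024, 8, 64
q = torch.randn(batch, seqlen, nheads, headdim,
                dtype=torch.float16, device="cuda")

# Precompute cos/sin tables of shape (seqlen, rotary_dim / 2),
# mirroring the standard RoPE frequency schedule with base 10000.
rotary_dim = headdim
inv_freq = 1.0 / (10000.0 ** (
    torch.arange(0, rotary_dim, 2, device="cuda").float() / rotary_dim))
t = torch.arange(seqlen, device="cuda").float()
freqs = torch.outer(t, inv_freq)
cos, sin = freqs.cos().to(q.dtype), freqs.sin().to(q.dtype)

q_rot = apply_rotary_emb(q, cos, sin, interleaved=False)
print(q_rot.shape)  # (2, 1024, 8, 64)
```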