[Feature]: FlashAttention 3 support
As you know, FlashAttention 3 promises ~1.5x improvements. Is there any plan for support? Thanks! https://github.com/Dao-AILab/flash-attention/commit/7ef24848cf2f855077cef88fe122775b727dcd74
@byshiue @nv-guomingz @nv-hwoo @juney-nvidia @AdamzNV @kaiyux @Shixiaowei02
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
Is this feature under development? Has there been any progress?
It's currently under consideration, but development hasn't begun yet.
@AdamzNV I don't understand how an algorithm you've already implemented the 2nd version of, and that provides a 1.5x speed boost, can still be under "consideration".
@avianion FA3 is nothing new except for utilizing Hopper features (i.e., warp-specialized kernels with TMA + GMMA), and these have already been implemented in TRT-LLM since the first public release.
- We have also run some benchmarks, which show that TRT-LLM FMHA kernels are faster in most cases (especially for longer sequence lengths). You are free to compare the performance yourself, and let us know if you find cases where TRT-LLM is much worse. Thanks.
- Besides, we have a much faster FP8 FMHA implementation (note that our implementation still uses per-tensor scales, so it might not be a fair comparison).
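For anyone unfamiliar with the term, here is a minimal PyTorch sketch of what "per-tensor scale" means for FP8 (E4M3): a single scale for the whole tensor, derived from its absolute max. This is only an illustration, not the TRT-LLM kernel code, and it assumes PyTorch >= 2.1 so that `torch.float8_e4m3fn` is available.

```python
# Illustrative sketch of per-tensor FP8 (E4M3) scaling for attention inputs.
# NOT the TRT-LLM implementation; assumes PyTorch >= 2.1 (torch.float8_e4m3fn).
import torch

def quantize_per_tensor_fp8(x: torch.Tensor):
    """Quantize a whole tensor with one scale (per-tensor scaling)."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448 for E4M3
    scale = x.abs().amax().float().clamp(min=1e-12) / fp8_max
    x_fp8 = (x.float() / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Cast back to fp16 and re-apply the single per-tensor scale."""
    return x_fp8.to(torch.float16) * scale

# Example: quantize a Q tensor and check the round-trip error.
q = torch.randn(1, 16, 1024, 128, dtype=torch.float16)
q_fp8, q_scale = quantize_per_tensor_fp8(q)
print(q_scale.item(), (dequantize_fp8(q_fp8, q_scale) - q).abs().max().item())
```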
@PerkzZheng Thank you for your insights. Can you please share an open-source link for the benchmarking results you mentioned above? Also, when I compare the TRT-LLM FMHA kernel (fmha_v2_flash_attention_fp16_64_64_S_qkv_256_causal_tma_ws_sm90_kernel) for input=16384, BS=1, the TFLOP/sec is almost the same as the original FA V3 on H100. Can you please share a simple script to benchmark the TRT-LLM FMHA kernel, or some insight into whether I missed anything? Thank you! @byshiue @QiJune @AdamzNV
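For reference, this is roughly the TFLOP/s accounting I'm using. It times PyTorch's `scaled_dot_product_attention` as a stand-in kernel (not the TRT-LLM fmha_v2 kernel, which I assume has to be driven through TRT-LLM's own benchmark tooling); the head count and head_dim here are illustrative assumptions.

```python
# Hedged sketch of attention TFLOP/s accounting on an H100 (any CUDA GPU works).
# Uses PyTorch SDPA as a stand-in kernel; assumes PyTorch >= 2.0 with CUDA.
import time
import torch
import torch.nn.functional as F

def bench_attention_tflops(batch=1, heads=16, seqlen=16384, head_dim=128,
                           causal=True, iters=10):
    q, k, v = (torch.randn(batch, heads, seqlen, head_dim,
                           device="cuda", dtype=torch.float16)
               for _ in range(3))
    # Warm-up, then timed runs.
    for _ in range(3):
        F.scaled_dot_product_attention(q, k, v, is_causal=causal)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        F.scaled_dot_product_attention(q, k, v, is_causal=causal)
    torch.cuda.synchronize()
    seconds = (time.perf_counter() - start) / iters
    # 2*S*S*D FLOPs for Q@K^T plus 2*S*S*D for P@V, per head and batch;
    # a causal mask skips roughly half of that work.
    flops = 4 * batch * heads * seqlen * seqlen * head_dim
    if causal:
        flops //= 2
    return flops / seconds / 1e12

if __name__ == "__main__":
    print(f"{bench_attention_tflops():.1f} TFLOP/s")
```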
@usajid14 you might want to try head size 128/256, which should have better performance, but FA3 might have been updated since then, so the numbers we collected are a bit out of date. Let me know if there are cases where the TRT-LLM kernels are much worse. Thanks.
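With the stand-in benchmark sketched above (`bench_attention_tflops` is the illustrative helper from the previous comment, not a TRT-LLM API), a head-size sweep would look something like:

```python
# Sweep head sizes with the illustrative stand-in benchmark from above.
for head_dim in (64, 128, 256):
    tflops = bench_attention_tflops(seqlen=16384, head_dim=head_dim)
    print(f"head_dim={head_dim}: {tflops:.1f} TFLOP/s")
```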