
Compatibility of Flash Attention 3 FP8 Feature with L40 and A100 GPUs

Open feifeibear opened this issue 1 year ago • 2 comments

Thanks for open-sourcing FA3, good job! I am wondering about the FP8 feature.

Compatibility: Are the NVIDIA L40 and A100 GPUs compatible with the Flash Attention 3 FP8 feature?

Performance: What are the expected performance gains or trade-offs when using Flash Attention 3 FP8 on these GPUs?

Implementation: Is there any specific implementation or software requirement to enable Flash Attention 3 FP8 on L40 and A100 GPUs?

feifeibear avatar Jul 12 '24 07:07 feifeibear

FA3 seems to be designed for the Hopper architecture (H100), so A100 would not see a performance boost. FP8 is also not natively supported on A100.

samsja avatar Jul 22 '24 09:07 samsja
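For context, FA3's FP8 path targets Hopper (compute capability 9.0), while A100 reports sm_80 and L40 reports sm_89 (Ada). A runtime capability check can gate kernel selection along these lines; this is an illustrative sketch, not the library's actual dispatch logic:

```python
import torch

def supports_fa3_fp8(device: int = 0) -> bool:
    """Illustrative gate: FA3's FP8 kernels target Hopper (compute capability 9.0).
    A100 reports (8, 0) and L40 reports (8, 9), so both return False here."""
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability(device)
    return (major, minor) >= (9, 0)

print(supports_fa3_fp8())  # False on A100 (sm_80) and L40 (sm_89), True on H100 (sm_90)
```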


Maybe Flash Attention 3 FP8 will be supported on the 4090?

songh11 avatar Jul 22 '24 09:07 songh11

Is it possible to apply warp-specialized software pipelining scheme on A100?

KyeeHuang avatar Aug 12 '24 07:08 KyeeHuang

It's not commonly done. FA2 is already close to optimal on A100 (~70% of max theoretical FLOPS).

tridao avatar Aug 12 '24 08:08 tridao
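That utilization figure can be reproduced with a simple microbenchmark: the non-causal attention forward pass costs roughly 4·B·H·S²·D FLOPs (two matmuls of 2·B·H·S²·D each), and A100's peak dense FP16 tensor-core throughput is 312 TFLOPS. A rough sketch against the flash_attn package (shapes and iteration counts here are arbitrary):

```python
import time
import torch
from flash_attn import flash_attn_func

B, S, H, D = 4, 4096, 16, 64  # arbitrary benchmark shape
q, k, v = (torch.randn(B, S, H, D, dtype=torch.float16, device="cuda") for _ in range(3))

# Warm up, then time the forward pass.
for _ in range(10):
    flash_attn_func(q, k, v)
torch.cuda.synchronize()
t0 = time.time()
iters = 100
for _ in range(iters):
    flash_attn_func(q, k, v)
torch.cuda.synchronize()
elapsed = (time.time() - t0) / iters

flops = 4 * B * H * S * S * D       # QK^T plus PV, 2*M*N*K FLOPs each
achieved = flops / elapsed / 1e12   # TFLOPS
print(f"{achieved:.0f} TFLOPS, {achieved / 312 * 100:.0f}% of A100 peak (312 TFLOPS FP16)")
```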

Well, for some other GPUs (such as AMD GPUs or other manufacturers' GPUs) that lack most of the Hopper architecture's features (e.g. TMA, WGMMA, or FP8): if I want to optimize or design a specific flash-attention implementation, is it better or easier to follow FA3's warp-specialized approach? Or is it not necessary?

KyeeHuang avatar Aug 12 '24 08:08 KyeeHuang

Warp specialization will be difficult without the async features. Overlapping GEMM and softmax would still be useful.

tridao avatar Aug 12 '24 08:08 tridao
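The overlap mentioned here is a scheduling change inside the kernel's main loop: issue the QK^T GEMM for block j+1 on the tensor cores while the slower exponentials of the softmax for block j run on the CUDA cores. The sequential NumPy sketch below only illustrates the reordered dependency structure (NumPy has no actual concurrency); the function name and block layout are made up for illustration:

```python
import numpy as np

def flash_attn_pipelined(Q, K_blocks, V_blocks):
    """Sequential sketch of FA3-style inner-loop reordering: the score GEMM
    for block j+1 is computed before the softmax work for block j. On Hopper
    that GEMM would be an async WGMMA overlapping the softmax."""
    d = Q.shape[1]
    sm_scale = 1.0 / np.sqrt(d)
    m = np.full(Q.shape[0], -np.inf)      # running row max
    l = np.zeros(Q.shape[0])              # running softmax denominator
    O = np.zeros_like(Q)                  # unnormalized output accumulator
    S_cur = Q @ K_blocks[0].T * sm_scale  # prologue: first score GEMM
    for j in range(len(K_blocks)):
        # "Prefetched" GEMM for the next block: the work FA3 issues
        # asynchronously so it runs concurrently with the softmax below.
        S_next = Q @ K_blocks[j + 1].T * sm_scale if j + 1 < len(K_blocks) else None
        # Online softmax and rescale for the current block (CUDA-core work).
        m_new = np.maximum(m, S_cur.max(axis=1))
        P = np.exp(S_cur - m_new[:, None])
        scale = np.exp(m - m_new)
        l = scale * l + P.sum(axis=1)
        O = scale[:, None] * O + P @ V_blocks[j]  # second GEMM of block j
        m, S_cur = m_new, S_next
    return O / l[:, None]

# Quick numerical check against a reference softmax attention:
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((128, 64)), rng.standard_normal((256, 64)), rng.standard_normal((256, 64))
S_ref = Q @ K.T / np.sqrt(64)
ref = np.exp(S_ref - S_ref.max(1, keepdims=True)) @ V / np.exp(S_ref - S_ref.max(1, keepdims=True)).sum(1, keepdims=True)
assert np.allclose(flash_attn_pipelined(Q, np.split(K, 4), np.split(V, 4)), ref)
```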

So does it work on the L40?

asahni-sc avatar Jan 30 '25 17:01 asahni-sc