
flashattention only enabled for gpt-styled models

Open flexwang opened this issue 1 year ago • 7 comments

I saw this code here: https://github.com/NVIDIA/FasterTransformer/blob/f8e42aac45815c5be92c0915b12b9a6652386e8c/src/fastertransformer/layers/attention_layers/BaseAttentionLayer.h#L72

Is there any reason FlashAttention shouldn't be used for encoder-only models?

flexwang avatar Aug 31 '23 04:08 flexwang

Why do you think that FMHA_ENABLE stands for FlashAttention?

niyunsheng avatar Sep 09 '23 16:09 niyunsheng

Because it eventually ends up invoking the kernels here: https://github.com/NVIDIA/FasterTransformer/tree/afdf9a9eb86f15363c0249117d166d6b45dbb371/3rdparty/trt_fused_multihead_attention

flexwang avatar Sep 10 '23 02:09 flexwang

Has anyone tested how much performance improves when FMHA_ENABLE is enabled?

jiangsongHW avatar Sep 14 '23 01:09 jiangsongHW

I don't think FMHA_ENABLE stands for FlashAttention; it stands for fused multi-head attention.

[screenshot from gpt_guide.md]

You can see gpt_guide.md for more information.
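
For illustration only, here is a minimal sketch of how an FMHA_ENABLE-style environment toggle could be read and used to gate the fused attention path. This is not FasterTransformer's actual implementation; the variable name FMHA_ENABLE comes from this thread and gpt_guide.md, while the helper function below is hypothetical.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical helper: read an FMHA_ENABLE-style environment variable and
// decide whether the fused multi-head attention path may be used. The real
// behavior is documented in gpt_guide.md; this only illustrates the idea of
// an environment-variable gate.
static bool isFmhaEnabled()
{
    const char* env = std::getenv("FMHA_ENABLE");
    return env != nullptr && std::string(env) == "ON";  // e.g. export FMHA_ENABLE=ON
}

int main()
{
    std::printf("fused MHA %s\n", isFmhaEnabled() ? "enabled" : "disabled");
    return 0;
}
```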

niyunsheng avatar Sep 14 '23 01:09 niyunsheng

@niyunsheng But if you trace the code, https://github.com/NVIDIA/FasterTransformer/tree/afdf9a9eb86f15363c0249117d166d6b45dbb371/3rdparty/trt_fused_multihead_attention ends up being called, and the file names there include "flash attention".

flexwang avatar Sep 15 '23 04:09 flexwang

FMHA means fused multi-head attention; it contains both FlashAttention kernels and non-FlashAttention kernels, and the selection between them is automatic. Encoder-only models also have an FMHA kernel, and it is enabled by default.
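
To make the "automatic selection" concrete, here is a purely illustrative sketch, not the actual dispatch in 3rdparty/trt_fused_multihead_attention: a non-flash fused kernel for short, supported sequence lengths, a flash-attention kernel for longer ones, and an unfused fallback otherwise. The thresholds, head-size check, and function names below are assumptions for illustration; the real conditions also depend on GPU architecture and data type.

```cpp
#include <cstdio>

// Illustrative-only model of the FMHA kernel choice described above.
enum class FmhaKernel { NonFlashFused, FlashAttention, UnfusedFallback };

static FmhaKernel selectFmhaKernel(int seq_len, int size_per_head, bool fmha_enabled)
{
    if (!fmha_enabled) {
        return FmhaKernel::UnfusedFallback;  // FMHA disabled: plain (unfused) attention
    }
    if (size_per_head != 64) {
        return FmhaKernel::UnfusedFallback;  // hypothetical head-size restriction
    }
    if (seq_len <= 512) {
        return FmhaKernel::NonFlashFused;    // short sequences: classic fused kernel
    }
    return FmhaKernel::FlashAttention;       // long sequences: flash-attention kernel
}

int main()
{
    std::printf("seq_len=128  -> kernel %d\n", static_cast<int>(selectFmhaKernel(128, 64, true)));
    std::printf("seq_len=2048 -> kernel %d\n", static_cast<int>(selectFmhaKernel(2048, 64, true)));
    return 0;
}
```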

byshiue avatar Oct 20 '23 07:10 byshiue

@byshiue Can I ask why FlashAttention isn't used for encoder models as well?

flexwang2 avatar Oct 21 '23 01:10 flexwang2