FasterTransformer
FlashAttention only enabled for GPT-styled models
Saw the code here: https://github.com/NVIDIA/FasterTransformer/blob/f8e42aac45815c5be92c0915b12b9a6652386e8c/src/fastertransformer/layers/attention_layers/BaseAttentionLayer.h#L72
Is there any reason FlashAttention shouldn't be used for encoder-only models?
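For context, my (possibly wrong) reading of that header is that it selects an attention implementation from the model and hardware properties, roughly like the sketch below. The enum values, parameter names, and thresholds here are simplified assumptions for illustration, not the exact FasterTransformer code:

```cpp
// Minimal sketch (not the exact FasterTransformer code) of how such a header
// can pick between unfused and fused (TensorRT/flash) attention kernels.
#include <cstdio>

enum class AttentionType {
    UNFUSED_MHA,         // generic batched-GEMM + softmax path
    UNFUSED_PADDED_MHA,  // same, but keeps padding tokens
    FUSED_MHA,           // trt_fused_multihead_attention kernels
    FUSED_PADDED_MHA
};

// Hypothetical selection helper: the conditions below are illustrative,
// not the library's real checks.
AttentionType getAttentionTypeSketch(int size_per_head, int sm, bool remove_padding,
                                     int max_seq_len, bool causal_mask)
{
    // Fused kernels exist only for certain head sizes / GPU architectures;
    // long sequences with a causal (GPT-style) mask go to the flash variant.
    const bool arch_ok = (sm == 75 || sm == 80 || sm == 86);
    const bool head_ok = (size_per_head == 64);
    const bool len_ok  = (max_seq_len <= 512) || causal_mask;
    if (arch_ok && head_ok && len_ok) {
        return remove_padding ? AttentionType::FUSED_MHA : AttentionType::FUSED_PADDED_MHA;
    }
    return remove_padding ? AttentionType::UNFUSED_MHA : AttentionType::UNFUSED_PADDED_MHA;
}

int main()
{
    const AttentionType t = getAttentionTypeSketch(/*size_per_head=*/64, /*sm=*/80,
                                                   /*remove_padding=*/true,
                                                   /*max_seq_len=*/2048,
                                                   /*causal_mask=*/true);
    std::printf("selected attention type = %d\n", static_cast<int>(t));
    return 0;
}
```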
Why do you think that `FMHA_ENABLE` stands for FlashAttention?
Because eventually it will invoke the kernels here: https://github.com/NVIDIA/FasterTransformer/tree/afdf9a9eb86f15363c0249117d166d6b45dbb371/3rdparty/trt_fused_multihead_attention
Has anyone tested how much performance improves when `FMHA_ENABLE` is enabled?
I don't think `FMHA_ENABLE` stands for FlashAttention; it stands for fused multi-head attention. You can see gpt_guide.md for more information.
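If it helps, the switch is read from the environment at runtime. A minimal sketch of reading such a flag, assuming the documented `export FMHA_ENABLE=ON` usage (the exact parsing inside FasterTransformer may differ):

```cpp
// Sketch of how an FMHA_ENABLE-style environment switch is typically read.
// Assumes the value "ON" enables the fused multi-head attention path.
#include <cstdio>
#include <cstdlib>
#include <cstring>

static bool fmha_enabled()
{
    const char* env = std::getenv("FMHA_ENABLE");
    return env != nullptr && std::strcmp(env, "ON") == 0;
}

int main()
{
    std::printf("fused multi-head attention requested: %s\n",
                fmha_enabled() ? "yes" : "no");
    return 0;
}
```

So enabling it would look like `FMHA_ENABLE=ON ./your_gpt_example` (example command, not a specific binary name from the repo).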
@niyunsheng But if you trace the code, https://github.com/NVIDIA/FasterTransformer/tree/afdf9a9eb86f15363c0249117d166d6b45dbb371/3rdparty/trt_fused_multihead_attention will eventually be called, and the file names there mention flash attention.
fmha means fused multi-head attention; it contains both flash-attention and non-flash-attention kernels, and the selection between them is automatic (see the sketch below).
Encoder-only models also have the fmha kernel, and it is enabled by default.
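A minimal sketch of what "the selection is automatic" means, assuming the decision is driven mainly by sequence length (the real dispatcher also considers head size, data type, and GPU architecture; the function name and threshold below are illustrative only):

```cpp
// Illustrative dispatcher (assumed, simplified): fall back to a flash-attention
// kernel when the sequence length exceeds what the fixed-length fused kernels cover.
#include <cstdio>

enum class FusedKernel { FIXED_SEQ_LEN, FLASH_ATTENTION };

// Hypothetical helper, not the library's actual selection logic.
FusedKernel selectFusedKernel(int seq_len, int max_fixed_seq_len = 512)
{
    return (seq_len <= max_fixed_seq_len) ? FusedKernel::FIXED_SEQ_LEN
                                          : FusedKernel::FLASH_ATTENTION;
}

int main()
{
    for (int seq_len : {128, 384, 1024, 4096}) {
        std::printf("seq_len %4d -> %s\n", seq_len,
                    selectFusedKernel(seq_len) == FusedKernel::FLASH_ATTENTION
                        ? "flash attention kernel"
                        : "fixed-seq-len fused kernel");
    }
    return 0;
}
```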
@byshiue Could I ask why we don't use flash attention for encoder models as well?