
ValueError: FalconMambaForCausalLM does not support Flash Attention 2.0 yet

Open Cshekar24 opened this issue 1 year ago • 3 comments

I'm facing latency issues when running inference with the Falcon LLM: latency is around 20-30 minutes for a specific use case. I want to reduce this time and found that installing Flash Attention 2 can significantly speed up inference, but it seems this is not supported for Falcon yet.

Request the team/members to look into the issue & address it ASAP.

Cshekar24 avatar Sep 18 '24 10:09 Cshekar24

@Cshekar24 FalconMamba is a full Mamba model. It doesn't use attention at all.

avishaiElmakies avatar Sep 18 '24 11:09 avishaiElmakies
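[Editor's note] To make the distinction concrete, here is a minimal, self-contained sketch, not transformers' actual implementation, of why requesting Flash Attention 2 on an attention-free model raises this error: transformers model classes carry a per-class support flag, and a pure state-space model like FalconMamba has no attention layers for FlashAttention kernels to replace. All names below are illustrative.

```python
# Illustrative sketch only: mirrors the *idea* of transformers' per-model
# Flash Attention 2 support flag, not the library's real code.

class PreTrainedModelSketch:
    _supports_flash_attn_2 = False  # conservative default

    @classmethod
    def from_pretrained_sketch(cls, attn_implementation="eager"):
        # Refuse FA2 for model classes that have not opted in.
        if attn_implementation == "flash_attention_2" and not cls._supports_flash_attn_2:
            raise ValueError(
                f"{cls.__name__} does not support Flash Attention 2.0 yet"
            )
        return cls()


class FalconMambaForCausalLM(PreTrainedModelSketch):
    # A pure state-space (Mamba) model: there are no attention layers,
    # so there is nothing for FlashAttention kernels to accelerate.
    _supports_flash_attn_2 = False


# Loading with the default attention implementation works:
model = FalconMambaForCausalLM.from_pretrained_sketch()

# Requesting FA2 reproduces the error reported in this issue:
try:
    FalconMambaForCausalLM.from_pretrained_sketch(
        attn_implementation="flash_attention_2"
    )
except ValueError as e:
    print(e)  # FalconMambaForCausalLM does not support Flash Attention 2.0 yet
```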

As @avishaiElmakies mentions, Flash Attention can only be applied to models that use attention, which is not the case for FalconMamba. See the announcement blog post: "Welcome FalconMamba: The first strong attention-free 7B model".

Request the team/members to look into the issue & address it ASAP.

There is no issue to address.

LysandreJik avatar Sep 18 '24 14:09 LysandreJik
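[Editor's note] For readers hitting the same latency problem: FalconMamba loads fine as long as you simply do not pass `attn_implementation="flash_attention_2"`. A hedged sketch follows; the model ID comes from the FalconMamba announcement linked above, and the loader is wrapped in a function (with a lazy import) because downloading the 7B checkpoint requires substantial disk space and memory.

```python
# Hypothetical helper, not from this thread: load FalconMamba on the
# default (non-FlashAttention) code path. Inference speedups for Mamba
# models come from their fused selective-scan kernels, not FlashAttention.
MODEL_ID = "tiiuae/falcon-mamba-7b"  # from the linked FalconMamba announcement


def load_falcon_mamba(model_id: str = MODEL_ID):
    # Imported lazily so this sketch can be defined without transformers
    # installed; calling it downloads the full checkpoint.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",  # keep the checkpoint's native dtype
        device_map="auto",   # place layers on available GPU(s); needs accelerate
        # NOTE: deliberately no attn_implementation="flash_attention_2";
        # passing it is exactly what raises the ValueError in this issue.
    )
    return tokenizer, model
```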

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Oct 19 '24 08:10 github-actions[bot]