
ValueError: FalconMambaForCausalLM does not support Flash Attention 2.0 yet

Open Cshekar24 opened this issue 1 year ago • 3 comments

I'm facing latency issues when running inference with the Falcon LLM: latency is around 20-30 minutes for a specific use case. I want to reduce this time and found that installing Flash Attention 2 can significantly speed up inference, but it seems this is not supported for Falcon yet.

Request the team/members to look into the issue & address it ASAP.

Cshekar24 avatar Sep 18 '24 10:09 Cshekar24

@Cshekar24 FalconMamba is a full Mamba model. It doesn't use attention at all.

avishaiElmakies avatar Sep 18 '24 11:09 avishaiElmakies
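[Editor's note] To make the distinction concrete, here is a minimal, self-contained sketch, not transformers' actual implementation, of why requesting Flash Attention 2 on an attention-free model raises this error: transformers model classes carry a per-class support flag, and a pure state-space model like FalconMamba has no attention layers for FlashAttention kernels to replace. All names below are illustrative.

```python
# Illustrative sketch only: mirrors the *idea* of transformers' per-model
# Flash Attention 2 support flag, not the library's real code.

class PreTrainedModelSketch:
    _supports_flash_attn_2 = False  # conservative default

    @classmethod
    def from_pretrained_sketch(cls, attn_implementation="eager"):
        # Refuse FA2 for model classes that have not opted in.
        if attn_implementation == "flash_attention_2" and not cls._supports_flash_attn_2:
            raise ValueError(
                f"{cls.__name__} does not support Flash Attention 2.0 yet"
            )
        return cls()


class FalconMambaForCausalLM(PreTrainedModelSketch):
    # A pure state-space (Mamba) model: there are no attention layers,
    # so there is nothing for FlashAttention kernels to accelerate.
    _supports_flash_attn_2 = False


# Loading with the default attention implementation works:
model = FalconMambaForCausalLM.from_pretrained_sketch()

# Requesting FA2 reproduces the error reported in this issue:
try:
    FalconMambaForCausalLM.from_pretrained_sketch(
        attn_implementation="flash_attention_2"
    )
except ValueError as e:
    print(e)  # FalconMambaForCausalLM does not support Flash Attention 2.0 yet
```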

As @avishaiElmakies mentions, Flash Attention can only be applied to models that use attention, which is not the case for FalconMamba. See the announcement blog post: "Welcome FalconMamba: The first strong attention-free 7B model".

Request the team/members to look into the issue & address it ASAP.

There is no issue to address.

LysandreJik avatar Sep 18 '24 14:09 LysandreJik
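[Editor's note] For readers hitting the same latency problem: FalconMamba loads fine as long as you simply do not pass `attn_implementation="flash_attention_2"`. A hedged sketch follows; the model ID comes from the FalconMamba announcement linked above, and the loader is wrapped in a function (with a lazy import) because downloading the 7B checkpoint requires substantial disk space and memory.

```python
# Hypothetical helper, not from this thread: load FalconMamba on the
# default (non-FlashAttention) code path. Inference speedups for Mamba
# models come from their fused selective-scan kernels, not FlashAttention.
MODEL_ID = "tiiuae/falcon-mamba-7b"  # from the linked FalconMamba announcement


def load_falcon_mamba(model_id: str = MODEL_ID):
    # Imported lazily so this sketch can be defined without transformers
    # installed; calling it downloads the full checkpoint.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",  # keep the checkpoint's native dtype
        device_map="auto",   # place layers on available GPU(s); needs accelerate
        # NOTE: deliberately no attn_implementation="flash_attention_2";
        # passing it is exactly what raises the ValueError in this issue.
    )
    return tokenizer, model
```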

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Oct 19 '24 08:10 github-actions[bot]