ValueError: FalconMambaForCausalLM does not support Flash Attention 2.0 yet
I'm facing latency issues while running inference with the Falcon LLM. The latency is around 20-30 minutes for a specific use case. I want to reduce this time and found that installing Flash Attention 2 can significantly speed up inference, but it is not yet supported for Falcon.
I request the team/members to look into the issue and address it as soon as possible.
@Cshekar24 FalconMamba is a pure Mamba model; it doesn't use attention at all.
As @avishaiElmakies mentions, Flash Attention can only be applied to models that use attention, which is not the case for FalconMamba. See: Welcome FalconMamba: The first strong attention-free 7B model.
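Since the model has no attention layers, the fix is simply to load it without any `attn_implementation` argument. A minimal sketch (assuming the `transformers` library with FalconMamba support and the `tiiuae/falcon-mamba-7b` checkpoint; the helper name is just for illustration):

```python
from transformers import AutoModelForCausalLM
import torch

def load_falcon_mamba(model_id: str = "tiiuae/falcon-mamba-7b"):
    """Load FalconMamba without an attn_implementation argument.

    Passing attn_implementation="flash_attention_2" here would raise the
    ValueError from this issue, because the model is attention-free.
    """
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # half precision alone reduces inference latency
        device_map="auto",           # place layers on available GPUs automatically
    )
```

For latency, the usual levers still apply to Mamba-style models: half-precision weights, batching, and keeping generation on GPU; Flash Attention is simply not one of them.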
> Request the team/members to look into the issue & address it ASAP.
There is no issue to address.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.