flash-linear-attention
Hello from HF Diffusers
Thanks for the incredibly clean repository!
I am Sayak from the Diffusers team at Hugging Face. My question is probably very naive, so I apologize for that in advance.
I wanted to know whether linear attention can be applied at inference time only. More precisely, can I take a model trained with regular attention and turn it into a linear attention model during inference?
Hello Sayak,
Thanks for your interest! Unfortunately, we cannot directly convert a model with softmax attention into one with linear attention at inference time without any additional training. However, it is indeed possible to fine-tune pretrained LLMs for a few steps (far fewer than training from scratch requires) to switch from regular attention to linear attention. You can refer to these resources for more details: arXiv:2405.06640, OpenReview, etc.
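To give a rough picture of the "swap then fine-tune" idea, here is a minimal, hypothetical sketch in plain PyTorch: reuse the pretrained q/k/v/o projections, replace only the softmax attention kernel with a linear-attention one, and then fine-tune briefly. The module names, the ELU+1 feature map, and the non-causal formulation are all illustrative assumptions, not the recipe from the cited papers or from this repo.

```python
# Hypothetical sketch: swap softmax attention for linear attention
# while reusing the pretrained projection layers, then fine-tune briefly.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearAttention(nn.Module):
    """Non-causal linear attention with an ELU+1 feature map (illustrative only)."""

    def __init__(self, q_proj, k_proj, v_proj, o_proj, num_heads):
        super().__init__()
        # Reuse the pretrained projections so only the attention "kernel" changes.
        self.q_proj, self.k_proj, self.v_proj, self.o_proj = q_proj, k_proj, v_proj, o_proj
        self.num_heads = num_heads

    def forward(self, x):
        b, t, d = x.shape
        h = self.num_heads
        q = self.q_proj(x).view(b, t, h, -1).transpose(1, 2)  # (b, h, t, d_head)
        k = self.k_proj(x).view(b, t, h, -1).transpose(1, 2)
        v = self.v_proj(x).view(b, t, h, -1).transpose(1, 2)
        # phi(x) = elu(x) + 1 keeps the features positive.
        q, k = F.elu(q) + 1, F.elu(k) + 1
        kv = torch.einsum("bhtd,bhte->bhde", k, v)                    # sum_t phi(k_t) v_t^T
        z = 1.0 / (torch.einsum("bhtd,bhd->bht", q, k.sum(dim=2)) + 1e-6)
        out = torch.einsum("bhtd,bhde,bht->bhte", q, kv, z)           # linear-time attention
        return self.o_proj(out.transpose(1, 2).reshape(b, t, d))


def swap_attention(module, num_heads):
    """Recursively replace modules named 'self_attn' (a layout assumption)."""
    for name, child in module.named_children():
        if name == "self_attn" and all(hasattr(child, p) for p in ("q_proj", "k_proj", "v_proj", "o_proj")):
            setattr(module, name, LinearAttention(child.q_proj, child.k_proj,
                                                  child.v_proj, child.o_proj, num_heads))
        else:
            swap_attention(child, num_heads)
    return module
```

The converted model would then be fine-tuned for a relatively small number of tokens so it can adapt to the new attention kernel; the papers above describe the actual recipes and the extra components (gating, slot states, etc.) that make this work well in practice.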
@sayakpaul FYI, we have released some weights converted from Mistral-7B-v0.1 as described in arXiv:2405.06640. You can give them a try by loading fla-hub/gla-7B-mistral-20B, fla-hub/gsa-7B-mistral-20B, or fla-hub/gsa-7B-mistral-100B.
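If it helps, here is a minimal loading sketch. Assuming the flash-linear-attention package is installed, importing `fla` should register the custom model classes so the standard transformers loaders work; the checkpoint name is taken from the comment above, and the dtype/generation settings are just for illustration.

```python
import fla  # noqa: F401  # registers the GLA/GSA model classes with transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "fla-hub/gla-7B-mistral-20B"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("Linear attention is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```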
Would there be any interest in trying something similar for diffusion models? I am happy to team up. I believe this would be a very significant contribution to the community now that large models like Flux are coming out.
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 7 days since being marked as stale.