Flash attention and flash decoding principles

Open RonanKMcGovern opened this issue 2 years ago • 6 comments

Are there plans to add flash attention and also flash decoding to allow for improved performance for long context?

Dec 11 '23 14:12 RonanKMcGovern

We'd love to have this. Our first priority is quantization but when we have the bandwidth we can look into adding Flash attention. (Note PRs are welcome)

Dec 11 '23 15:12 awni

> We'd love to have this. Our first priority is quantization but when we have the bandwidth we can look into adding Flash attention. (Note PRs are welcome)

I'd messaged the maintainer of this project a few days ago because it seemed like he's dedicated to it and I saw he wanted to implement it in 1 or 2 other projects. But in case you can't get ahold of 'em, you have the link ¯\_(ツ)_/¯

Dec 12 '23 23:12 BuildBackBuehler

This would be amazing! It would let us integrate with the amazing axolotl!

Apr 28 '24 12:04 ivanfioravanti