
Implement Flash Attention Option

Open · aeryncaen opened this issue 3 years ago · 2 comments

Would love to see a faster, more memory-efficient attention implementation, such as Flash Attention. :)

aeryncaen avatar Mar 11 '23 18:03 aeryncaen

In whisper.cpp I tried using FA in the Decoder and it did not help (it does help a lot in the Encoder). I guess it is a matter of the tensor sizes, but of course, maybe I didn't implement it properly.

https://github.com/ggerganov/whisper.cpp/pull/284

ggerganov avatar Mar 12 '23 06:03 ggerganov
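For context, the core idea of Flash Attention is to compute exact attention without ever materializing the full score matrix, by streaming over K/V blocks with an online softmax. A minimal NumPy sketch of that idea (illustrative only, not the whisper.cpp/ggml implementation):

```python
import numpy as np

def naive_attention(q, k, v):
    """Standard attention: materializes the full (n, n) score matrix."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

def flash_attention(q, k, v, block=4):
    """Flash-style attention: processes K/V in blocks with an online
    softmax, so only a (n, block) slice of scores exists at a time."""
    n, d = q.shape
    o = np.zeros_like(q)
    m = np.full(n, -np.inf)        # running row-wise max
    l = np.zeros(n)                # running softmax denominator
    for j in range(0, k.shape[0], block):
        kj, vj = k[j:j + block], v[j:j + block]
        s = q @ kj.T / np.sqrt(d)              # scores for this block only
        m_new = np.maximum(m, s.max(axis=-1))
        scale = np.exp(m - m_new)              # rescale old accumulators
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=-1)
        o = o * scale[:, None] + p @ vj
        m = m_new
    return o / l[:, None]
```

Both functions return the same result up to floating-point rounding; the difference is memory traffic, which is also why the benefit depends heavily on tensor sizes, as noted above.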

> In whisper.cpp I tried using FA in the Decoder and it did not help (it does help a lot in the Encoder). I guess it is a matter of the tensor sizes, but of course, maybe I didn't implement it properly.
>
> ggerganov/whisper.cpp#284

Is it possible to implement multi-query attention then?

Orevantum avatar Mar 13 '23 09:03 Orevantum
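For reference, multi-query attention shares a single K/V head across all query heads, which shrinks the KV cache by a factor of the head count. A minimal NumPy sketch (illustrative only, not llama.cpp code):

```python
import numpy as np

def multi_query_attention(q, k, v):
    """Multi-query attention: every query head attends over one shared
    K/V head, so the KV cache holds 1 head instead of n_head.
    q: (n_head, n, d); k, v: (n, d) -- the single shared head."""
    d = q.shape[-1]
    s = q @ k.T / np.sqrt(d)                        # (n_head, n, n)
    p = np.exp(s - s.max(axis=-1, keepdims=True))   # softmax per head
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v                                     # (n_head, n, d)
```

The savings are in the cache, not the arithmetic: each head still does a full attention computation, but K and V are stored (and loaded) once.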

Also note that in FlexGen they use top-10% sparse attention.

xloem avatar Apr 05 '23 07:04 xloem
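A rough sketch of what top-k sparse attention looks like (the masking scheme below is illustrative, not FlexGen's actual implementation):

```python
import numpy as np

def topk_sparse_attention(q, k, v, frac=0.10):
    """Lossy sparse attention: per query row, keep only the top `frac`
    fraction of scores and mask the rest to -inf before the softmax."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    keep = max(1, int(np.ceil(frac * s.shape[-1])))
    thresh = np.sort(s, axis=-1)[:, -keep][:, None]   # per-row cutoff
    s = np.where(s >= thresh, s, -np.inf)             # drop low scores
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v
```

With `frac=1.0` this reduces to dense attention; with `frac=0.10` it approximates it, which is the lossiness the next comment refers to.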

> also note in flexgen they use top 10% sparse attention

Sparse attention is cool, but lossy. Flash attention is exact.

jamesbiederbeck avatar Aug 21 '23 03:08 jamesbiederbeck
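The "exact" claim can be checked numerically: the one-pass online-softmax recurrence that Flash Attention relies on matches the ordinary two-pass softmax up to floating-point rounding. A small sketch:

```python
import numpy as np

def online_softmax(x, block=4):
    """One-pass streaming softmax, as used by Flash Attention: maintain
    a running max m and running denominator l across blocks of x."""
    m, l = -np.inf, 0.0
    for i in range(0, len(x), block):
        xb = x[i:i + block]
        m_new = max(m, xb.max())
        l = l * np.exp(m - m_new) + np.exp(xb - m_new).sum()
        m = m_new
    return np.exp(x - m) / l

x = np.random.randn(16)
two_pass = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x), two_pass)
```

No attention weight is dropped, only the order of accumulation changes, which is why it is exact where top-k sparsity is not.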