
Implement Flash Attention Option

Open · aeryncaen opened this issue 3 years ago · 2 comments

Would love to see a faster, more memory-efficient attention implementation, such as Flash Attention. :)

aeryncaen avatar Mar 11 '23 18:03 aeryncaen

In whisper.cpp I tried using FA in the Decoder and it did not help (it does help a lot in the Encoder). I guess it is a matter of the tensor sizes, but of course, maybe I didn't implement it properly.

https://github.com/ggerganov/whisper.cpp/pull/284

ggerganov avatar Mar 12 '23 06:03 ggerganov
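For context, the core idea of Flash Attention is to compute exact attention without ever materializing the full score matrix, by streaming over K/V blocks with an online softmax. A minimal NumPy sketch of that idea (illustrative only, not the whisper.cpp/ggml implementation):

```python
import numpy as np

def naive_attention(q, k, v):
    """Standard attention: materializes the full (n, n) score matrix."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

def flash_attention(q, k, v, block=4):
    """Flash-style attention: processes K/V in blocks with an online
    softmax, so only a (n, block) slice of scores exists at a time."""
    n, d = q.shape
    o = np.zeros_like(q)
    m = np.full(n, -np.inf)        # running row-wise max
    l = np.zeros(n)                # running softmax denominator
    for j in range(0, k.shape[0], block):
        kj, vj = k[j:j + block], v[j:j + block]
        s = q @ kj.T / np.sqrt(d)              # scores for this block only
        m_new = np.maximum(m, s.max(axis=-1))
        scale = np.exp(m - m_new)              # rescale old accumulators
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=-1)
        o = o * scale[:, None] + p @ vj
        m = m_new
    return o / l[:, None]
```

Both functions return the same result up to floating-point rounding; the difference is memory traffic, which is also why the benefit depends heavily on tensor sizes, as noted above.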

> In whisper.cpp I tried using FA in the Decoder and it did not help (it does help a lot in the Encoder). I guess it is a matter of the tensor sizes, but of course, maybe I didn't implement it properly.
>
> ggerganov/whisper.cpp#284

Is it possible to implement multi-query attention then?

Orevantum avatar Mar 13 '23 09:03 Orevantum
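For reference, multi-query attention shares a single K/V head across all query heads, which shrinks the KV cache by a factor of the head count. A minimal NumPy sketch (illustrative only, not llama.cpp code):

```python
import numpy as np

def multi_query_attention(q, k, v):
    """Multi-query attention: every query head attends over one shared
    K/V head, so the KV cache holds 1 head instead of n_head.
    q: (n_head, n, d); k, v: (n, d) -- the single shared head."""
    d = q.shape[-1]
    s = q @ k.T / np.sqrt(d)                        # (n_head, n, n)
    p = np.exp(s - s.max(axis=-1, keepdims=True))   # softmax per head
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v                                     # (n_head, n, d)
```

The savings are in the cache, not the arithmetic: each head still does a full attention computation, but K and V are stored (and loaded) once.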

Also note that in FlexGen they use top-10% sparse attention.

xloem avatar Apr 05 '23 07:04 xloem
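A rough sketch of what top-k sparse attention looks like (the masking scheme below is illustrative, not FlexGen's actual implementation):

```python
import numpy as np

def topk_sparse_attention(q, k, v, frac=0.10):
    """Lossy sparse attention: per query row, keep only the top `frac`
    fraction of scores and mask the rest to -inf before the softmax."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    keep = max(1, int(np.ceil(frac * s.shape[-1])))
    thresh = np.sort(s, axis=-1)[:, -keep][:, None]   # per-row cutoff
    s = np.where(s >= thresh, s, -np.inf)             # drop low scores
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v
```

With `frac=1.0` this reduces to dense attention; with `frac=0.10` it approximates it, which is the lossiness the next comment refers to.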

> also note in flexgen they use top 10% sparse attention

Sparse attention is cool, but lossy. Flash attention is exact.

jamesbiederbeck avatar Aug 21 '23 03:08 jamesbiederbeck
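The "exact" claim can be checked numerically: the one-pass online-softmax recurrence that Flash Attention relies on matches the ordinary two-pass softmax up to floating-point rounding. A small sketch:

```python
import numpy as np

def online_softmax(x, block=4):
    """One-pass streaming softmax, as used by Flash Attention: maintain
    a running max m and running denominator l across blocks of x."""
    m, l = -np.inf, 0.0
    for i in range(0, len(x), block):
        xb = x[i:i + block]
        m_new = max(m, xb.max())
        l = l * np.exp(m - m_new) + np.exp(xb - m_new).sum()
        m = m_new
    return np.exp(x - m) / l

x = np.random.randn(16)
two_pass = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x), two_pass)
```

No attention weight is dropped, only the order of accumulation changes, which is why it is exact where top-k sparsity is not.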