Implement Flash Attention Option
Would love to see a faster, more memory-efficient attention implementation, such as Flash Attention. :)
In whisper.cpp I tried using FA in the Decoder and it did not help (it does help a lot in the Encoder).
I guess it is a matter of the tensor sizes, but of course, maybe I didn't implement it properly.
https://github.com/ggerganov/whisper.cpp/pull/284
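For reference, the core trick behind Flash Attention is tiling plus an online softmax: it streams over blocks of K/V while keeping a running max, running normalizer, and running output per query row, so the full score matrix is never materialized. Here is a minimal NumPy sketch of that idea (the function names are mine; this illustrates the math, not the actual fused GPU kernel):

```python
import numpy as np

def naive_attention(Q, K, V):
    # Materializes the full (n_q, n_k) score matrix -- the memory cost FA avoids.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def flash_attention(Q, K, V, block=16):
    # Streams over K/V in blocks, keeping a running row max (m), running
    # softmax normalizer (l), and running output (O), so only a
    # (n_q, block) tile of scores exists at any time.
    n_q, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((n_q, V.shape[-1]))
    m = np.full((n_q, 1), -np.inf)   # running max per query row
    l = np.zeros((n_q, 1))           # running sum of exp per query row
    for j in range(0, K.shape[0], block):
        S = Q @ K[j:j + block].T * scale          # (n_q, block) partial scores
        m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
        P = np.exp(S - m_new)                     # partial softmax numerator
        correction = np.exp(m - m_new)            # rescale old accumulators
        l = l * correction + P.sum(axis=-1, keepdims=True)
        O = O * correction + P @ V[j:j + block]
        m = m_new
    return O / l
```

Both functions return the same result (up to floating-point error), which is the point: Flash Attention is a memory/scheduling optimization, not an approximation.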
Is it possible to implement multi-query attention then?
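In case it helps the discussion: multi-query attention is orthogonal to Flash Attention. MQA keeps one query projection per head but shares a single K/V head across all of them, which shrinks the KV cache by a factor of n_heads. A rough NumPy sketch under that assumption (names are mine, and this omits projections, masking, and caching):

```python
import numpy as np

def multi_query_attention(Q, K, V):
    # Q: (n_heads, n_q, d) -- one query projection per head.
    # K, V: (n_k, d)       -- a single shared key/value head (the MQA idea),
    # so the KV cache is n_heads times smaller than in standard MHA.
    scale = 1.0 / np.sqrt(Q.shape[-1])
    S = Q @ K.T * scale                        # broadcasts to (n_heads, n_q, n_k)
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V                               # (n_heads, n_q, d)
```

Since MQA changes only the K/V sharing, a Flash-Attention-style kernel could in principle be applied on top of it.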
Also note that in FlexGen they use top-10% sparse attention.
Sparse attention is cool, but lossy. Flash attention is exact.
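To make the exact-vs-lossy distinction concrete, here is a hedged NumPy sketch of top-k sparse attention in the spirit of FlexGen's top-10% variant (my own simplified version, not their implementation): only the top fraction of scores per query row is kept, the rest are masked out, so the result deviates from the exact softmax.

```python
import numpy as np

def topk_sparse_attention(Q, K, V, keep=0.1):
    # Keeps only the top `keep` fraction of scores per query row and masks
    # the rest to -inf before the softmax. This is an approximation --
    # unlike Flash Attention, which computes the exact result.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    k = max(1, int(keep * K.shape[0]))
    # k-th largest score per row, shape (n_q, 1)
    thresh = np.partition(S, -k, axis=-1)[:, -k:-k + 1]
    S = np.where(S >= thresh, S, -np.inf)
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V
```

With `keep=1.0` this reduces to exact attention; with `keep=0.1` it trades accuracy for the memory/compute savings FlexGen is after.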