Demo usage of Flash Attention
This is my understanding of how Flash Attention works, based on the diagram in the Flash Attention repository:
ref: https://github.com/HazyResearch/flash-attention
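For reference, the key step that diagram illustrates (my summary of the Flash Attention paper, not part of the original note) is the online softmax: attention is computed block by block over K/V while keeping a running row maximum m, normalizer l, and output O, which are rescaled as each new block of scores s_j = q·k_j arrives, so the full attention matrix is never materialized:

```latex
\begin{aligned}
m' &= \max\bigl(m,\ \max_j s_j\bigr) \\
l' &= e^{\,m - m'}\, l + \sum_j e^{\,s_j - m'} \\
O' &= \frac{e^{\,m - m'}\, l}{l'}\, O + \frac{1}{l'} \sum_j e^{\,s_j - m'}\, v_j
\end{aligned}
```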
The implementation is here:
https://github.com/ggerganov/llama.cpp/blob/flash-attn/ggml.c#L8122-L8367
I don't plan on merging this because on M1 the performance is the same as without FA.
However, in whisper.cpp I have gained performance by using this exact same call in the Encoder:
https://github.com/ggerganov/whisper.cpp/blob/0a2d1210bcb98978214bbf4e100922a413afd39d/whisper.cpp#L1482-L1508
Putting this here in case someone wants to play with it or figure out how to implement sparse attention.
The idea is just to merge the ggml operators into a single op and avoid the intermediate tensors.
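For illustration, here is a rough sketch of what the fusion amounts to, modeled on the linked whisper.cpp encoder code. The helper names, tensor layouts, and the exact ggml_scale / ggml_flash_attn signatures are assumptions based on the ggml API of that time, not code taken from the branch:

```c
#include <math.h>
#include <stdbool.h>
#include "ggml.h"

// Sketch: build the attention sub-graph the "unfused" way (several ops, each
// materializing an intermediate tensor) vs. with the fused ggml_flash_attn op.
// masked = false corresponds to encoder (non-causal) attention.
static struct ggml_tensor * attn_unfused(
        struct ggml_context * ctx,
        struct ggml_tensor  * Q,    // queries
        struct ggml_tensor  * K,    // keys
        struct ggml_tensor  * Vt,   // values, pre-transposed
        int n_embd, int n_head) {
    struct ggml_tensor * KQ        = ggml_mul_mat(ctx, K, Q);           // attention scores
    struct ggml_tensor * KQ_scaled = ggml_scale(ctx, KQ,
            ggml_new_f32(ctx, 1.0f/sqrtf((float) n_embd/n_head)));      // 1/sqrt(d_head)
    struct ggml_tensor * KQ_soft   = ggml_soft_max(ctx, KQ_scaled);     // softmax over keys
    return ggml_mul_mat(ctx, Vt, KQ_soft);                              // weighted sum of V
}

static struct ggml_tensor * attn_fused(
        struct ggml_context * ctx,
        struct ggml_tensor  * Q,
        struct ggml_tensor  * K,
        struct ggml_tensor  * Vt) {
    // single op: KQ, KQ_scaled and KQ_soft never appear in the graph
    return ggml_flash_attn(ctx, Q, K, Vt, /*masked =*/ false);
}
```

Not writing the KQ / KQ_scaled / KQ_soft tensors into the graph is where the memory saving comes from, and on some backends also the speedup.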