Demo usage of Flash Attention
This is my understanding of how Flash Attention works, based on the diagram in the Flash Attention repository:
ref: https://github.com/HazyResearch/flash-attention
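For reference, the key step that diagram illustrates (my summary of the Flash Attention paper, not part of the original note) is the online softmax: attention is computed block by block over K/V while keeping a running row maximum m, normalizer l, and output O, which are rescaled as each new block of scores s_j = q·k_j arrives, so the full attention matrix is never materialized:

```latex
\begin{aligned}
m' &= \max\bigl(m,\ \max_j s_j\bigr) \\
l' &= e^{\,m - m'}\, l + \sum_j e^{\,s_j - m'} \\
O' &= \frac{e^{\,m - m'}\, l}{l'}\, O + \frac{1}{l'} \sum_j e^{\,s_j - m'}\, v_j
\end{aligned}
```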
The implementation is here:
https://github.com/ggerganov/llama.cpp/blob/flash-attn/ggml.c#L8122-L8367
I don't plan on merging this because on M1 the performance is the same as without FA.
However, in whisper.cpp I have gained performance by using this exact same call in the Encoder:
https://github.com/ggerganov/whisper.cpp/blob/0a2d1210bcb98978214bbf4e100922a413afd39d/whisper.cpp#L1482-L1508
Putting this here in case someone wants to play with it or figure out how to implement sparse attention.
The idea is just to merge the ggml operators into a single op and avoid the intermediate tensors.
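For illustration, here is a rough sketch of what the fusion amounts to, modeled on the linked whisper.cpp encoder code. The helper names, tensor layouts, and the exact ggml_scale / ggml_flash_attn signatures are assumptions based on the ggml API of that time, not code taken from the branch:

```c
#include <math.h>
#include <stdbool.h>
#include "ggml.h"

// Sketch: build the attention sub-graph the "unfused" way (several ops, each
// materializing an intermediate tensor) vs. with the fused ggml_flash_attn op.
// masked = false corresponds to encoder (non-causal) attention.
static struct ggml_tensor * attn_unfused(
        struct ggml_context * ctx,
        struct ggml_tensor  * Q,    // queries
        struct ggml_tensor  * K,    // keys
        struct ggml_tensor  * Vt,   // values, pre-transposed
        int n_embd, int n_head) {
    struct ggml_tensor * KQ        = ggml_mul_mat(ctx, K, Q);           // attention scores
    struct ggml_tensor * KQ_scaled = ggml_scale(ctx, KQ,
            ggml_new_f32(ctx, 1.0f/sqrtf((float) n_embd/n_head)));      // 1/sqrt(d_head)
    struct ggml_tensor * KQ_soft   = ggml_soft_max(ctx, KQ_scaled);     // softmax over keys
    return ggml_mul_mat(ctx, Vt, KQ_soft);                              // weighted sum of V
}

static struct ggml_tensor * attn_fused(
        struct ggml_context * ctx,
        struct ggml_tensor  * Q,
        struct ggml_tensor  * K,
        struct ggml_tensor  * Vt) {
    // single op: KQ, KQ_scaled and KQ_soft never appear in the graph
    return ggml_flash_attn(ctx, Q, K, Vt, /*masked =*/ false);
}
```

Not writing the KQ / KQ_scaled / KQ_soft tensors into the graph is where the memory saving comes from, and on some backends also the speedup.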