[Enhancement]: Implement optimizations used in CTranslate2
CTranslate2 is a "competitor" to llama.cpp that advertises itself with:
Fast and efficient execution on CPU and GPU
The execution is significantly faster and requires less resources than general-purpose deep learning frameworks on supported models and tasks thanks to many advanced optimizations: layer fusion, padding removal, batch reordering, in-place operations, caching mechanism, etc.
I am no expert in LLMs and I don't know exactly what these optimizations are, but I am asking: would it be feasible and/or desirable to implement these optimizations in llama.cpp or GGML?
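(For context, not from the thread: a minimal sketch of what "layer fusion" typically means in practice. Instead of running the bias add and the activation as two separate passes over the output tensor, they are folded into a single loop so each element is read and written once. The function names are illustrative, not ggml or CTranslate2 API.)

```c
// Illustrative only: fusing bias-add + GELU into one pass over the output,
// instead of two separate elementwise kernels (two reads/writes of y).
#include <math.h>
#include <stddef.h>

static float gelu(float x) {
    // tanh approximation of GELU
    return 0.5f * x * (1.0f + tanhf(0.79788456f * (x + 0.044715f * x * x * x)));
}

// Unfused: y = gelu(y + b) done as two loops (two passes over memory).
void bias_then_gelu(float *y, const float *b, size_t rows, size_t cols) {
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            y[i*cols + j] += b[j];
    for (size_t i = 0; i < rows * cols; i++)
        y[i] = gelu(y[i]);
}

// Fused: one pass, each element is loaded and stored exactly once.
void bias_gelu_fused(float *y, const float *b, size_t rows, size_t cols) {
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            y[i*cols + j] = gelu(y[i*cols + j] + b[j]);
}
```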
(Hi there, I'm the author of CTranslate2.)
llama.cpp already implements similar optimizations. They often come naturally when reimplementing a model in C/C++.
In my experience the most impactful optimization is to integrate vendor-specific libraries to run the matrix multiplications, which are usually the bottleneck for these models. For example, Apple Accelerate was a huge win for performance when it was first integrated into whisper.cpp. For x64 processors I recommend oneDNN, which has a very good 8-bit GEMM implementation (as fast as Intel MKL).
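(To make "integrate vendor-specific libraries to run the matrix multiplications" concrete: the hot matmuls can be routed to a BLAS `sgemm` call, which resolves to Apple Accelerate on macOS or OpenBLAS/MKL elsewhere, instead of a hand-rolled loop. A minimal sketch assuming a CBLAS header is available; this is not the actual whisper.cpp/llama.cpp integration code.)

```c
// Sketch: delegating C = A * B^T (row-major, fp32) to a vendor BLAS sgemm.
#ifdef __APPLE__
#include <Accelerate/Accelerate.h>
#else
#include <cblas.h>
#endif

// A: m x k, B: n x k (so B^T is k x n), C: m x n, all row-major.
void matmul_f32(const float *A, const float *B, float *C, int m, int n, int k) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                m, n, k,
                1.0f, A, k,
                      B, k,
                0.0f, C, n);
}
```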
However, I'm not aware of similar libraries providing efficient 4-bit GEMM at this time, and I also understand that llama.cpp is trying to avoid additional dependencies as much as possible.
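(To illustrate why off-the-shelf GEMM libraries don't cover the 4-bit case: 4-bit weights are typically stored in small blocks with a per-block scale, and the kernel has to unpack and rescale them on the fly, which is why llama.cpp ships custom kernels rather than calling into a standard GEMM. A rough sketch of one block-quantized dot product, with a made-up block layout rather than ggml's exact Q4_0 format:)

```c
#include <stdint.h>
#include <stddef.h>

// Hypothetical 4-bit block: 32 weights packed into 16 bytes plus one fp32 scale.
// (ggml's real formats differ in the details; this only shows the general idea.)
#define QBLOCK 32
typedef struct {
    float   scale;              // per-block scale
    uint8_t packed[QBLOCK/2];   // two 4-bit weights per byte, stored as w + 8
} block_q4;

// Dot product of n block-quantized weights against fp32 activations.
float dot_q4_f32(const block_q4 *w, const float *x, size_t n) {
    float sum = 0.0f;
    for (size_t b = 0; b < n / QBLOCK; b++) {
        float block_sum = 0.0f;
        for (int i = 0; i < QBLOCK/2; i++) {
            // unpack two 4-bit values and shift back to the signed range [-8, 7]
            int w0 = (w[b].packed[i] & 0x0F) - 8;
            int w1 = (w[b].packed[i] >>   4) - 8;
            block_sum += w0 * x[b*QBLOCK + 2*i + 0];
            block_sum += w1 * x[b*QBLOCK + 2*i + 1];
        }
        sum += w[b].scale * block_sum;
    }
    return sum;
}
```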
So we are already fusing and tiling the attention layer to fit in CPU SRAM, à la flash attention?
Edit: I guess it is currently being experimented on: https://github.com/ggerganov/llama.cpp/pull/778
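(For readers unfamiliar with what "fusing and tiling the attention layer" refers to: the flash-attention trick is to stream over the keys/values in blocks and keep a running, online softmax, so the full n×n score matrix is never materialized. Below is a minimal single-head, single-query CPU sketch of the online-softmax recurrence; it is illustrative only and unrelated to the code in the linked PR.)

```c
#include <math.h>
#include <float.h>
#include <stddef.h>

// One query row 'q' against 'n_kv' keys/values of dimension 'd',
// processed in blocks of 'B' so only a small tile of scores is live at once.
void attention_row(const float *q, const float *K, const float *V,
                   float *out, size_t n_kv, size_t d, size_t B) {
    const float scale = 1.0f / sqrtf((float) d);
    float m = -FLT_MAX;   // running max of the scores
    float l = 0.0f;       // running sum of exp(score - m)
    for (size_t t = 0; t < d; t++) out[t] = 0.0f;

    for (size_t j0 = 0; j0 < n_kv; j0 += B) {
        size_t j1 = j0 + B < n_kv ? j0 + B : n_kv;

        // max score within this block
        float m_new = m;
        for (size_t j = j0; j < j1; j++) {
            float s = 0.0f;
            for (size_t t = 0; t < d; t++) s += q[t] * K[j*d + t];
            s *= scale;
            if (s > m_new) m_new = s;
        }

        // rescale the running accumulator to the new max
        float alpha = expf(m - m_new);
        l *= alpha;
        for (size_t t = 0; t < d; t++) out[t] *= alpha;

        // accumulate this block's contribution
        for (size_t j = j0; j < j1; j++) {
            float s = 0.0f;
            for (size_t t = 0; t < d; t++) s += q[t] * K[j*d + t];
            float p = expf(s*scale - m_new);
            l += p;
            for (size_t t = 0; t < d; t++) out[t] += p * V[j*d + t];
        }
        m = m_new;
    }

    for (size_t t = 0; t < d; t++) out[t] /= l;   // final softmax normalization
}
```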
This issue was closed because it has been inactive for 14 days since being marked as stale.