[Enhancement]: Implement optimizations used in CTranslate2
CTranslate2 is a "competitor" to llama.cpp that advertises itself with:
Fast and efficient execution on CPU and GPU
The execution is significantly faster and requires less resources than general-purpose deep learning frameworks on supported models and tasks thanks to many advanced optimizations: layer fusion, padding removal, batch reordering, in-place operations, caching mechanism, etc.
I am no expert in LLMs and I don't know exactly what these optimizations are, but I am asking: would it be feasible and/or desirable to implement these optimizations in llama.cpp or GGML?
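(For context, not from the thread: a minimal sketch of what "layer fusion" typically means in practice. Instead of running the bias add and the activation as two separate passes over the output tensor, they are folded into a single loop so each element is read and written once. The function names are illustrative, not ggml or CTranslate2 API.)

```c
// Illustrative only: fusing bias-add + GELU into one pass over the output,
// instead of two separate elementwise kernels (two reads/writes of y).
#include <math.h>
#include <stddef.h>

static float gelu(float x) {
    // tanh approximation of GELU
    return 0.5f * x * (1.0f + tanhf(0.79788456f * (x + 0.044715f * x * x * x)));
}

// Unfused: y = gelu(y + b) done as two loops (two passes over memory).
void bias_then_gelu(float *y, const float *b, size_t rows, size_t cols) {
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            y[i*cols + j] += b[j];
    for (size_t i = 0; i < rows * cols; i++)
        y[i] = gelu(y[i]);
}

// Fused: one pass, each element is loaded and stored exactly once.
void bias_gelu_fused(float *y, const float *b, size_t rows, size_t cols) {
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            y[i*cols + j] = gelu(y[i*cols + j] + b[j]);
}
```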
(Hi there, I'm the author of CTranslate2.)
llama.cpp already implements similar optimizations. They often come naturally when reimplementing a model in C/C++.
In my experience the most impactful optimization is to integrate vendor-specific libraries to run the matrix multiplications, which are usually the bottleneck for these models. For example, Apple Accelerate was a huge win for performance when it was first integrated into whisper.cpp. For x64 processors I recommend oneDNN, which has a very good 8-bit GEMM implementation (as fast as Intel MKL).
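(To make "integrate vendor-specific libraries to run the matrix multiplications" concrete: the hot matmuls can be routed to a BLAS `sgemm` call, which resolves to Apple Accelerate on macOS or OpenBLAS/MKL elsewhere, instead of a hand-rolled loop. A minimal sketch assuming a CBLAS header is available; this is not the actual whisper.cpp/llama.cpp integration code.)

```c
// Sketch: delegating C = A * B^T (row-major, fp32) to a vendor BLAS sgemm.
#ifdef __APPLE__
#include <Accelerate/Accelerate.h>
#else
#include <cblas.h>
#endif

// A: m x k, B: n x k (so B^T is k x n), C: m x n, all row-major.
void matmul_f32(const float *A, const float *B, float *C, int m, int n, int k) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                m, n, k,
                1.0f, A, k,
                      B, k,
                0.0f, C, n);
}
```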
However, I'm not aware of similar libraries providing efficient 4-bit GEMM at this time, and I also understand that llama.cpp is trying to avoid additional dependencies as much as possible.
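(To illustrate why off-the-shelf GEMM libraries don't cover the 4-bit case: 4-bit weights are typically stored in small blocks with a per-block scale, and the kernel has to unpack and rescale them on the fly, which is why llama.cpp ships custom kernels rather than calling into a standard GEMM. A rough sketch of one block-quantized dot product, with a made-up block layout rather than ggml's exact Q4_0 format:)

```c
#include <stdint.h>
#include <stddef.h>

// Hypothetical 4-bit block: 32 weights packed into 16 bytes plus one fp32 scale.
// (ggml's real formats differ in the details; this only shows the general idea.)
#define QBLOCK 32
typedef struct {
    float   scale;              // per-block scale
    uint8_t packed[QBLOCK/2];   // two 4-bit weights per byte, stored as w + 8
} block_q4;

// Dot product of n block-quantized weights against fp32 activations.
float dot_q4_f32(const block_q4 *w, const float *x, size_t n) {
    float sum = 0.0f;
    for (size_t b = 0; b < n / QBLOCK; b++) {
        float block_sum = 0.0f;
        for (int i = 0; i < QBLOCK/2; i++) {
            // unpack two 4-bit values and shift back to the signed range [-8, 7]
            int w0 = (w[b].packed[i] & 0x0F) - 8;
            int w1 = (w[b].packed[i] >>   4) - 8;
            block_sum += w0 * x[b*QBLOCK + 2*i + 0];
            block_sum += w1 * x[b*QBLOCK + 2*i + 1];
        }
        sum += w[b].scale * block_sum;
    }
    return sum;
}
```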
So we are already fusing and tiling the attention layer to fit in CPU SRAM, à la flash attention?
Edit: I guess it is currently being experimented on: https://github.com/ggerganov/llama.cpp/pull/778
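(For readers unfamiliar with what "fusing and tiling the attention layer" refers to: the flash-attention trick is to stream over the keys/values in blocks and keep a running, online softmax, so the full n×n score matrix is never materialized. Below is a minimal single-head, single-query CPU sketch of the online-softmax recurrence; it is illustrative only and unrelated to the code in the linked PR.)

```c
#include <math.h>
#include <float.h>
#include <stddef.h>

// One query row 'q' against 'n_kv' keys/values of dimension 'd',
// processed in blocks of 'B' so only a small tile of scores is live at once.
void attention_row(const float *q, const float *K, const float *V,
                   float *out, size_t n_kv, size_t d, size_t B) {
    const float scale = 1.0f / sqrtf((float) d);
    float m = -FLT_MAX;   // running max of the scores
    float l = 0.0f;       // running sum of exp(score - m)
    for (size_t t = 0; t < d; t++) out[t] = 0.0f;

    for (size_t j0 = 0; j0 < n_kv; j0 += B) {
        size_t j1 = j0 + B < n_kv ? j0 + B : n_kv;

        // max score within this block
        float m_new = m;
        for (size_t j = j0; j < j1; j++) {
            float s = 0.0f;
            for (size_t t = 0; t < d; t++) s += q[t] * K[j*d + t];
            s *= scale;
            if (s > m_new) m_new = s;
        }

        // rescale the running accumulator to the new max
        float alpha = expf(m - m_new);
        l *= alpha;
        for (size_t t = 0; t < d; t++) out[t] *= alpha;

        // accumulate this block's contribution
        for (size_t j = j0; j < j1; j++) {
            float s = 0.0f;
            for (size_t t = 0; t < d; t++) s += q[t] * K[j*d + t];
            float p = expf(s*scale - m_new);
            l += p;
            for (size_t t = 0; t < d; t++) out[t] += p * V[j*d + t];
        }
        m = m_new;
    }

    for (size_t t = 0; t < d; t++) out[t] /= l;   // final softmax normalization
}
```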
This issue was closed because it has been inactive for 14 days since being marked as stale.