llama.cpp
perf: AVX2/AVX routines for tall and skinny matmul - up to 15X speedup
Benchmark results:
sizey=sizez=N,sizex=K,n_threads=8
K=8,N=8192,AVX2,FLOPS/us=27148.97
K=8,N=8192,AVX,FLOPS/us=15193.96
K=8,N=8192,default,FLOPS/us=1781.05
K=16,N=8192,AVX2,FLOPS/us=20128.26
K=16,N=8192,AVX,FLOPS/us=8224.13
K=16,N=8192,default,FLOPS/us=3540.52
K=32,N=8192,AVX2,FLOPS/us=13127.55
K=32,N=8192,AVX,FLOPS/us=9397.48
K=32,N=8192,default,FLOPS/us=6386.55
K=48,N=8192,AVX2,FLOPS/us=13206.16
K=48,N=8192,AVX,FLOPS/us=5801.21
K=48,N=8192,default,FLOPS/us=8199.44
K=64,N=8192,AVX2,FLOPS/us=10505.51
K=64,N=8192,AVX,FLOPS/us=6353.32
K=64,N=8192,default,FLOPS/us=13024.33
We choose the K cutoff point to be 32 for AVX and 48 for AVX2; above these values the default routine is faster.
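A minimal sketch of how such a cutoff could be applied (the function name and the exact dispatch site are illustrative, not the actual code in this PR):

#include <stdbool.h>

// Illustrative only: decide whether the tall-and-skinny kernel should be used
// for a given shared dimension K, based on the cutoffs measured above.
static bool use_tall_skinny_kernel(int K) {
#if defined(__AVX2__)
    return K <= 48;   // AVX2 kernel wins up to K = 48
#elif defined(__AVX__)
    return K <= 32;   // AVX kernel wins up to K = 32
#else
    return false;     // no specialised kernel; fall back to the default matmul
#endif
}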
Partial fix to: https://github.com/ggerganov/llama.cpp/issues/956
I will stop the investigation here for now; the time taken to apply LoRA is quite tolerable with these changes.
We are still quite far from optimal; for instance, I see 250K FLOPS/us on matmuls with high K (K=10000).
LoRA application informal benchmarks:
K=16
AVX2 - 5141.57 ms
AVX - 9831.28 ms
default - 22611.96 ms
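As a sanity check on the headline claim, the speedups implied by the numbers above work out to roughly $\frac{27148.97}{1781.05} \approx 15.2\times$ for the K=8 kernel (AVX2 vs. default), and $\frac{22611.96}{5141.57} \approx 4.4\times$ for end-to-end LoRA application (AVX2 vs. default).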
This might be a dumb question, but would something like matrix to vector multiplications count as tall and skinny? In other words, something like a 2d tensor with a 1d tensor.
Branch breaks the CMake build:
[ 95%] Building C object examples/benchmark/CMakeFiles/benchmark.dir/benchmark-q4_0-matmult.c.o
In file included from llama.cpp/examples/benchmark/benchmark-q4_0-matmult.c:11:
llama.cpp/./llama.h:77:22: warning: function declaration isn’t a prototype [-Wstrict-prototypes]
77 | LLAMA_API struct llama_context_params llama_context_default_params();
| ^~~~~~~~~~~~~~~~~~~~
llama.cpp/./llama.h:79:5: warning: function declaration isn’t a prototype [-Wstrict-prototypes]
79 | LLAMA_API bool llama_mmap_supported();
| ^~~~~~~~~
llama.cpp/./llama.h:80:5: warning: function declaration isn’t a prototype [-Wstrict-prototypes]
80 | LLAMA_API bool llama_mlock_supported();
| ^~~~~~~~~
llama.cpp/./llama.h:158:5: warning: function declaration isn’t a prototype [-Wstrict-prototypes]
158 | LLAMA_API llama_token llama_token_bos();
| ^~~~~~~~~
llama.cpp/./llama.h:159:5: warning: function declaration isn’t a prototype [-Wstrict-prototypes]
159 | LLAMA_API llama_token llama_token_eos();
| ^~~~~~~~~
llama.cpp/examples/benchmark/benchmark-q4_0-matmult.c:14:10: fatal error: cstring: No such file or directory
14 | #include <cstring>
| ^~~~~~~~~
I'm able to reproduce the performance improvements, impressive work!
Nice work. This will be very relevant for https://github.com/saharNooby/rwkv.cpp as well.
There are merge conflicts and failing CI.
This might be a dumb question, but would something like matrix to vector multiplications count as tall and skinny? In other words, something like a 2d tensor with a 1d tensor.
No, tall and skinny looks like:
 __
|  |      ___________
|  |  X  |__________|
|__|
Matrix to vector is
 ______________      __
|              |    |  |
|              | X  |  |
|______________|    |__|
No, tall and skinny looks like:
Thanks, I don't understand how it's tall and skinny though. :)
I'm not fat, I'm just big boned.
 __               /
|  |       ,____O_____,
|  |   X  =|__________|=
|__|            / \
Thanks, I don't understand how it's tall and skinny though. :)
:rofl:
Well, in our specific context, the matrices in the matmul are $B A^T$, so both $B$ and $A$ are tall and skinny ($A^T$ being short and wide).
A matmul is tall and skinny as long as the dimension along which the matrices are contracted is small compared to the adjacent dimension of one of the matrices, so the specific orientation does not matter.
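To make that concrete with sizes borrowed from the benchmarks above (illustrative shapes, not the exact tensor layout in ggml): with $B \in \mathbb{R}^{8192 \times 16}$ and $A \in \mathbb{R}^{8192 \times 16}$, the product $B A^T \in \mathbb{R}^{8192 \times 8192}$ contracts over $K = 16$, which is tiny compared to $N = 8192$, so both operands are tall and skinny.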
// i can only get it working with sse1 like the following because i have no FMA on my machine
// c_vec = _mm256_fmadd_ps(a, b_vec, c_vec); // FMA: c_vec += a * b_vec
c_vec = _mm_add_ps(c_vec, _mm_mul_ps(b_vec, a));
// i suppose it ruins the effort if one considers avx2
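For reference, a hedged sketch of what an FMA/non-FMA switch could look like with AVX intrinsics; this is not code from the PR, it only illustrates the equivalence the comment above describes:

#include <immintrin.h>

// c += a * b for 8 packed floats: uses the fused instruction when FMA is
// available, otherwise a separate multiply and add (one extra rounding step).
static inline __m256 madd256_ps(__m256 a, __m256 b, __m256 c) {
#if defined(__FMA__)
    return _mm256_fmadd_ps(a, b, c);
#else
    return _mm256_add_ps(c, _mm256_mul_ps(a, b));
#endif
}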
PTAL @slaren @ggerganov. Please refer to informal LoRA benchmarks for e2e validation.
@syl-00110111 Thanks for that info. I will simply not support hardware without FMA. You may expand on this PR (e.g. in a follow-up PR) if you want non-FMA support, by providing appropriate benchmarks.
@e271828-
Nice work. This will be very relevant for https://github.com/saharNooby/rwkv.cpp as well.
Can you clarify - does the RWKV inference benefit from this change, and if so, can you provide some rough numbers?
@jon-chuang
Are we confident that the computation is correct? Maybe we should add an accuracy test comparing the results against the default matrix multiplication
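A small sketch of the kind of check that could back such a test (max_rel_err is a hypothetical helper, not part of llama.cpp):

#include <math.h>
#include <stddef.h>

// Compare the tall-and-skinny kernel's output against the default matmul's
// output element-wise and return the worst relative error found.
static float max_rel_err(const float * ref, const float * out, size_t n) {
    float worst = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        const float denom = fabsf(ref[i]) > 1e-8f ? fabsf(ref[i]) : 1e-8f;
        const float err   = fabsf(ref[i] - out[i]) / denom;
        if (err > worst) {
            worst = err;
        }
    }
    return worst; // a test would assert this stays below a tolerance, e.g. 1e-4f
}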
No updates on this?
Hello, I've been on holiday. I wrote a test; there are some bugs, so I'm fixing them.
Apologies, I'm no longer motivated to fix this. Anyone who is interested, please take a look and continue.