llama.cpp
perf: AVX2/AVX routines for tall and skinny matmul - up to 15X speedup
Benchmark results:
sizey=sizez=N,sizex=K,n_threads=8
K=8,N=8192,AVX2,FLOPS/us=27148.97
K=8,N=8192,AVX,FLOPS/us=15193.96
K=8,N=8192,default,FLOPS/us=1781.05
K=16,N=8192,AVX2,FLOPS/us=20128.26
K=16,N=8192,AVX,FLOPS/us=8224.13
K=16,N=8192,default,FLOPS/us=3540.52
K=32,N=8192,AVX2,FLOPS/us=13127.55
K=32,N=8192,AVX,FLOPS/us=9397.48
K=32,N=8192,default,FLOPS/us=6386.55
K=48,N=8192,AVX2,FLOPS/us=13206.16
K=48,N=8192,AVX,FLOPS/us=5801.21
K=48,N=8192,default,FLOPS/us=8199.44
K=64,N=8192,AVX2,FLOPS/us=10505.51
K=64,N=8192,AVX,FLOPS/us=6353.32
K=64,N=8192,default,FLOPS/us=13024.33
We choose the K cutoff point to be 32 for AVX and 48 for AVX2; above these values the default routine is faster.
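A minimal sketch of how such a cutoff could be applied (the function name and the exact dispatch site are illustrative, not the actual code in this PR):

#include <stdbool.h>

// Illustrative only: decide whether the tall-and-skinny kernel should be used
// for a given shared dimension K, based on the cutoffs measured above.
static bool use_tall_skinny_kernel(int K) {
#if defined(__AVX2__)
    return K <= 48;   // AVX2 kernel wins up to K = 48
#elif defined(__AVX__)
    return K <= 32;   // AVX kernel wins up to K = 32
#else
    return false;     // no specialised kernel; fall back to the default matmul
#endif
}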
Partial fix to: https://github.com/ggerganov/llama.cpp/issues/956
I will stop the investigation here for now; the time taken to apply LoRA is quite tolerable with these changes.
We are still quite far from optimal; for instance, I see 250K FLOPS/us on matmuls with high K (K=10000).
LoRA application informal benchmarks:
K=16
AVX2 - 5141.57 ms
AVX - 9831.28 ms
default - 22611.96 ms
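As a sanity check on the headline claim, the speedups implied by the numbers above work out to roughly $\frac{27148.97}{1781.05} \approx 15.2\times$ for the K=8 kernel (AVX2 vs. default), and $\frac{22611.96}{5141.57} \approx 4.4\times$ for end-to-end LoRA application (AVX2 vs. default).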
This might be a dumb question, but would something like matrix to vector multiplications count as tall and skinny? In other words, something like a 2d tensor with a 1d tensor.
Branch breaks the CMake build:
[ 95%] Building C object examples/benchmark/CMakeFiles/benchmark.dir/benchmark-q4_0-matmult.c.o
In file included from llama.cpp/examples/benchmark/benchmark-q4_0-matmult.c:11:
llama.cpp/./llama.h:77:22: warning: function declaration isn’t a prototype [-Wstrict-prototypes]
77 | LLAMA_API struct llama_context_params llama_context_default_params();
| ^~~~~~~~~~~~~~~~~~~~
llama.cpp/./llama.h:79:5: warning: function declaration isn’t a prototype [-Wstrict-prototypes]
79 | LLAMA_API bool llama_mmap_supported();
| ^~~~~~~~~
llama.cpp/./llama.h:80:5: warning: function declaration isn’t a prototype [-Wstrict-prototypes]
80 | LLAMA_API bool llama_mlock_supported();
| ^~~~~~~~~
llama.cpp/./llama.h:158:5: warning: function declaration isn’t a prototype [-Wstrict-prototypes]
158 | LLAMA_API llama_token llama_token_bos();
| ^~~~~~~~~
llama.cpp/./llama.h:159:5: warning: function declaration isn’t a prototype [-Wstrict-prototypes]
159 | LLAMA_API llama_token llama_token_eos();
| ^~~~~~~~~
llama.cpp/examples/benchmark/benchmark-q4_0-matmult.c:14:10: fatal error: cstring: No such file or directory
14 | #include <cstring>
| ^~~~~~~~~
I'm able to reproduce the performance improvements, impressive work!
Nice work. This will be very relevant for https://github.com/saharNooby/rwkv.cpp as well.
There are merge conflicts and failing CI.
This might be a dumb question, but would something like matrix to vector multiplications count as tall and skinny? In other words, something like a 2d tensor with a 1d tensor.
No, tall and skinny looks like:
 __
|  |      ___________
|  |  X  |__________|
|__|
Matrix to vector is
 ______________      __
|              |    |  |
|              | X  |  |
|______________|    |__|
No, tall and skinny looks like:
Thanks, I don't understand how it's tall and skinny though. :)
I'm not fat, I'm just big boned.
 __               /
|  |       ,____O_____,
|  |   X  =|__________|=
|__|            / \
Thanks, I don't understand how it's tall and skinny though. :)
:rofl:
Well, in our specific context, the matrices in the matmul are $B A^T$, so both $B$ and $A$ are tall and skinny ($A^T$ being short and wide).
A matmul is tall and skinny as long as the dimension along which the matrices are contracted is small compared to the adjacent dimension of one of the matrices, so the specific orientation does not matter.
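To make that concrete with sizes borrowed from the benchmarks above (illustrative shapes, not the exact tensor layout in ggml): with $B \in \mathbb{R}^{8192 \times 16}$ and $A \in \mathbb{R}^{8192 \times 16}$, the product $B A^T \in \mathbb{R}^{8192 \times 8192}$ contracts over $K = 16$, which is tiny compared to $N = 8192$, so both operands are tall and skinny.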
// i can only get it working with sse1 like the following because i have no FMA on my machine
// c_vec = _mm256_fmadd_ps(a, b_vec, c_vec); // FMA: c_vec += a * b_vec
c_vec = _mm_add_ps(c_vec, _mm_mul_ps(b_vec, a));
// i suppose it ruins the effort if one considers avx2
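For reference, a hedged sketch of what an FMA/non-FMA switch could look like with AVX intrinsics; this is not code from the PR, it only illustrates the equivalence the comment above describes:

#include <immintrin.h>

// c += a * b for 8 packed floats: uses the fused instruction when FMA is
// available, otherwise a separate multiply and add (one extra rounding step).
static inline __m256 madd256_ps(__m256 a, __m256 b, __m256 c) {
#if defined(__FMA__)
    return _mm256_fmadd_ps(a, b, c);
#else
    return _mm256_add_ps(c, _mm256_mul_ps(a, b));
#endif
}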
PTAL @slaren @ggerganov. Please refer to informal LoRA benchmarks for e2e validation.
@syl-00110111 Thanks for that info. I will simply not support hardware without FMA. You may expand on this PR (e.g. in a follow-up PR) if you want non-FMA support, by providing appropriate benchmarks.
@e271828-
Nice work. This will be very relevant for https://github.com/saharNooby/rwkv.cpp as well.
Can you clarify - does the RWKV inference benefit from this change, and if so, can you provide some rough numbers?
@jon-chuang
Are we confident that the computation is correct? Maybe we should add an accuracy test comparing the results against the default matrix multiplication
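A small sketch of the kind of check that could back such a test (max_rel_err is a hypothetical helper, not part of llama.cpp):

#include <math.h>
#include <stddef.h>

// Compare the tall-and-skinny kernel's output against the default matmul's
// output element-wise and return the worst relative error found.
static float max_rel_err(const float * ref, const float * out, size_t n) {
    float worst = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        const float denom = fabsf(ref[i]) > 1e-8f ? fabsf(ref[i]) : 1e-8f;
        const float err   = fabsf(ref[i] - out[i]) / denom;
        if (err > worst) {
            worst = err;
        }
    }
    return worst; // a test would assert this stays below a tolerance, e.g. 1e-4f
}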
No updates on this?
Hello, I've been on holiday. I wrote a test; there are some bugs, so I'm fixing them.
Apologies, I'm no longer motivated to fix this. Anyone who is interested, please take a look and continue.