llama.cpp
Add AVX2 implementation of dequantize_row_q4_0
I couldn't notice a big overall performance improvement; more testing is necessary.
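For context, the general shape of an AVX2 Q4_0 dequantization is sketched below. This is an illustrative sketch rather than the exact code in this PR, and it assumes the Q4_0 layout of the time: QK = 32 quants per block, each block holding a float scale d followed by 16 bytes of packed nibbles, with the two nibbles of a byte storing adjacent quants.

#include <immintrin.h>
#include <stdint.h>

#define QK 32                     // quants per Q4_0 block (layout at the time of this PR)

typedef struct {
    float   d;                    // block scale
    uint8_t qs[QK / 2];           // 32 4-bit quants, two per byte (low nibble first)
} block_q4_0;

// Expand 16 packed bytes into 32 bytes, keeping the interleaved quant order:
// output byte 2j is the low nibble of input byte j, byte 2j+1 is its high nibble.
static inline __m256i bytes_from_nibbles(const uint8_t * p) {
    const __m128i tmp   = _mm_loadu_si128((const __m128i *) p);
    const __m256i bytes = _mm256_cvtepu8_epi16(tmp);                              // byte j -> 16-bit lane j
    const __m256i mask  = _mm256_set1_epi8(0x0F);
    const __m256i low   = _mm256_and_si256(mask, bytes);                          // low nibbles stay in place
    const __m256i high  = _mm256_slli_epi16(_mm256_andnot_si256(mask, bytes), 4); // high nibbles move up one byte
    return _mm256_or_si256(low, high);
}

static void dequantize_row_q4_0_avx2(const block_q4_0 * x, float * y, int k) {
    const int nb = k / QK;

    for (int i = 0; i < nb; i++) {
        const __m256 d_v = _mm256_broadcast_ss(&x[i].d);

        // 32 x 4-bit -> 32 x 8-bit, then shift the range from [0, 15] to [-8, 7]
        __m256i vx8 = bytes_from_nibbles(x[i].qs);
        vx8 = _mm256_sub_epi8(vx8, _mm256_set1_epi8(8));

        // Sign-extend to 16-bit, then to 32-bit, convert to float, scale, and store.
        const __m256i vx16_lo = _mm256_cvtepi8_epi16(_mm256_extracti128_si256(vx8, 0));
        const __m256i vx16_hi = _mm256_cvtepi8_epi16(_mm256_extracti128_si256(vx8, 1));

        const __m128i halves[4] = {
            _mm256_extracti128_si256(vx16_lo, 0), _mm256_extracti128_si256(vx16_lo, 1),
            _mm256_extracti128_si256(vx16_hi, 0), _mm256_extracti128_si256(vx16_hi, 1),
        };
        for (int j = 0; j < 4; j++) {
            const __m256 v = _mm256_cvtepi32_ps(_mm256_cvtepi16_epi32(halves[j]));
            _mm256_storeu_ps(y + i*QK + 8*j, _mm256_mul_ps(v, d_v));
        }
    }
}

The scalar path does the same unpack, offset, and scale work one element at a time, which is what the benchmark below compares against.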
A quick performance test shows significant improvement in the function itself (with k=4096):
Running ./test-dq
Run on (16 X 3600 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 256 KiB (x8)
L3 Unified 16384 KiB (x1)
Load Average: 1.56, 1.26, 1.46
---------------------------------------------------------------------------
Benchmark                                   Time             CPU   Iterations
---------------------------------------------------------------------------
BM_dequantize_row_q4_0                  10351 ns        10351 ns        66698
BM_dequantize_row_q4_0_avx2              1384 ns         1384 ns       509491
The first chunks of the perplexity computation show the same values. I didn't run the full test but I have no reason to believe that it would produce different values.
[1]4.5690,[2]5.2058,[3]6.0526,
@ggerganov we need some sort of benchmarking suite for ggml.
@slaren how complex is the ./test-dq? Can you provide the code? Does it require the model files, or is it standalone? (It should be easy to create synthetic data.)
It's a standalone test using the Google Benchmark library. Here is the code: https://gist.github.com/slaren/ba732ed08abd0ba148129eab3335dfb7
To do that, I split the AVX2 and scalar implementations into dequantize_row_q4_0_avx2 and dequantize_row_q4_0 beforehand.
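For illustration only (the gist above has the actual code), a microbenchmark with the Google Benchmark library could look roughly like this; the function signatures, the synthetic data setup, and the 20-byte block size are assumptions based on the Q4_0 layout described earlier:

#include <benchmark/benchmark.h>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <vector>

// Assumed to be provided by the split ggml implementations mentioned above.
extern "C" {
void dequantize_row_q4_0(const void * x, float * y, int k);
void dequantize_row_q4_0_avx2(const void * x, float * y, int k);
}

static const int k = 4096; // row length used in the results above

// Synthetic Q4_0 row: k/32 blocks of 20 bytes each (4-byte float scale + 16 packed nibbles).
static std::vector<uint8_t> make_input() {
    const int nb = k / 32;
    std::vector<uint8_t> buf(nb * 20);
    const float d = 1.0f;
    for (int i = 0; i < nb; i++) {
        std::memcpy(buf.data() + i * 20, &d, sizeof(d));             // block scale
        for (int j = 0; j < 16; j++) {
            buf[i * 20 + 4 + j] = static_cast<uint8_t>(std::rand()); // packed nibbles
        }
    }
    return buf;
}

static void BM_dequantize_row_q4_0(benchmark::State & state) {
    const std::vector<uint8_t> x = make_input();
    std::vector<float> y(k);
    for (auto _ : state) {
        dequantize_row_q4_0(x.data(), y.data(), k);
        benchmark::DoNotOptimize(y.data());
    }
}
BENCHMARK(BM_dequantize_row_q4_0);

static void BM_dequantize_row_q4_0_avx2(benchmark::State & state) {
    const std::vector<uint8_t> x = make_input();
    std::vector<float> y(k);
    for (auto _ : state) {
        dequantize_row_q4_0_avx2(x.data(), y.data(), k);
        benchmark::DoNotOptimize(y.data());
    }
}
BENCHMARK(BM_dequantize_row_q4_0_avx2);

BENCHMARK_MAIN();

Linked against Google Benchmark and the two split functions, this produces per-function timings in the format shown above.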
The dequantize functions are only used if you link against BLAS and use -b 32 or bigger:
make clean
LLAMA_OPENBLAS=1 make
Otherwise, they will never be called.
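For example, assuming the main example's -b/--batch_size flag, and with a placeholder model path and prompt, a run that should take the BLAS path for prompt processing would look like:

./main -m ./models/7B/ggml-model-q4_0.bin -b 512 -p "a prompt long enough to be processed in batches of 32+ tokens"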
@ggerganov that's not what I am seeing; here is a stack trace, for example:
#2 0x00005555555660d4 in dequantize_row_q4_0 (x=0x7ffedc43a0d0, y=0x7ffe585b81a0, k=k@entry=4096) at ggml.c:767
#3 0x000055555556b1e7 in ggml_compute_forward_get_rows_q4_0 (params=<optimized out>, params=<optimized out>,
dst=0x7ffe585b8100, src1=0x7ffe585b8030, src0=0x7ffedc43a030) at ggml.c:7249
#4 ggml_compute_forward_get_rows (dst=0x7ffe585b8100, src1=0x7ffe585b8030, src0=0x7ffedc43a030, params=<optimized out>)
at ggml.c:7345
#5 ggml_compute_forward (params=<optimized out>, tensor=0x7ffe585b8100) at ggml.c:9027
#6 0x0000555555571435 in ggml_graph_compute (ctx=<optimized out>, cgraph=0x7ffffffe4d90) at ggml.c:9911
#7 0x00005555555793f5 in llama_eval_internal (lctx=..., tokens=<optimized out>, n_tokens=4, n_past=0, n_threads=<optimized out>)
at llama.cpp:822
#8 0x000055555557976d in llama_eval (ctx=<optimized out>, tokens=<optimized out>, n_tokens=<optimized out>,
n_past=<optimized out>, n_threads=<optimized out>) at llama.cpp:1493
#9 0x000055555555c396 in main (argc=<optimized out>, argv=<optimized out>) at main.cpp:224
Ah yes, there is one exception: the ggml_get_rows at the start of the inference. It is a very lightweight call, so I don't expect it to take a measurable amount of time.
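For reference, the call in question is the token-embedding lookup at the top of llama_eval_internal; the snippet below is a paraphrased excerpt (ctx0, tokens, N, and model come from the surrounding function), not a verbatim quote:

// The prompt tokens are copied into an I32 tensor, and ggml_get_rows selects
// the corresponding rows of the quantized tok_embeddings matrix. Evaluating
// that op dequantizes each selected row, which is the dequantize_row_q4_0
// frame visible in the stack trace above, even without BLAS.
struct ggml_tensor * embd = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, N);
memcpy(embd->data, tokens, N*ggml_element_size(embd));
struct ggml_tensor * inpL = ggml_get_rows(ctx0, model.tok_embeddings, embd);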
Ah, I see. I am running some tests with BLAS now and will report back when I have some results. Unfortunately, it seems to be much slower; I probably need to find a better BLAS library than the libopenblas-dev package from Ubuntu.
@ggerganov When building with BLAS, with -b 32 and a long enough prompt, I only get garbage generation (not just bad, but random tokens). This happens on master too. Is it possible that BLAS support is broken at the moment?
Yes, it is broken. Weird...
OK, BLAS has been fixed, and for large prompts and batch sizes (> 256) there is a significant benefit to enabling BLAS. Tested on M1 so far, but I expect the same results on x86.
I am seeing a very significant improvement on x86 as well; for instance, the perplexity computation went from ~8 hours to ~5 hours.