
Add AVX2 implementation of dequantize_row_q4_0


I couldn't notice a big performance improvement; more testing is necessary.
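
For reference, here is a rough sketch of what an AVX2 dequantizer for q4_0 can look like. This is not the actual diff from this PR; it assumes a simplified, hypothetical block layout (one float scale followed by QK/2 packed nibbles, with element 2j in the low nibble of byte j) purely for illustration:

#include <immintrin.h>
#include <stdint.h>

#define QK 32  // values per q4_0 block (assumption, matching ggml at the time)

// Hypothetical block layout for illustration only: a float scale followed by
// 16 bytes holding 32 packed 4-bit quants (two per byte).
typedef struct {
    float   d;
    uint8_t qs[QK / 2];
} block_q4_0;

static void dequantize_row_q4_0_avx2(const block_q4_0 * x, float * y, int k) {
    const int nb = k / QK;

    for (int i = 0; i < nb; i++) {
        const __m256 d = _mm256_set1_ps(x[i].d);

        // 16 bytes hold 32 packed 4-bit quants
        const __m128i q = _mm_loadu_si128((const __m128i *) x[i].qs);

        // split into low/high nibbles and interleave so the byte order matches
        // the assumed layout (element 2j in the low nibble, 2j+1 in the high nibble)
        const __m128i lo = _mm_and_si128(q, _mm_set1_epi8(0x0F));
        const __m128i hi = _mm_and_si128(_mm_srli_epi16(q, 4), _mm_set1_epi8(0x0F));

        __m128i q0 = _mm_unpacklo_epi8(lo, hi); // elements  0..15
        __m128i q1 = _mm_unpackhi_epi8(lo, hi); // elements 16..31

        // stored quants are in [0, 15]; the dequantized value is (quant - 8) * d
        q0 = _mm_sub_epi8(q0, _mm_set1_epi8(8));
        q1 = _mm_sub_epi8(q1, _mm_set1_epi8(8));

        // sign-extend 8 bytes at a time to int32, convert to float, scale, store
        for (int j = 0; j < 2; j++) {
            const __m128i qj = j == 0 ? q0 : q1;
            const __m256i i0 = _mm256_cvtepi8_epi32(qj);
            const __m256i i1 = _mm256_cvtepi8_epi32(_mm_srli_si128(qj, 8));
            _mm256_storeu_ps(y + i*QK + j*16 + 0, _mm256_mul_ps(_mm256_cvtepi32_ps(i0), d));
            _mm256_storeu_ps(y + i*QK + j*16 + 8, _mm256_mul_ps(_mm256_cvtepi32_ps(i1), d));
        }
    }
}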

slaren avatar Mar 24 '23 16:03 slaren

A quick performance test shows a significant improvement in the function itself (with k=4096):

Running ./test-dq
Run on (16 X 3600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 1.56, 1.26, 1.46
----------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations
----------------------------------------------------------------------
BM_dequantize_row_q4_0           10351 ns        10351 ns        66698
BM_dequantize_row_q4_0_avx2       1384 ns         1384 ns       509491

slaren avatar Mar 24 '23 17:03 slaren

The first chunks of the perplexity computation show the same values: [1]4.5690,[2]5.2058,[3]6.0526. I didn't run the full test, but I have no reason to believe that it would produce different values.

slaren avatar Mar 24 '23 17:03 slaren

@ggerganov we need some sort of benchmarking suite for ggml.

@slaren how complex is ./test-dq? Can you provide the code? Does it require the model files, or is it standalone? (It should be easy to create synthetic data.)

Green-Sky avatar Mar 24 '23 17:03 Green-Sky

It's a standalone test using the Google Benchmark library. Here is the code: https://gist.github.com/slaren/ba732ed08abd0ba148129eab3335dfb7. To do this, I split the AVX2 and scalar implementations into dequantize_row_q4_0_avx2 and dequantize_row_q4_0 beforehand.
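
The gist above has the actual code; for illustration, a harness along these lines reproduces the setup. The declared signatures and the synthetic input are assumptions:

#include <benchmark/benchmark.h>
#include <cstdint>
#include <vector>

// Assumed signatures for the two implementations under test (split out as
// described above); the real code lives in ggml.c.
extern "C" {
void dequantize_row_q4_0     (const void * x, float * y, int k);
void dequantize_row_q4_0_avx2(const void * x, float * y, int k);
}

static constexpr int kK  = 4096; // row length used in the numbers above
static constexpr int kQK = 32;   // values per q4_0 block
static constexpr int kRowBytes = int((kK / kQK) * (sizeof(float) + kQK / 2));

static void BM_dequantize_row_q4_0(benchmark::State & state) {
    std::vector<uint8_t> x(kRowBytes, 0x55); // synthetic quantized row
    std::vector<float>   y(kK);
    for (auto _ : state) {
        dequantize_row_q4_0(x.data(), y.data(), kK);
        benchmark::DoNotOptimize(y.data());
    }
}
BENCHMARK(BM_dequantize_row_q4_0);

static void BM_dequantize_row_q4_0_avx2(benchmark::State & state) {
    std::vector<uint8_t> x(kRowBytes, 0x55);
    std::vector<float>   y(kK);
    for (auto _ : state) {
        dequantize_row_q4_0_avx2(x.data(), y.data(), kK);
        benchmark::DoNotOptimize(y.data());
    }
}
BENCHMARK(BM_dequantize_row_q4_0_avx2);

BENCHMARK_MAIN();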

slaren avatar Mar 24 '23 17:03 slaren

> The first chunks of the perplexity computation show the same values. I didn't run the full test but I have no reason to believe that it would produce different values. [1]4.5690,[2]5.2058,[3]6.0526,

The dequantize functions are only used if you link against BLAS and use -b 32 or bigger:

make clean
LLAMA_OPENBLAS=1 make

Otherwise, they will never be called.
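
For example, the dequantize path would then be exercised with an invocation roughly like this (model path and prompt are placeholders):

./main -m ./models/7B/ggml-model-q4_0.bin -b 32 -p "a sufficiently long prompt goes here"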

ggerganov avatar Mar 24 '23 20:03 ggerganov

@ggerganov that's not what I am seeing; here is a stack trace, for example:

#2  0x00005555555660d4 in dequantize_row_q4_0 (x=0x7ffedc43a0d0, y=0x7ffe585b81a0, k=k@entry=4096) at ggml.c:767
#3  0x000055555556b1e7 in ggml_compute_forward_get_rows_q4_0 (params=<optimized out>, params=<optimized out>,
    dst=0x7ffe585b8100, src1=0x7ffe585b8030, src0=0x7ffedc43a030) at ggml.c:7249
#4  ggml_compute_forward_get_rows (dst=0x7ffe585b8100, src1=0x7ffe585b8030, src0=0x7ffedc43a030, params=<optimized out>)
    at ggml.c:7345
#5  ggml_compute_forward (params=<optimized out>, tensor=0x7ffe585b8100) at ggml.c:9027
#6  0x0000555555571435 in ggml_graph_compute (ctx=<optimized out>, cgraph=0x7ffffffe4d90) at ggml.c:9911
#7  0x00005555555793f5 in llama_eval_internal (lctx=..., tokens=<optimized out>, n_tokens=4, n_past=0, n_threads=<optimized out>)
    at llama.cpp:822
#8  0x000055555557976d in llama_eval (ctx=<optimized out>, tokens=<optimized out>, n_tokens=<optimized out>,
    n_past=<optimized out>, n_threads=<optimized out>) at llama.cpp:1493
#9  0x000055555555c396 in main (argc=<optimized out>, argv=<optimized out>) at main.cpp:224

slaren avatar Mar 24 '23 20:03 slaren

Ah yes, there is one exception: the ggml_get_rows call at the start of the inference. It is a very lightweight call, so I don't expect it to take a measurable amount of time.

ggerganov avatar Mar 24 '23 20:03 ggerganov

Ah, I see. I am running some tests with BLAS now and will report back when I have some results. Unfortunately, it seems to be much slower; I probably need to find a better BLAS library than the libopenblas-dev package from Ubuntu.

slaren avatar Mar 24 '23 20:03 slaren

@ggerganov When building with BLAS and using -b 32 with a long enough prompt, I only get garbage generation (not just bad output, but random tokens). This happens on master too. Is it possible that BLAS support is broken at the moment?

slaren avatar Mar 24 '23 21:03 slaren

Yes, it is broken. Weird...

ggerganov avatar Mar 24 '23 21:03 ggerganov

OK, BLAS has been fixed, and for large prompts and batch sizes (> 256) there is a significant benefit to enabling it. Tested on M1 so far, but I expect the same results on x86.

ggerganov avatar Mar 25 '23 14:03 ggerganov

I am seeing a very significant improvement on x86 as well; for instance, the perplexity computation went from ~8 hours to ~5 hours.

slaren avatar Mar 25 '23 16:03 slaren