
Add AVX2 implementation of dequantize_row_q4_0


I couldn't notice a big performance improvement; more testing is necessary.
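
For reference, here is a rough sketch of what an AVX2 dequantizer for q4_0 can look like. This is not the actual diff from this PR; it assumes a simplified, hypothetical block layout (one float scale followed by QK/2 packed nibbles, with element 2j in the low nibble of byte j) purely for illustration:

#include <immintrin.h>
#include <stdint.h>

#define QK 32  // values per q4_0 block (assumption, matching ggml at the time)

// Hypothetical block layout for illustration only: a float scale followed by
// 16 bytes holding 32 packed 4-bit quants (two per byte).
typedef struct {
    float   d;
    uint8_t qs[QK / 2];
} block_q4_0;

static void dequantize_row_q4_0_avx2(const block_q4_0 * x, float * y, int k) {
    const int nb = k / QK;

    for (int i = 0; i < nb; i++) {
        const __m256 d = _mm256_set1_ps(x[i].d);

        // 16 bytes hold 32 packed 4-bit quants
        const __m128i q = _mm_loadu_si128((const __m128i *) x[i].qs);

        // split into low/high nibbles and interleave so the byte order matches
        // the assumed layout (element 2j in the low nibble, 2j+1 in the high nibble)
        const __m128i lo = _mm_and_si128(q, _mm_set1_epi8(0x0F));
        const __m128i hi = _mm_and_si128(_mm_srli_epi16(q, 4), _mm_set1_epi8(0x0F));

        __m128i q0 = _mm_unpacklo_epi8(lo, hi); // elements  0..15
        __m128i q1 = _mm_unpackhi_epi8(lo, hi); // elements 16..31

        // stored quants are in [0, 15]; the dequantized value is (quant - 8) * d
        q0 = _mm_sub_epi8(q0, _mm_set1_epi8(8));
        q1 = _mm_sub_epi8(q1, _mm_set1_epi8(8));

        // sign-extend 8 bytes at a time to int32, convert to float, scale, store
        for (int j = 0; j < 2; j++) {
            const __m128i qj = j == 0 ? q0 : q1;
            const __m256i i0 = _mm256_cvtepi8_epi32(qj);
            const __m256i i1 = _mm256_cvtepi8_epi32(_mm_srli_si128(qj, 8));
            _mm256_storeu_ps(y + i*QK + j*16 + 0, _mm256_mul_ps(_mm256_cvtepi32_ps(i0), d));
            _mm256_storeu_ps(y + i*QK + j*16 + 8, _mm256_mul_ps(_mm256_cvtepi32_ps(i1), d));
        }
    }
}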

slaren avatar Mar 24 '23 16:03 slaren

A quick performance test shows a significant improvement in the function itself (with k=4096):

Running ./test-dq
Run on (16 X 3600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 1.56, 1.26, 1.46
----------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations
----------------------------------------------------------------------
BM_dequantize_row_q4_0           10351 ns        10351 ns        66698
BM_dequantize_row_q4_0_avx2       1384 ns         1384 ns       509491

slaren avatar Mar 24 '23 17:03 slaren

The first chunks of the perplexity computation show the same values: [1]4.5690,[2]5.2058,[3]6.0526. I didn't run the full test, but I have no reason to believe that it would produce different values.

slaren avatar Mar 24 '23 17:03 slaren

@ggerganov we need some sort of benchmarking suite for ggml.

@slaren how complex is ./test-dq? Can you provide the code? Does it require the model files, or is it standalone? (It should be easy to create synthetic data.)

Green-Sky avatar Mar 24 '23 17:03 Green-Sky

It's a standalone test using the Google Benchmark library. Here is the code: https://gist.github.com/slaren/ba732ed08abd0ba148129eab3335dfb7. To do this, I split the AVX2 and scalar implementations into dequantize_row_q4_0_avx2 and dequantize_row_q4_0 beforehand.
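
The gist above has the actual code; for illustration, a harness along these lines reproduces the setup. The declared signatures and the synthetic input are assumptions:

#include <benchmark/benchmark.h>
#include <cstdint>
#include <vector>

// Assumed signatures for the two implementations under test (split out as
// described above); the real code lives in ggml.c.
extern "C" {
void dequantize_row_q4_0     (const void * x, float * y, int k);
void dequantize_row_q4_0_avx2(const void * x, float * y, int k);
}

static constexpr int kK  = 4096; // row length used in the numbers above
static constexpr int kQK = 32;   // values per q4_0 block
static constexpr int kRowBytes = int((kK / kQK) * (sizeof(float) + kQK / 2));

static void BM_dequantize_row_q4_0(benchmark::State & state) {
    std::vector<uint8_t> x(kRowBytes, 0x55); // synthetic quantized row
    std::vector<float>   y(kK);
    for (auto _ : state) {
        dequantize_row_q4_0(x.data(), y.data(), kK);
        benchmark::DoNotOptimize(y.data());
    }
}
BENCHMARK(BM_dequantize_row_q4_0);

static void BM_dequantize_row_q4_0_avx2(benchmark::State & state) {
    std::vector<uint8_t> x(kRowBytes, 0x55);
    std::vector<float>   y(kK);
    for (auto _ : state) {
        dequantize_row_q4_0_avx2(x.data(), y.data(), kK);
        benchmark::DoNotOptimize(y.data());
    }
}
BENCHMARK(BM_dequantize_row_q4_0_avx2);

BENCHMARK_MAIN();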

slaren avatar Mar 24 '23 17:03 slaren

> The first chunks of the perplexity computation show the same values. I didn't run the full test but I have no reason to believe that it would produce different values. [1]4.5690,[2]5.2058,[3]6.0526,

The dequantize functions are only used if you link against BLAS and use -b 32 or bigger:

make clean
LLAMA_OPENBLAS=1 make

Otherwise, they will never be called.
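
For example, the dequantize path would then be exercised with an invocation roughly like this (model path and prompt are placeholders):

./main -m ./models/7B/ggml-model-q4_0.bin -b 32 -p "a sufficiently long prompt goes here"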

ggerganov avatar Mar 24 '23 20:03 ggerganov

@ggerganov that's not what I am seeing; here is a stack trace, for example:

#2  0x00005555555660d4 in dequantize_row_q4_0 (x=0x7ffedc43a0d0, y=0x7ffe585b81a0, k=k@entry=4096) at ggml.c:767
#3  0x000055555556b1e7 in ggml_compute_forward_get_rows_q4_0 (params=<optimized out>, params=<optimized out>,
    dst=0x7ffe585b8100, src1=0x7ffe585b8030, src0=0x7ffedc43a030) at ggml.c:7249
#4  ggml_compute_forward_get_rows (dst=0x7ffe585b8100, src1=0x7ffe585b8030, src0=0x7ffedc43a030, params=<optimized out>)
    at ggml.c:7345
#5  ggml_compute_forward (params=<optimized out>, tensor=0x7ffe585b8100) at ggml.c:9027
#6  0x0000555555571435 in ggml_graph_compute (ctx=<optimized out>, cgraph=0x7ffffffe4d90) at ggml.c:9911
#7  0x00005555555793f5 in llama_eval_internal (lctx=..., tokens=<optimized out>, n_tokens=4, n_past=0, n_threads=<optimized out>)
    at llama.cpp:822
#8  0x000055555557976d in llama_eval (ctx=<optimized out>, tokens=<optimized out>, n_tokens=<optimized out>,
    n_past=<optimized out>, n_threads=<optimized out>) at llama.cpp:1493
#9  0x000055555555c396 in main (argc=<optimized out>, argv=<optimized out>) at main.cpp:224

slaren avatar Mar 24 '23 20:03 slaren

Ah yes, there is one exception: the ggml_get_rows call at the start of the inference. It is a very lightweight call, so I don't expect it to take a measurable amount of time.

ggerganov avatar Mar 24 '23 20:03 ggerganov

Ah, I see. I am running some tests with BLAS now and will report back when I have some results. Unfortunately, it seems to be much slower; I probably need to find a better BLAS library than the libopenblas-dev package from Ubuntu.

slaren avatar Mar 24 '23 20:03 slaren

@ggerganov When building with BLAS and using -b 32 with a long enough prompt, I only get garbage generation (not just bad output, but random tokens). This happens on master too. Is it possible that BLAS support is broken at the moment?

slaren avatar Mar 24 '23 21:03 slaren

Yes, it is broken. Weird...

ggerganov avatar Mar 24 '23 21:03 ggerganov

OK, BLAS has been fixed, and for large prompts and batch sizes (> 256) there is a significant benefit to enabling it. Tested on M1 so far, but I expect the same results on x86.

ggerganov avatar Mar 25 '23 14:03 ggerganov

I am seeing a very significant improvement on x86 as well; for instance, the perplexity computation went from ~8 hours to ~5 hours.

slaren avatar Mar 25 '23 16:03 slaren