
Q4_0 scale selection using RMSE

sw opened this issue • 7 comments

This combines some ideas from PR #729 and issue #397 to select a scale factor for Q4_0 with low RMS error.

In order to KISS, I simply made a table of 8 hard-coded values, after analysing the optimum values in steps of 0.1. The result of that analysis is documented in examples/quantize/scale.py and reproduced here:

[figure: scale]
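To make the idea concrete, here is a minimal C sketch of picking a per-block scale from a small candidate table by minimizing the squared error. The divisor values, the function name, and the clip range [-8, 7] are illustrative assumptions, not the PR's actual table or code:

```c
#include <math.h>
#include <stdint.h>

#define QK 32

// Sketch only: try a handful of hard-coded divisors for amax and keep the one
// that gives the lowest sum of squared errors for this block.
static float q4_0_pick_scale(const float * x, int8_t * q_out) {
    // illustrative candidates, not the table from the PR
    static const float divisors[8] = { 7.0f, 7.3f, 7.6f, 7.9f, 8.2f, 8.5f, 8.8f, 9.2f };

    float amax = 0.0f;
    for (int i = 0; i < QK; i++) {
        const float v = fabsf(x[i]);
        if (v > amax) amax = v;
    }

    float best_d   = amax/7.0f;
    float best_err = INFINITY;

    for (int c = 0; c < 8; c++) {
        const float d  = amax/divisors[c];
        const float id = d != 0.0f ? 1.0f/d : 0.0f;

        float  err = 0.0f;
        int8_t q[QK];
        for (int i = 0; i < QK; i++) {
            float qf = roundf(x[i]*id);
            // clip to the signed 4-bit range; nibbles would be stored with a +8 offset
            if (qf >  7.0f) qf =  7.0f;
            if (qf < -8.0f) qf = -8.0f;
            q[i] = (int8_t) qf;
            const float e = x[i] - d*q[i];
            err += e*e;
        }

        if (err < best_err) {
            best_err = err;
            best_d   = d;
            for (int i = 0; i < QK; i++) q_out[i] = q[i];
        }
    }

    return best_d; // per-block scale stored next to the nibbles
}
```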

Error statistics (#728):

q4_0 : rmse 0.00221840, maxerr 0.14257812, 95pct<0.0040, median<0.0018 (master)
q4_0 : rmse 0.00196398, maxerr 0.18200684, 95pct<0.0036, median<0.0016 (#729)
q4_0 : rmse 0.00185915, maxerr 0.14257812, 95pct<0.0034, median<0.0014 (this PR)

quantize.cpp run time on 7B:

80s (master cc9cee8)
135s (this PR, AVX2)
385s (this PR, scalar)

I introduce a minor version number at the very end of the file. This allows us to nudge the user to re-create their files without breaking anything. I had to modify the read loop, as it used to try to read past EOF.
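As a generic illustration of the read-until-EOF pattern (not the actual llama.cpp loader code), an optional trailing version byte can be probed like this, with a missing byte treated as version 0:

```c
#include <stdio.h>
#include <stdint.h>

// Sketch: older files simply end before the marker, so we probe for EOF
// instead of reading past it, and fall back to version 0.
static uint32_t read_minor_version(FILE * f) {
    uint8_t minor = 0;
    const long pos = ftell(f);
    if (fread(&minor, 1, 1, f) != 1) {
        // EOF: old file without the trailing version byte
        if (pos >= 0) fseek(f, pos, SEEK_SET);
        return 0;
    }
    return minor;
}
```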

I removed the test of ggml_quantize_q4_0, which I originally wrote and which was quite minimal. This is admittedly lazy, but I couldn't think of a good test right away. Maybe we just need to provide a model file that's not too big for the CI machines and check for equivalence after quantization.

The alignment macros are a bit of a hack. I don't have Windows here to test on and don't want to keep hitting the CI with trial-and-error. Is there a clean cross-platform way to do it? And come to think of it, why are the input floats not aligned? (edit: probably because llama_model_quantize_internal doesn't use mmap, let me see if we can force the alignment of the buffers).

Currently running perplexity, but it's taking 12 hours here so I may not wait for that.

This does not obsolete #729, as my PR only changes the method for the model generation. We might still use @unbounded's work and set the scale to -8 instead of +7 for the other uses of the quantization function.

sw · Apr 07 '23 14:04

Very interesting analysis and data 😄 Curious to see the new perplexity values

Btw, I've been thinking a little bit about how to determine the scale factor to minimize the RMS and I am fairly certain that there is a straightforward way to compute the optimum value without search. I don't have the formula yet - just a strong intuition lol

ggerganov · Apr 07 '23 17:04

I am fairly certain that there is a straightforward way to compute the optimum value without search.

I'd love to see that, but while the error function seems to be ~~smooth~~ continuous and piecewise differentiable (at the point where the rounding flips, abs(error) stays the same), it doesn't seem evident to me.

Here's the plot of the very first block in the 7B model (first input value = 9.888411e-05):

[figure: first-block]

(This is just the sum of squared errors, I didn't bother with the square root and scaling by QK.)
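For context, here is my own least-squares reading of that error curve, not something stated in the thread, assuming the full signed nibble range [-8, 7]: with the rounding assignment held fixed the error is quadratic in the scale and has a closed-form minimum, but the assignment flips at discrete breakpoints, which is what produces the kinks.

$$
E(d) = \sum_{i=1}^{QK} \bigl(x_i - d\,q_i\bigr)^2, \qquad
q_i = \operatorname{clip}\!\left(\operatorname{round}\!\left(\tfrac{x_i}{d}\right), -8, 7\right)
$$

Holding the $q_i$ fixed, $\partial E/\partial d = 0$ gives

$$
d^{*} = \frac{\sum_i x_i\, q_i}{\sum_i q_i^{2}},
$$

so an exact search only has to check one candidate $d^{*}$ per rounding interval rather than a continuous range of scales.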

sw · Apr 07 '23 18:04

So as mentioned in https://github.com/ggerganov/llama.cpp/issues/397#issuecomment-1500718744 I believe I have an RMSE-optimal but very slow implementation of the scaling search...

And your implementation gets extremely close!

You posted:

q4_0 : rmse 0.00185915, maxerr 0.14257812, 95pct<0.0034, median<0.0014 (this PR)

"optimal":

q4_0 : rmse 0.00184913, maxerr 0.14257812, 95pct<0.0034, median<0.0014

That's probably about as good as we can hope for.

Full output for verification - if you get a lower RMSE for any layer I have a bug :)

quantize-stats output
note: source model is f16
testing 226 layers with max size 131072000
q4_0::layers.0.attention.wk.weight                : rmse 0.00292301, maxerr 0.07012939, 95pct
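For readers curious what an exact search can look like, below is a hypothetical C sketch, not unbounded's actual code: it enumerates the inverse-scale breakpoints where a rounding flips, sorts them, and evaluates the closed-form least-squares scale inside each interval. The sort is the expensive part, and the [-8, 7] clip range is again an assumption.

```c
#include <math.h>
#include <stdlib.h>

#define QK 32

static int cmp_float(const void * a, const void * b) {
    const float fa = *(const float *)a, fb = *(const float *)b;
    return (fa > fb) - (fa < fb);
}

// sum of squared errors for a given scale d, re-quantizing the block
static float sse_for_scale(const float * x, float d) {
    const float id = d != 0.0f ? 1.0f/d : 0.0f;
    float err = 0.0f;
    for (int i = 0; i < QK; i++) {
        float q = roundf(x[i]*id);
        if (q >  7.0f) q =  7.0f;
        if (q < -8.0f) q = -8.0f;
        const float e = x[i] - d*q;
        err += e*e;
    }
    return err;
}

static float optimal_scale(const float * x) {
    // breakpoints of the inverse scale id where round(x[i]*id) flips: id = (k+0.5)/|x[i]|
    float bp[QK*8];
    int nbp = 0;
    for (int i = 0; i < QK; i++) {
        const float ax = fabsf(x[i]);
        if (ax == 0.0f) continue;
        for (int k = 0; k < 8; k++) {
            bp[nbp++] = (k + 0.5f)/ax;
        }
    }
    qsort(bp, nbp, sizeof(float), cmp_float); // dominates the run time

    float best_d = 0.0f, best_err = INFINITY;
    for (int j = 0; j + 1 < nbp; j++) {
        const float id_mid = 0.5f*(bp[j] + bp[j+1]); // any id inside the interval
        if (id_mid <= 0.0f) continue;

        // with the assignment fixed, the least-squares scale is sum(x*q)/sum(q*q)
        float sxq = 0.0f, sqq = 0.0f;
        for (int i = 0; i < QK; i++) {
            float q = roundf(x[i]*id_mid);
            if (q >  7.0f) q =  7.0f;
            if (q < -8.0f) q = -8.0f;
            sxq += x[i]*q;
            sqq += q*q;
        }
        if (sqq == 0.0f) continue;

        const float d   = sxq/sqq;
        const float err = sse_for_scale(x, d);
        if (err < best_err) { best_err = err; best_d = d; }
    }
    return best_d; // intervals beyond the extreme breakpoints are skipped for brevity
}
```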

unbounded · Apr 07 '23 23:04

Now that the statistics tool has landed in master, I've rebased my branch and updated the tool to accept an --implementation argument instead of --reference.

@unbounded : I will definitely have a look at your approach, thanks a lot.

edit: pulled in your commit and updated the stats tool. It is indeed slow ;-). 80% of time is spent in qsort, so AVX2-ifying isn't going to help a lot.

quantize.cpp still uses my simple method.

sw · Apr 08 '23 08:04

Initial perplexity test. q4_0, MINOR 0, w/ BLAS (OpenBLAS):

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks (335687 tokens, 512 n_ctx)
74.45 seconds per pass - ETA 13.55 hours
[1]4.3797,[2]4.9554,^C

q4_0, MINOR 0, w/o BLAS:

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks (335687 tokens, 512 n_ctx)
26.22 seconds per pass - ETA 4.77 hours
[1]4.5741,[2]5.0601,^C

Commit 678e1389701109842b39ea1c3415ef85e212836b (shown as 7B_q4_0_1 in plot below) q4_0, MINOR 1, w/o BLAS:

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks (335687 tokens, 512 n_ctx)
26.93 seconds per pass - ETA 4.90 hours
[1]4.7137,[2]5.2331,
...

Final score [655]6.5655. 7B_q4_0_1.txt

[figure: perp_vs_model]

And closer: [figure: perp_vs_model, zoomed in]

ivanstepanovftw · Apr 09 '23 17:04

Leaving another comment to let you know the final perplexity: [655]6.5655. See the perplexity discussion for previous results.

ivanstepanovftw · Apr 10 '23 00:04

@ivanstepanovftw Thanks for your effort. The first few values match mine exactly, so I'll trust your results. It's good to see at least a small improvement.

But as I said in #397, maybe the RMSE of the quantization is a distraction. This method leads to a mean scale value of 8.092, so there will be clipping of the maximum value. I would like to see us experiment with #729 but with more (larger) scale values instead of just 7 or 8.
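As a side note on quantifying that clipping, a small per-block helper along these lines could be used; this is hypothetical and not part of the PR:

```c
#include <math.h>

#define QK 32

// Count how many elements of a block get clipped to the signed 4-bit range
// [-8, 7] (an assumed range) when quantized with the chosen per-block scale d.
static int count_clipped(const float * x, float d) {
    const float id = d != 0.0f ? 1.0f/d : 0.0f;
    int clipped = 0;
    for (int i = 0; i < QK; i++) {
        const float q = roundf(x[i]*id);
        if (q > 7.0f || q < -8.0f) clipped++;
    }
    return clipped;
}
```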

sw · Apr 10 '23 14:04