Support requantizing models instead of only allowing quantization from 16/32bit
This pull changes the llama.cpp API a bit for `llama_model_quantize` (but there was a comment saying it wasn't ideal and would probably change) so that it now takes a structure of parameters. In addition to the existing parameters that were passed separately, I added a toggle for quantizing the output tensor and a toggle for allowing quantization of non-f32/f16 models.
`llama_quantize_model_internal` will now dequantize data (if possible and the option is enabled) into the same buffer used for converting f16 to f32.
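For illustration, here is roughly how a caller might use the new structured API. The exact names (`llama_model_quantize_params`, `llama_model_quantize_default_params`, and the two toggle fields) are my reading of the change and could differ slightly from the merged code:

```c
// Rough sketch (names are my reading of the change, not verbatim from the PR):
// requantize an existing q8_0 model to q4_0 while leaving the output tensor alone.
#include "llama.h"

int main(void) {
    struct llama_model_quantize_params params = llama_model_quantize_default_params();
    params.nthread                = 4;                       // worker threads
    params.ftype                  = LLAMA_FTYPE_MOSTLY_Q4_0; // target quantization type
    params.allow_requantize       = true;  // input is already quantized (q8_0)
    params.quantize_output_tensor = false; // i.e. --leave-output-tensor

    // returns 0 on success
    return llama_model_quantize("ggml-model-q8_0.bin", "ggml-model-q4_0.bin", &params);
}
```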
I also updated the `quantize` example to add `--allow-requantize` and `--leave-output-tensor` flags.
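For example, requantizing an existing `q8_0` model down to `q4_0` while keeping the output tensor intact would look something like `./quantize --allow-requantize --leave-output-tensor ggml-model-q8_0.bin ggml-model-q4_0.bin q4_0` (treat the exact argument order as illustrative; the example's usage message is authoritative).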
I did a little experimentation with requantizing and its effect on perplexity (my system is pretty old, so I only ran perplexity for 20 chunks).
The baseline is from a reliable source, so I presume it was quantized from 16bit or 32bit.
edit: Reorganized to make comparison easier.
Requantizing (7b)
In order:
- baseline 16/32bit to `q4_0`
- `q8_0` to `q4_0`
- `q8_0` to `q5_1`
- `q8_0` to `q5_1` to `q4_0`
[1]4.4543,[2]4.9401,[3]5.8275,[4]6.4841,[5]6.5853,[6]6.5085,[7]6.6925,[8]6.8058,[9]7.1425,[10]7.3864,[11]7.5937,[12]7.6130,[13]7.5411,[14]7.6129,[15]7.8702,[16]7.4694,[17]7.3520,[18]7.3030,[19]6.9404,[20]6.9317
[1]4.5288,[2]5.0356,[3]5.9668,[4]6.6169,[5]6.7260,[6]6.6440,[7]6.8366,[8]6.9398,[9]7.2729,[10]7.5059,[11]7.7240,[12]7.7457,[13]7.6775,[14]7.7551,[15]8.0253,[16]7.6084,[17]7.4895,[18]7.4435,[19]7.0696,[20]7.0562
[1]4.2760,[2]4.7250,[3]5.6158,[4]6.2080,[5]6.3382,[6]6.2962,[7]6.4810,[8]6.5727,[9]6.9023,[10]7.1424,[11]7.3396,[12]7.3646,[13]7.2815,[14]7.3281,[15]7.5752,[16]7.2036,[17]7.0927,[18]7.0414,[19]6.6903,[20]6.6769
[1]4.4667,[2]5.0004,[3]5.9462,[4]6.5659,[5]6.6665,[6]6.5793,[7]6.7940,[8]6.8855,[9]7.2347,[10]7.4503,[11]7.6859,[12]7.7178,[13]7.6519,[14]7.7240,[15]7.9988,[16]7.5971,[17]7.4865,[18]7.4467,[19]7.0744,[20]7.0560
Requantizing (7b) with `--leave-output-tensor`:
In order:
- baseline 16/32bit to `q4_0` (model from HF, so output tensor is quantized)
- `q8_0` to `q4_0`
- `q8_0` to `q5_1`
- `q8_0` to `q5_1` to `q4_0`
[1]4.4543,[2]4.9401,[3]5.8275,[4]6.4841,[5]6.5853,[6]6.5085,[7]6.6925,[8]6.8058,[9]7.1425,[10]7.3864,[11]7.5937,[12]7.6130,[13]7.5411,[14]7.6129,[15]7.8702,[16]7.4694,[17]7.3520,[18]7.3030,[19]6.9404,[20]6.9317
[1]4.4459,[2]4.9815,[3]5.9177,[4]6.5378,[5]6.6339,[6]6.5731,[7]6.7753,[8]6.8779,[9]7.2067,[10]7.4348,[11]7.6518,[12]7.6790,[13]7.6165,[14]7.6991,[15]7.9601,[16]7.5499,[17]7.4370,[18]7.3908,[19]7.0176,[20]7.0056
[1]4.2412,[2]4.7153,[3]5.5916,[4]6.1828,[5]6.3129,[6]6.2794,[7]6.4634,[8]6.5546,[9]6.8808,[10]7.1252,[11]7.3224,[12]7.3477,[13]7.2649,[14]7.3098,[15]7.5562,[16]7.1868,[17]7.0772,[18]7.0251,[19]6.6750,[20]6.6615
[1]4.4444,[2]4.9846,[3]5.9130,[4]6.5119,[5]6.6093,[6]6.5362,[7]6.7444,[8]6.8297,[9]7.1733,[10]7.3866,[11]7.6076,[12]7.6364,[13]7.5806,[14]7.6528,[15]7.9191,[16]7.5238,[17]7.4177,[18]7.3747,[19]7.0068,[20]6.9927
edit: Added 33b data.
Requantizing (33b)
- baseline 16/32bit to `q4_0` (model from HF, so output tensor is quantized)
- `q8_0` to `q4_0`
- `q8_0` to `q5_1`
[1]3.3109,[2]3.7188,[3]4.4459,[4]4.4308,[5]4.3045,[6]4.2951,[7]4.4645,[8]4.5540,[9]4.7997,[10]5.0184,[11]5.1678,[12]5.2154,[13]5.1869,[14]5.2832,[15]5.4346,[16]5.2159,[17]5.1890,[18]5.2093,[19]5.0047,[20]5.0191
[1]3.2961,[2]3.7156,[3]4.4491,[4]4.4430,[5]4.3123,[6]4.3066,[7]4.4731,[8]4.5590,[9]4.8058,[10]5.0216,[11]5.1757,[12]5.2206,[13]5.1922,[14]5.2890,[15]5.4426,[16]5.2195,[17]5.1936,[18]5.2127,[19]5.0071,[20]5.0200
[1]3.2718,[2]3.6866,[3]4.3905,[4]4.3015,[5]4.1522,[6]4.1477,[7]4.3241,[8]4.4095,[9]4.6493,[10]4.8555,[11]4.9919,[12]5.0429,[13]5.0230,[14]5.1008,[15]5.2511,[16]5.0435,[17]5.0283,[18]5.0551,[19]4.8627,[20]4.8862
Requantizing (33b) with `--leave-output-tensor`:
- baseline 16/32bit to `q4_0` (model from HF, so output tensor is quantized)
- `q8_0` to `q4_0`
- `q8_0` to `q5_1`
[1]3.3109,[2]3.7188,[3]4.4459,[4]4.4308,[5]4.3045,[6]4.2951,[7]4.4645,[8]4.5540,[9]4.7997,[10]5.0184,[11]5.1678,[12]5.2154,[13]5.1869,[14]5.2832,[15]5.4346,[16]5.2159,[17]5.1890,[18]5.2093,[19]5.0047,[20]5.0191
[1]3.3002,[2]3.7089,[3]4.4329,[4]4.4210,[5]4.2862,[6]4.2820,[7]4.4499,[8]4.5344,[9]4.7764,[10]4.9896,[11]5.1435,[12]5.1857,[13]5.1619,[14]5.2533,[15]5.4052,[16]5.1854,[17]5.1579,[18]5.1782,[19]4.9747,[20]4.9863
[1]3.2716,[2]3.6796,[3]4.3823,[4]4.2942,[5]4.1472,[6]4.1420,[7]4.3172,[8]4.4032,[9]4.6420,[10]4.8490,[11]4.9853,[12]5.0357,[13]5.0165,[14]5.0950,[15]5.2441,[16]5.0365,[17]5.0206,[18]5.0471,[19]4.8551,[20]4.8781
There is some loss even from `q8_0`, but it still might be worth doing in some cases: e.g. you can keep something like a `q8_0` model around and generate other quantizations from it as needed, based on performance/memory constraints.
I haven't done tests with larger models, but from what I've seen, 7B models are generally the ones quantization affects the most. So while it may be borderline for 7B, it might be a lot more reasonable for 33B or 65B models.
Since you have to explicitly enable requantizing, I don't think allowing this is too dangerous for users.
Note: This is lightly tested and seems to work. I was once a C developer, but that was a long time ago; C++ I can bumble my way through at best.
The additional 33b tests are pretty much as expected: requantizing from `q8_0` doesn't really decrease the quality very much.
Also, leaving the output tensor unquantized adds around 100MB to an 18GB model but reduces perplexity by more than requantizing increases it (compared to the baseline). That seems worthwhile enough that it might even be worth making the default.
At 33b, the `q8_0` to `q4_0` `--leave-output-tensor` model actually has lower perplexity than the 16bit (or 32bit) to `q4_0` one, at the cost of a tiny size increase!
Note: This is only 20 chunks of perplexity, so it's possible the full calculation could turn out to disprove this. It doesn't seem too likely, though.
edit: Now on top of master with the k-quants changes.
Not that you should do this, but just for fun: `q8_0` 33b llama to `q2_K`:
- baseline 16/32bit to `q4_0` (model from HF, so output tensor is quantized)
- `q8_0` to `q4_0`
- `q8_0` to `q2_K`
[1]3.3109,[2]3.7188,[3]4.4459,[4]4.4308,[5]4.3045,[6]4.2951,[7]4.4645,[8]4.5540,[9]4.7997,[10]5.0184,[11]5.1678,[12]5.2154,[13]5.1869,[14]5.2832,[15]5.4346,[16]5.2159,[17]5.1890,[18]5.2093,[19]5.0047,[20]5.0191
[1]3.3002,[2]3.7089,[3]4.4329,[4]4.4210,[5]4.2862,[6]4.2820,[7]4.4499,[8]4.5344,[9]4.7764,[10]4.9896,[11]5.1435,[12]5.1857,[13]5.1619,[14]5.2533,[15]5.4052,[16]5.1854,[17]5.1579,[18]5.1782,[19]4.9747,[20]4.9863
[1]3.4867,[2]3.8735,[3]4.6692,[4]4.9712,[5]4.9514,[6]4.9555,[7]5.1238,[8]5.1624,[9]5.3775,[10]5.6207,[11]5.8092,[12]5.8513,[13]5.8324,[14]5.9409,[15]6.1025,[16]5.8323,[17]5.7683,[18]5.7937,[19]5.5462,[20]5.5515
Is it possible to also implement saving in F16 and F32 (dequantization) in this PR? This feature would be useful for training LoRA on quantized models.
> Is it possible to implement saving in F16 and F32 too in this PR (dequantization)?

I actually had that thought as well. I wasn't even sure this would get merged, so I didn't mess with it. However, I can look at adding the ability to save as f16/f32 in a separate PR; it should be pretty simple.
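For what it's worth, a rough sketch of the direction I have in mind, not actual llama.cpp code: the `dequantize_row_fn` callback below stands in for whatever per-type dequantization routine ggml provides for the tensor's quantization type, and the helper name is hypothetical.

```c
// Hypothetical sketch only, not code from this PR: "quantizing" to F16/F32
// would dequantize each tensor into the existing f32 work buffer, then either
// write that buffer out directly (f32 target) or narrow it to f16.
#include "ggml.h"

// Stand-in for the per-type dequantization routine ggml provides.
typedef void (*dequantize_row_fn)(const void * src, float * dst, int n);

static void tensor_to_float(const void * quantized, int n_elements,
                            dequantize_row_fn dequantize,
                            float * f32_buf,          // existing f32 work buffer
                            ggml_fp16_t * f16_out) {  // NULL if the target is f32
    // Step 1: dequantize into the f32 buffer (same buffer the f16 -> f32 path uses).
    dequantize(quantized, f32_buf, n_elements);

    // Step 2: if the target file type is f16, convert the row down.
    if (f16_out != NULL) {
        ggml_fp32_to_fp16_row(f32_buf, f16_out, n_elements);
    }
}
```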