Support requantizing models instead of only allowing quantization from 16/32bit
This pull changes the llama.cpp API a bit for `llama_model_quantize` (but there was a comment saying it wasn't ideal and would probably change) so that it now takes a structure of parameters. In addition to the existing parameters that were passed separately, I added a toggle for quantizing the output tensor and a toggle for allowing quantization of non-f32/f16 models.
`llama_quantize_model_internal` will now dequantize data (if possible and the option is enabled) into the same buffer used for converting f16 to f32.
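For illustration, here is roughly how a caller might use the new structured API. The exact names (`llama_model_quantize_params`, `llama_model_quantize_default_params`, and the two toggle fields) are my reading of the change and could differ slightly from the merged code:

```c
// Rough sketch (names are my reading of the change, not verbatim from the PR):
// requantize an existing q8_0 model to q4_0 while leaving the output tensor alone.
#include "llama.h"

int main(void) {
    struct llama_model_quantize_params params = llama_model_quantize_default_params();
    params.nthread                = 4;                       // worker threads
    params.ftype                  = LLAMA_FTYPE_MOSTLY_Q4_0; // target quantization type
    params.allow_requantize       = true;  // input is already quantized (q8_0)
    params.quantize_output_tensor = false; // i.e. --leave-output-tensor

    // returns 0 on success
    return llama_model_quantize("ggml-model-q8_0.bin", "ggml-model-q4_0.bin", &params);
}
```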
I also updated the `quantize` example to add `--allow-requantize` and `--leave-output-tensor` flags.
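For example, requantizing an existing `q8_0` model down to `q4_0` while keeping the output tensor intact would look something like `./quantize --allow-requantize --leave-output-tensor ggml-model-q8_0.bin ggml-model-q4_0.bin q4_0` (treat the exact argument order as illustrative; the example's usage message is authoritative).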
I did a little experimentation with requantizing and its effect on perplexity (my system is pretty old, so I only ran perplexity for 20 chunks).
The baseline is from a reliable source, so I presume it was quantized from 16bit or 32bit.
edit: Reorganized to make comparison easier.
Requantizing (7b)
In order:
- baseline 16/32bit to `q4_0`
- `q8_0` to `q4_0`
- `q8_0` to `q5_1`
- `q8_0` to `q5_1` to `q4_0`
[1]4.4543,[2]4.9401,[3]5.8275,[4]6.4841,[5]6.5853,[6]6.5085,[7]6.6925,[8]6.8058,[9]7.1425,[10]7.3864,[11]7.5937,[12]7.6130,[13]7.5411,[14]7.6129,[15]7.8702,[16]7.4694,[17]7.3520,[18]7.3030,[19]6.9404,[20]6.9317
[1]4.5288,[2]5.0356,[3]5.9668,[4]6.6169,[5]6.7260,[6]6.6440,[7]6.8366,[8]6.9398,[9]7.2729,[10]7.5059,[11]7.7240,[12]7.7457,[13]7.6775,[14]7.7551,[15]8.0253,[16]7.6084,[17]7.4895,[18]7.4435,[19]7.0696,[20]7.0562
[1]4.2760,[2]4.7250,[3]5.6158,[4]6.2080,[5]6.3382,[6]6.2962,[7]6.4810,[8]6.5727,[9]6.9023,[10]7.1424,[11]7.3396,[12]7.3646,[13]7.2815,[14]7.3281,[15]7.5752,[16]7.2036,[17]7.0927,[18]7.0414,[19]6.6903,[20]6.6769
[1]4.4667,[2]5.0004,[3]5.9462,[4]6.5659,[5]6.6665,[6]6.5793,[7]6.7940,[8]6.8855,[9]7.2347,[10]7.4503,[11]7.6859,[12]7.7178,[13]7.6519,[14]7.7240,[15]7.9988,[16]7.5971,[17]7.4865,[18]7.4467,[19]7.0744,[20]7.0560
Requantizing (7b) with `--leave-output-tensor`:
In order:
- baseline 16/32bit to `q4_0` (model from HF, so output tensor is quantized)
- `q8_0` to `q4_0`
- `q8_0` to `q5_1`
- `q8_0` to `q5_1` to `q4_0`
[1]4.4543,[2]4.9401,[3]5.8275,[4]6.4841,[5]6.5853,[6]6.5085,[7]6.6925,[8]6.8058,[9]7.1425,[10]7.3864,[11]7.5937,[12]7.6130,[13]7.5411,[14]7.6129,[15]7.8702,[16]7.4694,[17]7.3520,[18]7.3030,[19]6.9404,[20]6.9317
[1]4.4459,[2]4.9815,[3]5.9177,[4]6.5378,[5]6.6339,[6]6.5731,[7]6.7753,[8]6.8779,[9]7.2067,[10]7.4348,[11]7.6518,[12]7.6790,[13]7.6165,[14]7.6991,[15]7.9601,[16]7.5499,[17]7.4370,[18]7.3908,[19]7.0176,[20]7.0056
[1]4.2412,[2]4.7153,[3]5.5916,[4]6.1828,[5]6.3129,[6]6.2794,[7]6.4634,[8]6.5546,[9]6.8808,[10]7.1252,[11]7.3224,[12]7.3477,[13]7.2649,[14]7.3098,[15]7.5562,[16]7.1868,[17]7.0772,[18]7.0251,[19]6.6750,[20]6.6615
[1]4.4444,[2]4.9846,[3]5.9130,[4]6.5119,[5]6.6093,[6]6.5362,[7]6.7444,[8]6.8297,[9]7.1733,[10]7.3866,[11]7.6076,[12]7.6364,[13]7.5806,[14]7.6528,[15]7.9191,[16]7.5238,[17]7.4177,[18]7.3747,[19]7.0068,[20]6.9927
edit: Added 33b data.
Requantizing (33b)
- baseline 16/32bit to `q4_0` (model from HF, so output tensor is quantized)
- `q8_0` to `q4_0`
- `q8_0` to `q5_1`
[1]3.3109,[2]3.7188,[3]4.4459,[4]4.4308,[5]4.3045,[6]4.2951,[7]4.4645,[8]4.5540,[9]4.7997,[10]5.0184,[11]5.1678,[12]5.2154,[13]5.1869,[14]5.2832,[15]5.4346,[16]5.2159,[17]5.1890,[18]5.2093,[19]5.0047,[20]5.0191
[1]3.2961,[2]3.7156,[3]4.4491,[4]4.4430,[5]4.3123,[6]4.3066,[7]4.4731,[8]4.5590,[9]4.8058,[10]5.0216,[11]5.1757,[12]5.2206,[13]5.1922,[14]5.2890,[15]5.4426,[16]5.2195,[17]5.1936,[18]5.2127,[19]5.0071,[20]5.0200
[1]3.2718,[2]3.6866,[3]4.3905,[4]4.3015,[5]4.1522,[6]4.1477,[7]4.3241,[8]4.4095,[9]4.6493,[10]4.8555,[11]4.9919,[12]5.0429,[13]5.0230,[14]5.1008,[15]5.2511,[16]5.0435,[17]5.0283,[18]5.0551,[19]4.8627,[20]4.8862
Requantizing (33b) with `--leave-output-tensor`:
- baseline 16/32bit to `q4_0` (model from HF, so output tensor is quantized)
- `q8_0` to `q4_0`
- `q8_0` to `q5_1`
[1]3.3109,[2]3.7188,[3]4.4459,[4]4.4308,[5]4.3045,[6]4.2951,[7]4.4645,[8]4.5540,[9]4.7997,[10]5.0184,[11]5.1678,[12]5.2154,[13]5.1869,[14]5.2832,[15]5.4346,[16]5.2159,[17]5.1890,[18]5.2093,[19]5.0047,[20]5.0191
[1]3.3002,[2]3.7089,[3]4.4329,[4]4.4210,[5]4.2862,[6]4.2820,[7]4.4499,[8]4.5344,[9]4.7764,[10]4.9896,[11]5.1435,[12]5.1857,[13]5.1619,[14]5.2533,[15]5.4052,[16]5.1854,[17]5.1579,[18]5.1782,[19]4.9747,[20]4.9863
[1]3.2716,[2]3.6796,[3]4.3823,[4]4.2942,[5]4.1472,[6]4.1420,[7]4.3172,[8]4.4032,[9]4.6420,[10]4.8490,[11]4.9853,[12]5.0357,[13]5.0165,[14]5.0950,[15]5.2441,[16]5.0365,[17]5.0206,[18]5.0471,[19]4.8551,[20]4.8781
There is some loss even from `q8_0`, but it still might be worth doing in some cases: e.g. you can keep something like a `q8_0` model around and generate other quantizations from it as needed, based on performance/memory constraints.
I haven't done tests with larger models, but from what I've seen, 7B models are generally the ones quantization affects the most. So while it may be borderline for 7B, it might be a lot more reasonable for 33B or 65B models.
Since you have to explicitly enable requantizing, I don't think allowing this is too dangerous for users.
Note: This is lightly tested and seems to work. I was once a C developer, but that was a long time ago; C++ I can bumble my way through at best.
The additional 33b tests are pretty much as expected: requantizing from `q8_0` doesn't really decrease the quality very much.
Also, leaving the output tensor unquantized adds around 100MB to an 18GB model but reduces perplexity by more than requantizing increases it (compared to the baseline). That seems worthwhile enough that it might even be worth making the default.
At 33b, the `q8_0` to `q4_0` `--leave-output-tensor` model actually has lower perplexity than the 16bit (or 32bit) to `q4_0` one, at the cost of a tiny size increase!
Note: This is only 20 chunks of perplexity, so it's possible the full calculation could turn out to disprove this. It doesn't seem too likely, though.
edit: Now on top of master with the k-quants changes.
Not that you should do this, but just for fun: `q8_0` 33b llama to `q2_K`:
- baseline 16/32bit to `q4_0` (model from HF, so output tensor is quantized)
- `q8_0` to `q4_0`
- `q8_0` to `q2_K`
[1]3.3109,[2]3.7188,[3]4.4459,[4]4.4308,[5]4.3045,[6]4.2951,[7]4.4645,[8]4.5540,[9]4.7997,[10]5.0184,[11]5.1678,[12]5.2154,[13]5.1869,[14]5.2832,[15]5.4346,[16]5.2159,[17]5.1890,[18]5.2093,[19]5.0047,[20]5.0191
[1]3.3002,[2]3.7089,[3]4.4329,[4]4.4210,[5]4.2862,[6]4.2820,[7]4.4499,[8]4.5344,[9]4.7764,[10]4.9896,[11]5.1435,[12]5.1857,[13]5.1619,[14]5.2533,[15]5.4052,[16]5.1854,[17]5.1579,[18]5.1782,[19]4.9747,[20]4.9863
[1]3.4867,[2]3.8735,[3]4.6692,[4]4.9712,[5]4.9514,[6]4.9555,[7]5.1238,[8]5.1624,[9]5.3775,[10]5.6207,[11]5.8092,[12]5.8513,[13]5.8324,[14]5.9409,[15]6.1025,[16]5.8323,[17]5.7683,[18]5.7937,[19]5.5462,[20]5.5515
Is it possible to also implement saving in F16 and F32 (dequantization) in this PR? This feature would be useful for training LoRA on quantized models.
> Is it possible to implement saving in F16 and F32 too in this PR (dequantization)?

I actually had that thought as well. I wasn't even sure this would get merged, so I didn't mess with it. However, I can look at adding the ability to save as f16/f32 in a separate PR; it should be pretty simple.
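For what it's worth, a rough sketch of the direction I have in mind, not actual llama.cpp code: the `dequantize_row_fn` callback below stands in for whatever per-type dequantization routine ggml provides for the tensor's quantization type, and the helper name is hypothetical.

```c
// Hypothetical sketch only, not code from this PR: "quantizing" to F16/F32
// would dequantize each tensor into the existing f32 work buffer, then either
// write that buffer out directly (f32 target) or narrow it to f16.
#include "ggml.h"

// Stand-in for the per-type dequantization routine ggml provides.
typedef void (*dequantize_row_fn)(const void * src, float * dst, int n);

static void tensor_to_float(const void * quantized, int n_elements,
                            dequantize_row_fn dequantize,
                            float * f32_buf,          // existing f32 work buffer
                            ggml_fp16_t * f16_out) {  // NULL if the target is f32
    // Step 1: dequantize into the f32 buffer (same buffer the f16 -> f32 path uses).
    dequantize(quantized, f32_buf, n_elements);

    // Step 2: if the target file type is f16, convert the row down.
    if (f16_out != NULL) {
        ggml_fp32_to_fp16_row(f32_buf, f16_out, n_elements);
    }
}
```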