llama.cpp
RMSE-optimized quants for all quantization types
The PR adds a new build option (LLAMA_NO_RMSE), which is off by default. When it is off, all current quantization types (Q4_0, Q4_1, Q4_2, Q4_3) are performed with RMSE minimization (on master, RMSE minimization is enabled for Q4_2 only and cannot easily be disabled).
This makes generation of quantized models quite a bit longer, but still in the same ballpark as it used to take before it was multi-threaded in PR #1075.
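To illustrate the idea (this is not the PR's actual code, just a minimal sketch with made-up names): instead of deriving the block scale directly from max(abs(x))/7, one can round-trip the block through quantization for a range of candidate scales and keep the one with the lowest squared error.

```c
#include <math.h>

// Minimal sketch (names made up): search for the per-block scale that
// minimizes the round-trip squared error, instead of using max(abs(x))/7.
// n would typically be the quantization block size (32 in llama.cpp).
static float best_scale_rmse(const float * x, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) {
        const float ax = fabsf(x[i]);
        if (ax > amax) amax = ax;
    }
    float best_d   = amax / 7.0f;   // the "simple" Q4_0-style choice
    float best_err = INFINITY;
    // try candidate divisors around the simple one (range is illustrative)
    for (int k = 0; k <= 40; ++k) {
        const float div = 6.0f + 0.2f*k;                 // 6.0 .. 14.0
        const float d   = amax / div;
        const float id  = d > 0.0f ? 1.0f/d : 0.0f;
        float err = 0.0f;
        for (int i = 0; i < n; ++i) {
            int q = (int) roundf(x[i]*id);               // quantize
            if (q < -8) q = -8;                          // clamp to the
            if (q >  7) q =  7;                          // 4-bit range
            const float diff = x[i] - d*q;               // dequantize, compare
            err += diff*diff;
        }
        if (err < best_err) { best_err = err; best_d = d; }
    }
    return best_d;                                       // scale with lowest RMSE
}
```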
With this option enabled, Q4_3 gives a perplexity of 6.0344 for the 7B model, i.e. 0.0273 lower than simple Q4_3 quantization as reported by @ggerganov in #406. If I also enable his trick of not quantizing the output tensors, perplexity becomes 6.0085.
The perplexity result for Q4_3 without quantization of the output tensors for the 13B model is 5.3117.
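For context, the "trick" is simply to leave the output projection unquantized (F16) while quantizing all other weight matrices. A hedged sketch of the kind of check involved (the real logic lives in llama.cpp's quantization routine; this helper is made up):

```c
#include <stdbool.h>
#include <string.h>

// Hypothetical helper: decide whether a tensor should be quantized.
// "output.weight" is the final projection in llama.cpp's LLaMA graph;
// leaving it in F16 costs little extra size but helps perplexity.
static bool should_quantize(const char * tensor_name) {
    return strcmp(tensor_name, "output.weight") != 0;
}
```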
Details for these perplexity runs can be found here (issue #406).
As far as I can tell, we are now on par with the best known GPTQ result for 7B, and better for 13B by about 0.05.
Sounds like a good idea. For me personally, I/O is the bottleneck, since I store the models on a NAS.
It might be a good idea to get #953 merged first, which implements unit tests for the quantization. But that requires an improvement to the test samples.
I'm still a bit skeptical about whether chasing after RMSE is the right thing to do.
Let me explain what I mean: originally, the Q4 methods calculate max(abs()) and divide that by 7. #729 intends to calculate the signed max and divide by 8 instead. This PR tries to find the divisor that minimizes the RMS error. But maybe the princess is in another castle?
What if it actually helps perplexity if we clip the largest values somewhat, even if that comes at a higher RMS error?
^
p |
e |
r | *
p | orig *
l | * #729
e | * *
x | - - - - - - - - - - - - - - - - < RMSE optimum #1106
i |
t | * < perplexity optimum?
y |
+-----|------|------|------------->
7 8 ?
scale factor
So the approach to find that would be: use #729, choose a value in the interesting range of maybe [7, 11], quantize the model, do a perplexity run, lather, rinse, repeat.
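A minimal sketch of one step of that sweep (the function and its signature are made up, not ggml's API): quantize each block with a fixed, user-chosen divisor, so that every perplexity run measures exactly one point on the curve above.

```c
#include <math.h>
#include <stdint.h>

// Illustrative only: quantize one block of n weights with a fixed divisor
// (7, 8, ..., 11). With divisor = 8 the extreme value lands exactly on -8;
// larger divisors effectively clip the largest values.
static void quantize_block_fixed_divisor(const float * x, int n, float divisor,
                                         float * d_out, int8_t * q_out) {
    // signed max as in #729: keep the sign of the largest-magnitude element
    float max = x[0];
    for (int i = 1; i < n; ++i) {
        if (fabsf(x[i]) > fabsf(max)) max = x[i];
    }
    const float d  = max / -divisor;   // extreme value maps to -divisor
    const float id = d != 0.0f ? 1.0f/d : 0.0f;
    for (int i = 0; i < n; ++i) {
        int q = (int) roundf(x[i]*id);
        if (q < -8) q = -8;            // out-of-range values get clipped here
        if (q >  7) q =  7;
        q_out[i] = (int8_t) q;
    }
    *d_out = d;
}
```

One would then quantize the full model once per divisor, run the perplexity tool over wikitext for each, and plot perplexity against the divisor to locate the optimum sketched above.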
@ikawrakow
Just made a full cuBLAS run on 13B using Q4_3, without RMSE optimization and with the output in F16 precision, and got: 5.3075
main: seed = 1682170268
llama.cpp: loading model from ../models/13B/ggml-model-q4_3-output-f16.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 6 (mostly Q4_3)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 9734493.73 KB
llama_model_load_internal: mem required = 11554.34 MB (+ 1608.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size = 400.00 MB
system_info: n_threads = 12 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
perplexity : calculating perplexity over 655 chunks, batch_size=512
2.93 seconds per pass - ETA 32 minutes
[1]3.7052,[2]4.1553,[3]4.9530,[4]5.3817,[5]5.5598,[6]5.4938,[7]5.6338,[8]5.7492,[9]6.0136,[10]6.2525,[11]6.4388,[12]6.4983,[13]6.4590,[14]6.5567,[15]6.7657,[16]6.4420,[17]6.3526,[18]6.3318,[19]6.0375,[20]6.0170,[21]5.9417,[22]5.7639,[23]5.7352,[24]5.6400,[25]5.6548,[26]5.5023,[27]5.3302,[28]5.2330,[29]5.1565,[30]5.0200,[31]4.9747,[32]4.9854,[33]4.9409,[34]4.9796,[35]4.9984,[36]5.0189,[37]5.0113,[38]5.0078,[39]5.0349,[40]5.0774,[41]5.0999,[42]5.1325,[43]5.0970,[44]5.1402,[45]5.1450,[46]5.1202,[47]5.1464,[48]5.1286,[49]5.1304,[50]5.0999,[51]5.1075,[52]5.1012,[53]5.1478,[54]5.1379,[55]5.1200,[56]5.1404,[57]5.1594,[58]5.1818,[59]5.2003,[60]5.2387,[61]5.2315,[62]5.2862,[63]5.3117,[64]5.3227,[65]5.3586,[66]5.3594,[67]5.3771,[68]5.3901,[69]5.4182,[70]5.4484,[71]5.4717,[72]5.5064,[73]5.5534,[74]5.5610,[75]5.5703,[76]5.5838,[77]5.5960,[78]5.5827,[79]5.6087,[80]5.6043,[81]5.6133,[82]5.6107,[83]5.5655,[84]5.5553,[85]5.5483,[86]5.5331,[87]5.4686,[88]5.4265,[89]5.4044,[90]5.3939,[91]5.4152,[92]5.4128,[93]5.4153,[94]5.4153,[95]5.4412,[96]5.4383,[97]5.4336,[98]5.4300,[99]5.4225,[100]5.4204,[101]5.4440,[102]5.4397,[103]5.4550,[104]5.4598,[105]5.4610,[106]5.4753,[107]5.4745,[108]5.4894,[109]5.4882,[110]5.4833,[111]5.5022,[112]5.5191,[113]5.5182,[114]5.5175,[115]5.5215,[116]5.5093,[117]5.5097,[118]5.5330,[119]5.5514,[120]5.5800,[121]5.5945,[122]5.6158,[123]5.6525,[124]5.6684,[125]5.6634,[126]5.6990,[127]5.7300,[128]5.7574,[129]5.7454,[130]5.7539,[131]5.7490,[132]5.7446,[133]5.7318,[134]5.7402,[135]5.7392,[136]5.7311,[137]5.7266,[138]5.7136,[139]5.7058,[140]5.7050,[141]5.6776,[142]5.6734,[143]5.6487,[144]5.6326,[145]5.6238,[146]5.6132,[147]5.6179,[148]5.6202,[149]5.6169,[150]5.6165,[151]5.6212,[152]5.6153,[153]5.6064,[154]5.6005,[155]5.6066,[156]5.6042,[157]5.6202,[158]5.6226,[159]5.6232,[160]5.6268,[161]5.6384,[162]5.6133,[163]5.6034,[164]5.5826,[165]5.5576,[166]5.5342,[167]5.5020,[168]5.4757,[169]5.4622,[170]5.4531,[171]5.4325,[172]5.4202,[173]5.4072,[174]5.3805,[175]5.3599,[176]5.3462,[177]5.3294,[178]5.3096,[179]5.2962,[180]5.2892,[181]5.2729,[182]5.2565,[183]5.2445,[184]5.2435,[185]5.2367,[186]5.2377,[187]5.2436,[188]5.2419,[189]5.2583,[190]5.2585,[191]5.2758,[192]5.2892,[193]5.3032,[194]5.3145,[195]5.3332,[196]5.3447,[197]5.3635,[198]5.3770,[199]5.3788,[200]5.3797,[201]5.3730,[202]5.3862,[203]5.3922,[204]5.3871,[205]5.3960,[206]5.4014,[207]5.3972,[208]5.4033,[209]5.4065,[210]5.4120,[211]5.4227,[212]5.4292,[213]5.4386,[214]5.4415,[215]5.4445,[216]5.4570,[217]5.4734,[218]5.4867,[219]5.4863,[220]5.4836,[221]5.4789,[222]5.4792,[223]5.4732,[224]5.4665,[225]5.4628,[226]5.4829,[227]5.4883,[228]5.4956,[229]5.5025,[230]5.4989,[231]5.5143,[232]5.5036,[233]5.4888,[234]5.4747,[235]5.4525,[236]5.4473,[237]5.4386,[238]5.4417,[239]5.4306,[240]5.4218,[241]5.4251,[242]5.4265,[243]5.4257,[244]5.4163,[245]5.4128,[246]5.4028,[247]5.3930,[248]5.3868,[249]5.3837,[250]5.3874,[251]5.3792,[252]5.3743,[253]5.3653,[254]5.3607,[255]5.3515,[256]5.3350,[257]5.3249,[258]5.3183,[259]5.3173,[260]5.3090,[261]5.3038,[262]5.2997,[263]5.2947,[264]5.2711,[265]5.2707,[266]5.2679,[267]5.2618,[268]5.2684,[269]5.2676,[270]5.2685,[271]5.2749,[272]5.2778,[273]5.2794,[274]5.2802,[275]5.2861,[276]5.2918,[277]5.3039,[278]5.3125,[279]5.3207,[280]5.3244,[281]5.3339,[282]5.3395,[283]5.3517,[284]5.3602,[285]5.3681,[286]5.3805,[287]5.3778,[288]5.3831,[289]5.3770,[290]5.3628,[291]5.3498,[292]5.3364,[293]5.3246,[294]5.3254,[295]5.3256,[296]5.3304,[297]5.3295,[298]5.3317,[299]5.3295,[300]5.3208,[301]5.3211,[302]5.3147,[303]5.3065,[304]5.2992,[305]5.2967,[30
6]5.2864,[307]5.2893,[308]5.2904,[309]5.2772,[310]5.2743,[311]5.2698,[312]5.2711,[313]5.2657,[314]5.2642,[315]5.2510,[316]5.2470,[317]5.2344,[318]5.2184,[319]5.2289,[320]5.2399,[321]5.2447,[322]5.2418,[323]5.2358,[324]5.2339,[325]5.2436,[326]5.2452,[327]5.2460,[328]5.2495,[329]5.2540,[330]5.2561,[331]5.2663,[332]5.2627,[333]5.2701,[334]5.2656,[335]5.2605,[336]5.2629,[337]5.2619,[338]5.2615,[339]5.2571,[340]5.2539,[341]5.2602,[342]5.2634,[343]5.2674,[344]5.2677,[345]5.2692,[346]5.2676,[347]5.2712,[348]5.2750,[349]5.2773,[350]5.2754,[351]5.2767,[352]5.2769,[353]5.2716,[354]5.2725,[355]5.2774,[356]5.2802,[357]5.2774,[358]5.2854,[359]5.2874,[360]5.2843,[361]5.2843,[362]5.2913,[363]5.3020,[364]5.3072,[365]5.3110,[366]5.3126,[367]5.3213,[368]5.3190,[369]5.3204,[370]5.3224,[371]5.3185,[372]5.3231,[373]5.3270,[374]5.3251,[375]5.3248,[376]5.3306,[377]5.3271,[378]5.3296,[379]5.3330,[380]5.3264,[381]5.3235,[382]5.3196,[383]5.3176,[384]5.3176,[385]5.3166,[386]5.3152,[387]5.3152,[388]5.3126,[389]5.3088,[390]5.3036,[391]5.2979,[392]5.2944,[393]5.2939,[394]5.2970,[395]5.2963,[396]5.2909,[397]5.2973,[398]5.3014,[399]5.3083,[400]5.3077,[401]5.3085,[402]5.3097,[403]5.3119,[404]5.3173,[405]5.3023,[406]5.2982,[407]5.2970,[408]5.2980,[409]5.3090,[410]5.3178,[411]5.3271,[412]5.3412,[413]5.3513,[414]5.3571,[415]5.3630,[416]5.3702,[417]5.3798,[418]5.3822,[419]5.3871,[420]5.3947,[421]5.4045,[422]5.4077,[423]5.4134,[424]5.4224,[425]5.4301,[426]5.4360,[427]5.4401,[428]5.4473,[429]5.4509,[430]5.4572,[431]5.4696,[432]5.4727,[433]5.4721,[434]5.4688,[435]5.4701,[436]5.4730,[437]5.4812,[438]5.4887,[439]5.4856,[440]5.4850,[441]5.4808,[442]5.4796,[443]5.4807,[444]5.4824,[445]5.4815,[446]5.4835,[447]5.4859,[448]5.4892,[449]5.4876,[450]5.4888,[451]5.4862,[452]5.4707,[453]5.4614,[454]5.4560,[455]5.4563,[456]5.4601,[457]5.4612,[458]5.4594,[459]5.4592,[460]5.4665,[461]5.4622,[462]5.4588,[463]5.4568,[464]5.4564,[465]5.4542,[466]5.4466,[467]5.4453,[468]5.4435,[469]5.4444,[470]5.4433,[471]5.4383,[472]5.4386,[473]5.4341,[474]5.4329,[475]5.4263,[476]5.4239,[477]5.4154,[478]5.4128,[479]5.4132,[480]5.4156,[481]5.4156,[482]5.4110,[483]5.4068,[484]5.4078,[485]5.4011,[486]5.3950,[487]5.3939,[488]5.3917,[489]5.3865,[490]5.3832,[491]5.3798,[492]5.3734,[493]5.3707,[494]5.3689,[495]5.3670,[496]5.3630,[497]5.3569,[498]5.3544,[499]5.3510,[500]5.3431,[501]5.3361,[502]5.3351,[503]5.3342,[504]5.3265,[505]5.3262,[506]5.3268,[507]5.3214,[508]5.3177,[509]5.3182,[510]5.3203,[511]5.3246,[512]5.3286,[513]5.3311,[514]5.3362,[515]5.3320,[516]5.3310,[517]5.3310,[518]5.3311,[519]5.3332,[520]5.3344,[521]5.3356,[522]5.3370,[523]5.3378,[524]5.3431,[525]5.3457,[526]5.3462,[527]5.3477,[528]5.3425,[529]5.3434,[530]5.3398,[531]5.3392,[532]5.3440,[533]5.3467,[534]5.3451,[535]5.3473,[536]5.3432,[537]5.3414,[538]5.3465,[539]5.3473,[540]5.3487,[541]5.3486,[542]5.3500,[543]5.3521,[544]5.3534,[545]5.3525,[546]5.3526,[547]5.3494,[548]5.3452,[549]5.3454,[550]5.3434,[551]5.3409,[552]5.3389,[553]5.3360,[554]5.3338,[555]5.3318,[556]5.3310,[557]5.3328,[558]5.3294,[559]5.3299,[560]5.3285,[561]5.3285,[562]5.3261,[563]5.3258,[564]5.3299,[565]5.3309,[566]5.3316,[567]5.3295,[568]5.3307,[569]5.3292,[570]5.3318,[571]5.3331,[572]5.3339,[573]5.3342,[574]5.3312,[575]5.3295,[576]5.3288,[577]5.3272,[578]5.3254,[579]5.3252,[580]5.3200,[581]5.3171,[582]5.3170,[583]5.3178,[584]5.3183,[585]5.3126,[586]5.3071,[587]5.3076,[588]5.3120,[589]5.3169,[590]5.3199,[591]5.3216,[592]5.3205,[593]5.3165,[594]5.3180,[595]5.3166,[596]5.3204,[597]5.3183,[598]5.3151,[599]5.3178,[600]5.3169,[601]5.3157,[602]5
.3157,[603]5.3185,[604]5.3191,[605]5.3218,[606]5.3231,[607]5.3217,[608]5.3188,[609]5.3197,[610]5.3238,[611]5.3227,[612]5.3249,[613]5.3220,[614]5.3179,[615]5.3120,[616]5.3148,[617]5.3099,[618]5.3056,[619]5.3012,[620]5.2903,[621]5.2852,[622]5.2833,[623]5.2846,[624]5.2852,[625]5.2859,[626]5.2856,[627]5.2882,[628]5.2890,[629]5.2894,[630]5.2925,[631]5.2970,[632]5.3017,[633]5.3007,[634]5.3036,[635]5.3033,[636]5.2997,[637]5.2961,[638]5.2979,[639]5.2949,[640]5.2957,[641]5.2960,[642]5.3010,[643]5.3026,[644]5.3044,[645]5.3029,[646]5.3063,[647]5.3014,[648]5.3024,[649]5.3027,[650]5.3055,[651]5.3097,[652]5.3100,[653]5.3137,[654]5.3084,[655]5.3075,
llama_print_timings: load time = 6119.84 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: prompt eval time = 1858813.21 ms / 335360 tokens ( 5.54 ms per token)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: total time = 1889707.90 ms
Will make another run, this time using RMSE optimization (i.e. the same as the one in the OP) and double-check the reported 5.3117 result. But if it is confirmed, it would indicate that the RMSE optimization is actually making the result worse in this case, for some reason.
My result for 13B, using Q4_3 with RMSE optimization + F16 output, is: 5.2962
I think this result makes more sense, since it is in line with the expectation that I described here: https://github.com/ggerganov/llama.cpp/discussions/406#discussioncomment-5689456
main: seed = 1682172642
llama.cpp: loading model from ../models/13B/ggml-model-q4_3-output-f16-rmse.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 6 (mostly Q4_3)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 9734493.73 KB
llama_model_load_internal: mem required = 11554.34 MB (+ 1608.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size = 400.00 MB
system_info: n_threads = 12 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
perplexity : calculating perplexity over 655 chunks, batch_size=512
2.94 seconds per pass - ETA 32 minutes
[1]3.7362,[2]4.1744,[3]4.9576,[4]5.3621,[5]5.5410,[6]5.4788,[7]5.6392,[8]5.7500,[9]6.0088,[10]6.2366,[11]6.4228,[12]6.4859,[13]6.4491,[14]6.5428,[15]6.7439,[16]6.4225,[17]6.3396,[18]6.3169,[19]6.0233,[20]6.0024,[21]5.9256,[22]5.7530,[23]5.7201,[24]5.6258,[25]5.6327,[26]5.4845,[27]5.3094,[28]5.2083,[29]5.1320,[30]4.9981,[31]4.9567,[32]4.9675,[33]4.9237,[34]4.9636,[35]4.9806,[36]5.0033,[37]4.9960,[38]4.9915,[39]5.0202,[40]5.0616,[41]5.0862,[42]5.1202,[43]5.0861,[44]5.1307,[45]5.1348,[46]5.1096,[47]5.1370,[48]5.1183,[49]5.1225,[50]5.0927,[51]5.0998,[52]5.0920,[53]5.1385,[54]5.1290,[55]5.1113,[56]5.1311,[57]5.1489,[58]5.1710,[59]5.1904,[60]5.2260,[61]5.2188,[62]5.2735,[63]5.2982,[64]5.3100,[65]5.3463,[66]5.3455,[67]5.3634,[68]5.3761,[69]5.4045,[70]5.4349,[71]5.4582,[72]5.4919,[73]5.5385,[74]5.5451,[75]5.5550,[76]5.5687,[77]5.5802,[78]5.5664,[79]5.5933,[80]5.5871,[81]5.5951,[82]5.5919,[83]5.5466,[84]5.5365,[85]5.5301,[86]5.5156,[87]5.4509,[88]5.4070,[89]5.3858,[90]5.3750,[91]5.3960,[92]5.3922,[93]5.3940,[94]5.3927,[95]5.4193,[96]5.4162,[97]5.4128,[98]5.4089,[99]5.4020,[100]5.3994,[101]5.4223,[102]5.4177,[103]5.4330,[104]5.4377,[105]5.4390,[106]5.4531,[107]5.4517,[108]5.4666,[109]5.4659,[110]5.4606,[111]5.4784,[112]5.4950,[113]5.4943,[114]5.4930,[115]5.4972,[116]5.4852,[117]5.4847,[118]5.5081,[119]5.5259,[120]5.5549,[121]5.5701,[122]5.5912,[123]5.6276,[124]5.6452,[125]5.6402,[126]5.6758,[127]5.7086,[128]5.7369,[129]5.7256,[130]5.7341,[131]5.7301,[132]5.7257,[133]5.7132,[134]5.7222,[135]5.7222,[136]5.7139,[137]5.7100,[138]5.6974,[139]5.6896,[140]5.6884,[141]5.6614,[142]5.6575,[143]5.6327,[144]5.6168,[145]5.6083,[146]5.5972,[147]5.6019,[148]5.6050,[149]5.6019,[150]5.6011,[151]5.6057,[152]5.5999,[153]5.5903,[154]5.5846,[155]5.5908,[156]5.5892,[157]5.6045,[158]5.6062,[159]5.6072,[160]5.6110,[161]5.6225,[162]5.5972,[163]5.5878,[164]5.5676,[165]5.5427,[166]5.5196,[167]5.4880,[168]5.4613,[169]5.4483,[170]5.4390,[171]5.4185,[172]5.4062,[173]5.3930,[174]5.3661,[175]5.3457,[176]5.3327,[177]5.3162,[178]5.2963,[179]5.2832,[180]5.2757,[181]5.2597,[182]5.2438,[183]5.2320,[184]5.2312,[185]5.2241,[186]5.2253,[187]5.2309,[188]5.2284,[189]5.2448,[190]5.2451,[191]5.2620,[192]5.2756,[193]5.2900,[194]5.3014,[195]5.3208,[196]5.3325,[197]5.3513,[198]5.3647,[199]5.3667,[200]5.3676,[201]5.3610,[202]5.3735,[203]5.3792,[204]5.3744,[205]5.3834,[206]5.3888,[207]5.3851,[208]5.3906,[209]5.3943,[210]5.3998,[211]5.4100,[212]5.4163,[213]5.4254,[214]5.4288,[215]5.4319,[216]5.4438,[217]5.4603,[218]5.4738,[219]5.4735,[220]5.4706,[221]5.4657,[222]5.4658,[223]5.4597,[224]5.4532,[225]5.4496,[226]5.4696,[227]5.4756,[228]5.4828,[229]5.4899,[230]5.4862,[231]5.5013,[232]5.4910,[233]5.4762,[234]5.4620,[235]5.4403,[236]5.4352,[237]5.4269,[238]5.4304,[239]5.4193,[240]5.4102,[241]5.4136,[242]5.4153,[243]5.4147,[244]5.4049,[245]5.4014,[246]5.3912,[247]5.3816,[248]5.3755,[249]5.3723,[250]5.3757,[251]5.3675,[252]5.3627,[253]5.3537,[254]5.3493,[255]5.3401,[256]5.3237,[257]5.3136,[258]5.3070,[259]5.3062,[260]5.2981,[261]5.2931,[262]5.2891,[263]5.2843,[264]5.2605,[265]5.2605,[266]5.2575,[267]5.2515,[268]5.2580,[269]5.2572,[270]5.2580,[271]5.2640,[272]5.2668,[273]5.2678,[274]5.2685,[275]5.2745,[276]5.2801,[277]5.2921,[278]5.3005,[279]5.3085,[280]5.3122,[281]5.3216,[282]5.3269,[283]5.3391,[284]5.3474,[285]5.3553,[286]5.3679,[287]5.3645,[288]5.3696,[289]5.3634,[290]5.3495,[291]5.3367,[292]5.3234,[293]5.3117,[294]5.3125,[295]5.3125,[296]5.3172,[297]5.3161,[298]5.3181,[299]5.3160,[300]5.3074,[301]5.3077,[302]5.3015,[303]5.2931,[304]5.2860,[305]5.2835,[30
6]5.2733,[307]5.2761,[308]5.2769,[309]5.2637,[310]5.2612,[311]5.2570,[312]5.2585,[313]5.2533,[314]5.2514,[315]5.2387,[316]5.2343,[317]5.2222,[318]5.2060,[319]5.2165,[320]5.2273,[321]5.2322,[322]5.2293,[323]5.2237,[324]5.2220,[325]5.2315,[326]5.2329,[327]5.2335,[328]5.2373,[329]5.2422,[330]5.2444,[331]5.2547,[332]5.2512,[333]5.2586,[334]5.2541,[335]5.2490,[336]5.2513,[337]5.2502,[338]5.2501,[339]5.2458,[340]5.2431,[341]5.2495,[342]5.2528,[343]5.2568,[344]5.2571,[345]5.2586,[346]5.2569,[347]5.2604,[348]5.2641,[349]5.2661,[350]5.2642,[351]5.2655,[352]5.2658,[353]5.2604,[354]5.2612,[355]5.2661,[356]5.2691,[357]5.2663,[358]5.2743,[359]5.2762,[360]5.2728,[361]5.2725,[362]5.2792,[363]5.2900,[364]5.2951,[365]5.2990,[366]5.3007,[367]5.3094,[368]5.3074,[369]5.3089,[370]5.3109,[371]5.3069,[372]5.3116,[373]5.3154,[374]5.3134,[375]5.3129,[376]5.3187,[377]5.3151,[378]5.3176,[379]5.3211,[380]5.3144,[381]5.3114,[382]5.3077,[383]5.3059,[384]5.3061,[385]5.3048,[386]5.3036,[387]5.3034,[388]5.3007,[389]5.2969,[390]5.2918,[391]5.2859,[392]5.2825,[393]5.2821,[394]5.2854,[395]5.2847,[396]5.2796,[397]5.2859,[398]5.2901,[399]5.2971,[400]5.2966,[401]5.2974,[402]5.2986,[403]5.3011,[404]5.3066,[405]5.2918,[406]5.2877,[407]5.2867,[408]5.2875,[409]5.2985,[410]5.3076,[411]5.3169,[412]5.3308,[413]5.3409,[414]5.3470,[415]5.3528,[416]5.3598,[417]5.3696,[418]5.3721,[419]5.3769,[420]5.3844,[421]5.3942,[422]5.3975,[423]5.4033,[424]5.4122,[425]5.4199,[426]5.4259,[427]5.4301,[428]5.4373,[429]5.4410,[430]5.4472,[431]5.4596,[432]5.4627,[433]5.4620,[434]5.4587,[435]5.4601,[436]5.4629,[437]5.4710,[438]5.4782,[439]5.4755,[440]5.4748,[441]5.4704,[442]5.4692,[443]5.4702,[444]5.4721,[445]5.4712,[446]5.4733,[447]5.4756,[448]5.4788,[449]5.4773,[450]5.4784,[451]5.4755,[452]5.4599,[453]5.4503,[454]5.4450,[455]5.4453,[456]5.4495,[457]5.4508,[458]5.4491,[459]5.4490,[460]5.4563,[461]5.4523,[462]5.4489,[463]5.4468,[464]5.4465,[465]5.4443,[466]5.4369,[467]5.4360,[468]5.4340,[469]5.4352,[470]5.4341,[471]5.4292,[472]5.4299,[473]5.4251,[474]5.4240,[475]5.4171,[476]5.4148,[477]5.4065,[478]5.4035,[479]5.4036,[480]5.4061,[481]5.4062,[482]5.4015,[483]5.3973,[484]5.3980,[485]5.3913,[486]5.3849,[487]5.3837,[488]5.3815,[489]5.3761,[490]5.3730,[491]5.3698,[492]5.3630,[493]5.3604,[494]5.3585,[495]5.3561,[496]5.3521,[497]5.3457,[498]5.3431,[499]5.3395,[500]5.3314,[501]5.3245,[502]5.3236,[503]5.3225,[504]5.3149,[505]5.3145,[506]5.3150,[507]5.3097,[508]5.3060,[509]5.3066,[510]5.3088,[511]5.3130,[512]5.3170,[513]5.3194,[514]5.3248,[515]5.3208,[516]5.3198,[517]5.3197,[518]5.3198,[519]5.3219,[520]5.3233,[521]5.3245,[522]5.3258,[523]5.3265,[524]5.3319,[525]5.3347,[526]5.3353,[527]5.3369,[528]5.3314,[529]5.3323,[530]5.3287,[531]5.3282,[532]5.3329,[533]5.3356,[534]5.3337,[535]5.3357,[536]5.3316,[537]5.3298,[538]5.3347,[539]5.3355,[540]5.3371,[541]5.3369,[542]5.3382,[543]5.3404,[544]5.3416,[545]5.3406,[546]5.3408,[547]5.3375,[548]5.3334,[549]5.3334,[550]5.3313,[551]5.3286,[552]5.3266,[553]5.3238,[554]5.3216,[555]5.3197,[556]5.3189,[557]5.3208,[558]5.3175,[559]5.3178,[560]5.3164,[561]5.3166,[562]5.3141,[563]5.3140,[564]5.3182,[565]5.3194,[566]5.3201,[567]5.3182,[568]5.3192,[569]5.3177,[570]5.3204,[571]5.3216,[572]5.3224,[573]5.3228,[574]5.3200,[575]5.3184,[576]5.3177,[577]5.3163,[578]5.3144,[579]5.3144,[580]5.3091,[581]5.3061,[582]5.3061,[583]5.3069,[584]5.3074,[585]5.3016,[586]5.2962,[587]5.2965,[588]5.3008,[589]5.3058,[590]5.3088,[591]5.3105,[592]5.3094,[593]5.3054,[594]5.3068,[595]5.3052,[596]5.3091,[597]5.3071,[598]5.3039,[599]5.3065,[600]5.3056,[601]5.3045,[602]5
.3046,[603]5.3073,[604]5.3079,[605]5.3106,[606]5.3120,[607]5.3105,[608]5.3077,[609]5.3086,[610]5.3126,[611]5.3116,[612]5.3138,[613]5.3109,[614]5.3070,[615]5.3011,[616]5.3036,[617]5.2986,[618]5.2942,[619]5.2898,[620]5.2789,[621]5.2739,[622]5.2721,[623]5.2734,[624]5.2737,[625]5.2745,[626]5.2741,[627]5.2768,[628]5.2776,[629]5.2780,[630]5.2811,[631]5.2854,[632]5.2902,[633]5.2891,[634]5.2920,[635]5.2918,[636]5.2884,[637]5.2848,[638]5.2868,[639]5.2838,[640]5.2845,[641]5.2849,[642]5.2899,[643]5.2915,[644]5.2933,[645]5.2919,[646]5.2953,[647]5.2902,[648]5.2913,[649]5.2914,[650]5.2943,[651]5.2983,[652]5.2987,[653]5.3024,[654]5.2970,[655]5.2962,
llama_print_timings: load time = 6171.79 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: prompt eval time = 1846949.17 ms / 335360 tokens ( 5.51 ms per token)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: total time = 1878169.61 ms
@ggerganov Are these results with or without the changes you made to Q4_3 after I opened this PR (and reported the results)?
It includes all changes from today related to Q4_3 quantization. Maybe this is the source of the difference, although it's still strange, since the Q4_3 changes should only improve the performance. Of course, we cannot expect exactly the same results, but the difference is rather big, so I'm not 100% sure. I can do one extra Q4_3 13B run with the build from yesterday to make sure.
@ggerganov Rebased this branch on the latest master, re-quantized, and re-ran the perplexity. Now I get the lower result with OpenBLAS as well (5.2961, so actually 0.0001 lower than cuBLAS). So, something else has happened that positively impacts the results. Another observation is that the OpenBLAS and cuBLAS results are not identical, as they are for the fp16 model. They are very close, but not exactly the same. See details.
Is it possible this affects this comment you made in #729?
llama_print_timings: load time = 31077.61 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: prompt eval time = 9573081.81 ms / 335360 tokens ( 28.55 ms per token)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: total time = 9604412.75 ms
I think we cannot expect cuBLAS and OpenBLAS to be exactly the same, because cuBLAS dequantizes x to F16, casts y to F16, and performs an F16 mat mul, while OpenBLAS dequantizes x to F32 and performs an F32 mat mul (if I'm not mistaken).
That's not exactly the case: when multiplying q x f32, cuBLAS dequantizes to f32 and does an f32 x f32 mat mul. The only difference with OpenBLAS is when performing an f16 x f32 mat mul (ggml_compute_forward_mul_mat_f16_f32). In this case, src1 is converted to f16 instead of converting src0 to f32, and an f16 x f16 mat mul is done.
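For intuition, a small standalone example of why an extra F16 round-trip alone is enough to shift the last digits (this assumes ggml.h exposes the ggml_fp32_to_fp16 / ggml_fp16_to_fp32 conversion helpers; any F32→F16→F32 round-trip would show the same effect):

```c
#include <stdio.h>
#include "ggml.h"   // assumed to declare ggml_fp32_to_fp16 / ggml_fp16_to_fp32

int main(void) {
    const float x[4] = { 0.1234567f, -1.9876543f,  3.1415927f, 0.0001234f };
    const float y[4] = { 0.7654321f,  0.3333333f, -2.7182818f, 1.0000001f };

    float dot_f32 = 0.0f, dot_f16 = 0.0f;
    for (int i = 0; i < 4; ++i) {
        // keep y in F32 on one path, round it to F16 and back on the other
        dot_f32 += x[i] * y[i];
        dot_f16 += x[i] * ggml_fp16_to_fp32(ggml_fp32_to_fp16(y[i]));
    }
    // the two sums agree only to a few decimal places
    printf("f32 path: %.9f\nf16 path: %.9f\n", dot_f32, dot_f16);
    return 0;
}
```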
@ggerganov I propose we close this PR. Although there is some benefit from RMSE minimization for QX_1 and QX_3 quantization of the 7B model, the benefit mostly goes away for 13B (and Q5_1 is actually worse with RMSE minimization than without at 13B).
You are minimizing error - why should it be worse? It may be worse for one case but better for another, no?
By that I mean that the perplexity for a wide range of other files (other than en-wikitext or whatever) may be better. And not for one model but for another...
Quantization is here to compress the data as much as possible without affecting the model's quality much.