llama.cpp
RMSE-optimized quants for all quantization types
The PR adds a new build option (LLAMA_NO_RMSE), which is off by default. When it is off, all current quantization types (Q4_0, Q4_1, Q4_2, Q4_3) are performed with RMSE minimization (on master, RMSE minimization is enabled for Q4_2 only and cannot easily be disabled).
This makes generation of quantized models quite a bit longer, but still in the same ballpark as it used to take before it was multi-threaded in PR #1075.
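To illustrate the idea (this is not the PR's actual code, just a minimal sketch with made-up names): instead of deriving the block scale directly from max(abs(x))/7, one can round-trip the block through quantization for a range of candidate scales and keep the one with the lowest squared error.

```c
#include <math.h>

// Minimal sketch (names made up): search for the per-block scale that
// minimizes the round-trip squared error, instead of using max(abs(x))/7.
// n would typically be the quantization block size (32 in llama.cpp).
static float best_scale_rmse(const float * x, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) {
        const float ax = fabsf(x[i]);
        if (ax > amax) amax = ax;
    }
    float best_d   = amax / 7.0f;   // the "simple" Q4_0-style choice
    float best_err = INFINITY;
    // try candidate divisors around the simple one (range is illustrative)
    for (int k = 0; k <= 40; ++k) {
        const float div = 6.0f + 0.2f*k;                 // 6.0 .. 14.0
        const float d   = amax / div;
        const float id  = d > 0.0f ? 1.0f/d : 0.0f;
        float err = 0.0f;
        for (int i = 0; i < n; ++i) {
            int q = (int) roundf(x[i]*id);               // quantize
            if (q < -8) q = -8;                          // clamp to the
            if (q >  7) q =  7;                          // 4-bit range
            const float diff = x[i] - d*q;               // dequantize, compare
            err += diff*diff;
        }
        if (err < best_err) { best_err = err; best_d = d; }
    }
    return best_d;                                       // scale with lowest RMSE
}
```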
With this option enabled, Q4_3 gives a perplexity of 6.0344 for the 7B model, i.e. 0.0273 lower than simple Q4_3 quantization as reported by @ggerganov in #406. If I also enable his trick of not quantizing the output tensors, perplexity becomes 6.0085.
The perplexity result for Q4_3 without quantization of the output tensors for the 13B model is 5.3117.
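For context, the "trick" is simply to leave the output projection unquantized (F16) while quantizing all other weight matrices. A hedged sketch of the kind of check involved (the real logic lives in llama.cpp's quantization routine; this helper is made up):

```c
#include <stdbool.h>
#include <string.h>

// Hypothetical helper: decide whether a tensor should be quantized.
// "output.weight" is the final projection in llama.cpp's LLaMA graph;
// leaving it in F16 costs little extra size but helps perplexity.
static bool should_quantize(const char * tensor_name) {
    return strcmp(tensor_name, "output.weight") != 0;
}
```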
Details for these perplexity runs can be found here (issue #406).
As far as I can tell, we are now on par with the best known GPTQ result for 7B, and better for 13B by about 0.05.
Sounds like a good idea. For me personally, I/O is the bottleneck, since I store the models on a NAS.
It might be a good idea to get #953 merged first, which implements unit tests for the quantization. But that requires an improvement to the test samples.
I'm still a bit skeptical about whether chasing after RMSE is the right thing to do.
Let me explain what I mean: originally, the Q4 methods calculate max(abs()) and divide that by 7. #729 intends to calculate the signed max and divide by 8 instead. This PR tries to find the divisor that minimizes the RMS error. But maybe the princess is in another castle?
What if it actually helps perplexity if we clip the largest values somewhat, even if that comes at a higher RMS error?
^
p |
e |
r | *
p | orig *
l | * #729
e | * *
x | - - - - - - - - - - - - - - - - < RMSE optimum #1106
i |
t | * < perplexity optimum?
y |
+-----|------|------|------------->
7 8 ?
scale factor
So the approach to find that would be: use #729, choose a value in the interesting range of maybe [7, 11], quantize the model, do a perplexity run, lather, rinse, repeat.
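A minimal sketch of one step of that sweep (the function and its signature are made up, not ggml's API): quantize each block with a fixed, user-chosen divisor, so that every perplexity run measures exactly one point on the curve above.

```c
#include <math.h>
#include <stdint.h>

// Illustrative only: quantize one block of n weights with a fixed divisor
// (7, 8, ..., 11). With divisor = 8 the extreme value lands exactly on -8;
// larger divisors effectively clip the largest values.
static void quantize_block_fixed_divisor(const float * x, int n, float divisor,
                                         float * d_out, int8_t * q_out) {
    // signed max as in #729: keep the sign of the largest-magnitude element
    float max = x[0];
    for (int i = 1; i < n; ++i) {
        if (fabsf(x[i]) > fabsf(max)) max = x[i];
    }
    const float d  = max / -divisor;   // extreme value maps to -divisor
    const float id = d != 0.0f ? 1.0f/d : 0.0f;
    for (int i = 0; i < n; ++i) {
        int q = (int) roundf(x[i]*id);
        if (q < -8) q = -8;            // out-of-range values get clipped here
        if (q >  7) q =  7;
        q_out[i] = (int8_t) q;
    }
    *d_out = d;
}
```

One would then quantize the full model once per divisor, run the perplexity tool over wikitext for each, and plot perplexity against the divisor to locate the optimum sketched above.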
@ikawrakow
Just made a full cuBLAS run on 13B using Q4_3, without RMSE optimization and with the output in F16 precision, and got: 5.3075
main: seed = 1682170268
llama.cpp: loading model from ../models/13B/ggml-model-q4_3-output-f16.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 6 (mostly Q4_3)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 9734493.73 KB
llama_model_load_internal: mem required = 11554.34 MB (+ 1608.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size = 400.00 MB
system_info: n_threads = 12 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
perplexity : calculating perplexity over 655 chunks, batch_size=512
2.93 seconds per pass - ETA 32 minutes
[1]3.7052,[2]4.1553,[3]4.9530,[4]5.3817,[5]5.5598,[6]5.4938,[7]5.6338,[8]5.7492,[9]6.0136,[10]6.2525,[11]6.4388,[12]6.4983,[13]6.4590,[14]6.5567,[15]6.7657,[16]6.4420,[17]6.3526,[18]6.3318,[19]6.0375,[20]6.0170,[21]5.9417,[22]5.7639,[23]5.7352,[24]5.6400,[25]5.6548,[26]5.5023,[27]5.3302,[28]5.2330,[29]5.1565,[30]5.0200,[31]4.9747,[32]4.9854,[33]4.9409,[34]4.9796,[35]4.9984,[36]5.0189,[37]5.0113,[38]5.0078,[39]5.0349,[40]5.0774,[41]5.0999,[42]5.1325,[43]5.0970,[44]5.1402,[45]5.1450,[46]5.1202,[47]5.1464,[48]5.1286,[49]5.1304,[50]5.0999,[51]5.1075,[52]5.1012,[53]5.1478,[54]5.1379,[55]5.1200,[56]5.1404,[57]5.1594,[58]5.1818,[59]5.2003,[60]5.2387,[61]5.2315,[62]5.2862,[63]5.3117,[64]5.3227,[65]5.3586,[66]5.3594,[67]5.3771,[68]5.3901,[69]5.4182,[70]5.4484,[71]5.4717,[72]5.5064,[73]5.5534,[74]5.5610,[75]5.5703,[76]5.5838,[77]5.5960,[78]5.5827,[79]5.6087,[80]5.6043,[81]5.6133,[82]5.6107,[83]5.5655,[84]5.5553,[85]5.5483,[86]5.5331,[87]5.4686,[88]5.4265,[89]5.4044,[90]5.3939,[91]5.4152,[92]5.4128,[93]5.4153,[94]5.4153,[95]5.4412,[96]5.4383,[97]5.4336,[98]5.4300,[99]5.4225,[100]5.4204,[101]5.4440,[102]5.4397,[103]5.4550,[104]5.4598,[105]5.4610,[106]5.4753,[107]5.4745,[108]5.4894,[109]5.4882,[110]5.4833,[111]5.5022,[112]5.5191,[113]5.5182,[114]5.5175,[115]5.5215,[116]5.5093,[117]5.5097,[118]5.5330,[119]5.5514,[120]5.5800,[121]5.5945,[122]5.6158,[123]5.6525,[124]5.6684,[125]5.6634,[126]5.6990,[127]5.7300,[128]5.7574,[129]5.7454,[130]5.7539,[131]5.7490,[132]5.7446,[133]5.7318,[134]5.7402,[135]5.7392,[136]5.7311,[137]5.7266,[138]5.7136,[139]5.7058,[140]5.7050,[141]5.6776,[142]5.6734,[143]5.6487,[144]5.6326,[145]5.6238,[146]5.6132,[147]5.6179,[148]5.6202,[149]5.6169,[150]5.6165,[151]5.6212,[152]5.6153,[153]5.6064,[154]5.6005,[155]5.6066,[156]5.6042,[157]5.6202,[158]5.6226,[159]5.6232,[160]5.6268,[161]5.6384,[162]5.6133,[163]5.6034,[164]5.5826,[165]5.5576,[166]5.5342,[167]5.5020,[168]5.4757,[169]5.4622,[170]5.4531,[171]5.4325,[172]5.4202,[173]5.4072,[174]5.3805,[175]5.3599,[176]5.3462,[177]5.3294,[178]5.3096,[179]5.2962,[180]5.2892,[181]5.2729,[182]5.2565,[183]5.2445,[184]5.2435,[185]5.2367,[186]5.2377,[187]5.2436,[188]5.2419,[189]5.2583,[190]5.2585,[191]5.2758,[192]5.2892,[193]5.3032,[194]5.3145,[195]5.3332,[196]5.3447,[197]5.3635,[198]5.3770,[199]5.3788,[200]5.3797,[201]5.3730,[202]5.3862,[203]5.3922,[204]5.3871,[205]5.3960,[206]5.4014,[207]5.3972,[208]5.4033,[209]5.4065,[210]5.4120,[211]5.4227,[212]5.4292,[213]5.4386,[214]5.4415,[215]5.4445,[216]5.4570,[217]5.4734,[218]5.4867,[219]5.4863,[220]5.4836,[221]5.4789,[222]5.4792,[223]5.4732,[224]5.4665,[225]5.4628,[226]5.4829,[227]5.4883,[228]5.4956,[229]5.5025,[230]5.4989,[231]5.5143,[232]5.5036,[233]5.4888,[234]5.4747,[235]5.4525,[236]5.4473,[237]5.4386,[238]5.4417,[239]5.4306,[240]5.4218,[241]5.4251,[242]5.4265,[243]5.4257,[244]5.4163,[245]5.4128,[246]5.4028,[247]5.3930,[248]5.3868,[249]5.3837,[250]5.3874,[251]5.3792,[252]5.3743,[253]5.3653,[254]5.3607,[255]5.3515,[256]5.3350,[257]5.3249,[258]5.3183,[259]5.3173,[260]5.3090,[261]5.3038,[262]5.2997,[263]5.2947,[264]5.2711,[265]5.2707,[266]5.2679,[267]5.2618,[268]5.2684,[269]5.2676,[270]5.2685,[271]5.2749,[272]5.2778,[273]5.2794,[274]5.2802,[275]5.2861,[276]5.2918,[277]5.3039,[278]5.3125,[279]5.3207,[280]5.3244,[281]5.3339,[282]5.3395,[283]5.3517,[284]5.3602,[285]5.3681,[286]5.3805,[287]5.3778,[288]5.3831,[289]5.3770,[290]5.3628,[291]5.3498,[292]5.3364,[293]5.3246,[294]5.3254,[295]5.3256,[296]5.3304,[297]5.3295,[298]5.3317,[299]5.3295,[300]5.3208,[301]5.3211,[302]5.3147,[303]5.3065,[304]5.2992,[305]5.2967,[30
6]5.2864,[307]5.2893,[308]5.2904,[309]5.2772,[310]5.2743,[311]5.2698,[312]5.2711,[313]5.2657,[314]5.2642,[315]5.2510,[316]5.2470,[317]5.2344,[318]5.2184,[319]5.2289,[320]5.2399,[321]5.2447,[322]5.2418,[323]5.2358,[324]5.2339,[325]5.2436,[326]5.2452,[327]5.2460,[328]5.2495,[329]5.2540,[330]5.2561,[331]5.2663,[332]5.2627,[333]5.2701,[334]5.2656,[335]5.2605,[336]5.2629,[337]5.2619,[338]5.2615,[339]5.2571,[340]5.2539,[341]5.2602,[342]5.2634,[343]5.2674,[344]5.2677,[345]5.2692,[346]5.2676,[347]5.2712,[348]5.2750,[349]5.2773,[350]5.2754,[351]5.2767,[352]5.2769,[353]5.2716,[354]5.2725,[355]5.2774,[356]5.2802,[357]5.2774,[358]5.2854,[359]5.2874,[360]5.2843,[361]5.2843,[362]5.2913,[363]5.3020,[364]5.3072,[365]5.3110,[366]5.3126,[367]5.3213,[368]5.3190,[369]5.3204,[370]5.3224,[371]5.3185,[372]5.3231,[373]5.3270,[374]5.3251,[375]5.3248,[376]5.3306,[377]5.3271,[378]5.3296,[379]5.3330,[380]5.3264,[381]5.3235,[382]5.3196,[383]5.3176,[384]5.3176,[385]5.3166,[386]5.3152,[387]5.3152,[388]5.3126,[389]5.3088,[390]5.3036,[391]5.2979,[392]5.2944,[393]5.2939,[394]5.2970,[395]5.2963,[396]5.2909,[397]5.2973,[398]5.3014,[399]5.3083,[400]5.3077,[401]5.3085,[402]5.3097,[403]5.3119,[404]5.3173,[405]5.3023,[406]5.2982,[407]5.2970,[408]5.2980,[409]5.3090,[410]5.3178,[411]5.3271,[412]5.3412,[413]5.3513,[414]5.3571,[415]5.3630,[416]5.3702,[417]5.3798,[418]5.3822,[419]5.3871,[420]5.3947,[421]5.4045,[422]5.4077,[423]5.4134,[424]5.4224,[425]5.4301,[426]5.4360,[427]5.4401,[428]5.4473,[429]5.4509,[430]5.4572,[431]5.4696,[432]5.4727,[433]5.4721,[434]5.4688,[435]5.4701,[436]5.4730,[437]5.4812,[438]5.4887,[439]5.4856,[440]5.4850,[441]5.4808,[442]5.4796,[443]5.4807,[444]5.4824,[445]5.4815,[446]5.4835,[447]5.4859,[448]5.4892,[449]5.4876,[450]5.4888,[451]5.4862,[452]5.4707,[453]5.4614,[454]5.4560,[455]5.4563,[456]5.4601,[457]5.4612,[458]5.4594,[459]5.4592,[460]5.4665,[461]5.4622,[462]5.4588,[463]5.4568,[464]5.4564,[465]5.4542,[466]5.4466,[467]5.4453,[468]5.4435,[469]5.4444,[470]5.4433,[471]5.4383,[472]5.4386,[473]5.4341,[474]5.4329,[475]5.4263,[476]5.4239,[477]5.4154,[478]5.4128,[479]5.4132,[480]5.4156,[481]5.4156,[482]5.4110,[483]5.4068,[484]5.4078,[485]5.4011,[486]5.3950,[487]5.3939,[488]5.3917,[489]5.3865,[490]5.3832,[491]5.3798,[492]5.3734,[493]5.3707,[494]5.3689,[495]5.3670,[496]5.3630,[497]5.3569,[498]5.3544,[499]5.3510,[500]5.3431,[501]5.3361,[502]5.3351,[503]5.3342,[504]5.3265,[505]5.3262,[506]5.3268,[507]5.3214,[508]5.3177,[509]5.3182,[510]5.3203,[511]5.3246,[512]5.3286,[513]5.3311,[514]5.3362,[515]5.3320,[516]5.3310,[517]5.3310,[518]5.3311,[519]5.3332,[520]5.3344,[521]5.3356,[522]5.3370,[523]5.3378,[524]5.3431,[525]5.3457,[526]5.3462,[527]5.3477,[528]5.3425,[529]5.3434,[530]5.3398,[531]5.3392,[532]5.3440,[533]5.3467,[534]5.3451,[535]5.3473,[536]5.3432,[537]5.3414,[538]5.3465,[539]5.3473,[540]5.3487,[541]5.3486,[542]5.3500,[543]5.3521,[544]5.3534,[545]5.3525,[546]5.3526,[547]5.3494,[548]5.3452,[549]5.3454,[550]5.3434,[551]5.3409,[552]5.3389,[553]5.3360,[554]5.3338,[555]5.3318,[556]5.3310,[557]5.3328,[558]5.3294,[559]5.3299,[560]5.3285,[561]5.3285,[562]5.3261,[563]5.3258,[564]5.3299,[565]5.3309,[566]5.3316,[567]5.3295,[568]5.3307,[569]5.3292,[570]5.3318,[571]5.3331,[572]5.3339,[573]5.3342,[574]5.3312,[575]5.3295,[576]5.3288,[577]5.3272,[578]5.3254,[579]5.3252,[580]5.3200,[581]5.3171,[582]5.3170,[583]5.3178,[584]5.3183,[585]5.3126,[586]5.3071,[587]5.3076,[588]5.3120,[589]5.3169,[590]5.3199,[591]5.3216,[592]5.3205,[593]5.3165,[594]5.3180,[595]5.3166,[596]5.3204,[597]5.3183,[598]5.3151,[599]5.3178,[600]5.3169,[601]5.3157,[602]5
.3157,[603]5.3185,[604]5.3191,[605]5.3218,[606]5.3231,[607]5.3217,[608]5.3188,[609]5.3197,[610]5.3238,[611]5.3227,[612]5.3249,[613]5.3220,[614]5.3179,[615]5.3120,[616]5.3148,[617]5.3099,[618]5.3056,[619]5.3012,[620]5.2903,[621]5.2852,[622]5.2833,[623]5.2846,[624]5.2852,[625]5.2859,[626]5.2856,[627]5.2882,[628]5.2890,[629]5.2894,[630]5.2925,[631]5.2970,[632]5.3017,[633]5.3007,[634]5.3036,[635]5.3033,[636]5.2997,[637]5.2961,[638]5.2979,[639]5.2949,[640]5.2957,[641]5.2960,[642]5.3010,[643]5.3026,[644]5.3044,[645]5.3029,[646]5.3063,[647]5.3014,[648]5.3024,[649]5.3027,[650]5.3055,[651]5.3097,[652]5.3100,[653]5.3137,[654]5.3084,[655]5.3075,
llama_print_timings: load time = 6119.84 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: prompt eval time = 1858813.21 ms / 335360 tokens ( 5.54 ms per token)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: total time = 1889707.90 ms
Will make another run, this time using RMSE optimization (i.e. the same as the one in the OP) and double-check the reported 5.3117 result. But if it is confirmed, it would indicate that the RMSE optimization is actually making the result worse in this case, for some reason.
My result for 13B, using Q4_3 with RMSE optimization + F16 output, is: 5.2962
I think this result makes more sense, since it is in line with the expectation that I described here: https://github.com/ggerganov/llama.cpp/discussions/406#discussioncomment-5689456
main: seed = 1682172642
llama.cpp: loading model from ../models/13B/ggml-model-q4_3-output-f16-rmse.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 6 (mostly Q4_3)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 9734493.73 KB
llama_model_load_internal: mem required = 11554.34 MB (+ 1608.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size = 400.00 MB
system_info: n_threads = 12 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
perplexity : calculating perplexity over 655 chunks, batch_size=512
2.94 seconds per pass - ETA 32 minutes
[1]3.7362,[2]4.1744,[3]4.9576,[4]5.3621,[5]5.5410,[6]5.4788,[7]5.6392,[8]5.7500,[9]6.0088,[10]6.2366,[11]6.4228,[12]6.4859,[13]6.4491,[14]6.5428,[15]6.7439,[16]6.4225,[17]6.3396,[18]6.3169,[19]6.0233,[20]6.0024,[21]5.9256,[22]5.7530,[23]5.7201,[24]5.6258,[25]5.6327,[26]5.4845,[27]5.3094,[28]5.2083,[29]5.1320,[30]4.9981,[31]4.9567,[32]4.9675,[33]4.9237,[34]4.9636,[35]4.9806,[36]5.0033,[37]4.9960,[38]4.9915,[39]5.0202,[40]5.0616,[41]5.0862,[42]5.1202,[43]5.0861,[44]5.1307,[45]5.1348,[46]5.1096,[47]5.1370,[48]5.1183,[49]5.1225,[50]5.0927,[51]5.0998,[52]5.0920,[53]5.1385,[54]5.1290,[55]5.1113,[56]5.1311,[57]5.1489,[58]5.1710,[59]5.1904,[60]5.2260,[61]5.2188,[62]5.2735,[63]5.2982,[64]5.3100,[65]5.3463,[66]5.3455,[67]5.3634,[68]5.3761,[69]5.4045,[70]5.4349,[71]5.4582,[72]5.4919,[73]5.5385,[74]5.5451,[75]5.5550,[76]5.5687,[77]5.5802,[78]5.5664,[79]5.5933,[80]5.5871,[81]5.5951,[82]5.5919,[83]5.5466,[84]5.5365,[85]5.5301,[86]5.5156,[87]5.4509,[88]5.4070,[89]5.3858,[90]5.3750,[91]5.3960,[92]5.3922,[93]5.3940,[94]5.3927,[95]5.4193,[96]5.4162,[97]5.4128,[98]5.4089,[99]5.4020,[100]5.3994,[101]5.4223,[102]5.4177,[103]5.4330,[104]5.4377,[105]5.4390,[106]5.4531,[107]5.4517,[108]5.4666,[109]5.4659,[110]5.4606,[111]5.4784,[112]5.4950,[113]5.4943,[114]5.4930,[115]5.4972,[116]5.4852,[117]5.4847,[118]5.5081,[119]5.5259,[120]5.5549,[121]5.5701,[122]5.5912,[123]5.6276,[124]5.6452,[125]5.6402,[126]5.6758,[127]5.7086,[128]5.7369,[129]5.7256,[130]5.7341,[131]5.7301,[132]5.7257,[133]5.7132,[134]5.7222,[135]5.7222,[136]5.7139,[137]5.7100,[138]5.6974,[139]5.6896,[140]5.6884,[141]5.6614,[142]5.6575,[143]5.6327,[144]5.6168,[145]5.6083,[146]5.5972,[147]5.6019,[148]5.6050,[149]5.6019,[150]5.6011,[151]5.6057,[152]5.5999,[153]5.5903,[154]5.5846,[155]5.5908,[156]5.5892,[157]5.6045,[158]5.6062,[159]5.6072,[160]5.6110,[161]5.6225,[162]5.5972,[163]5.5878,[164]5.5676,[165]5.5427,[166]5.5196,[167]5.4880,[168]5.4613,[169]5.4483,[170]5.4390,[171]5.4185,[172]5.4062,[173]5.3930,[174]5.3661,[175]5.3457,[176]5.3327,[177]5.3162,[178]5.2963,[179]5.2832,[180]5.2757,[181]5.2597,[182]5.2438,[183]5.2320,[184]5.2312,[185]5.2241,[186]5.2253,[187]5.2309,[188]5.2284,[189]5.2448,[190]5.2451,[191]5.2620,[192]5.2756,[193]5.2900,[194]5.3014,[195]5.3208,[196]5.3325,[197]5.3513,[198]5.3647,[199]5.3667,[200]5.3676,[201]5.3610,[202]5.3735,[203]5.3792,[204]5.3744,[205]5.3834,[206]5.3888,[207]5.3851,[208]5.3906,[209]5.3943,[210]5.3998,[211]5.4100,[212]5.4163,[213]5.4254,[214]5.4288,[215]5.4319,[216]5.4438,[217]5.4603,[218]5.4738,[219]5.4735,[220]5.4706,[221]5.4657,[222]5.4658,[223]5.4597,[224]5.4532,[225]5.4496,[226]5.4696,[227]5.4756,[228]5.4828,[229]5.4899,[230]5.4862,[231]5.5013,[232]5.4910,[233]5.4762,[234]5.4620,[235]5.4403,[236]5.4352,[237]5.4269,[238]5.4304,[239]5.4193,[240]5.4102,[241]5.4136,[242]5.4153,[243]5.4147,[244]5.4049,[245]5.4014,[246]5.3912,[247]5.3816,[248]5.3755,[249]5.3723,[250]5.3757,[251]5.3675,[252]5.3627,[253]5.3537,[254]5.3493,[255]5.3401,[256]5.3237,[257]5.3136,[258]5.3070,[259]5.3062,[260]5.2981,[261]5.2931,[262]5.2891,[263]5.2843,[264]5.2605,[265]5.2605,[266]5.2575,[267]5.2515,[268]5.2580,[269]5.2572,[270]5.2580,[271]5.2640,[272]5.2668,[273]5.2678,[274]5.2685,[275]5.2745,[276]5.2801,[277]5.2921,[278]5.3005,[279]5.3085,[280]5.3122,[281]5.3216,[282]5.3269,[283]5.3391,[284]5.3474,[285]5.3553,[286]5.3679,[287]5.3645,[288]5.3696,[289]5.3634,[290]5.3495,[291]5.3367,[292]5.3234,[293]5.3117,[294]5.3125,[295]5.3125,[296]5.3172,[297]5.3161,[298]5.3181,[299]5.3160,[300]5.3074,[301]5.3077,[302]5.3015,[303]5.2931,[304]5.2860,[305]5.2835,[30
6]5.2733,[307]5.2761,[308]5.2769,[309]5.2637,[310]5.2612,[311]5.2570,[312]5.2585,[313]5.2533,[314]5.2514,[315]5.2387,[316]5.2343,[317]5.2222,[318]5.2060,[319]5.2165,[320]5.2273,[321]5.2322,[322]5.2293,[323]5.2237,[324]5.2220,[325]5.2315,[326]5.2329,[327]5.2335,[328]5.2373,[329]5.2422,[330]5.2444,[331]5.2547,[332]5.2512,[333]5.2586,[334]5.2541,[335]5.2490,[336]5.2513,[337]5.2502,[338]5.2501,[339]5.2458,[340]5.2431,[341]5.2495,[342]5.2528,[343]5.2568,[344]5.2571,[345]5.2586,[346]5.2569,[347]5.2604,[348]5.2641,[349]5.2661,[350]5.2642,[351]5.2655,[352]5.2658,[353]5.2604,[354]5.2612,[355]5.2661,[356]5.2691,[357]5.2663,[358]5.2743,[359]5.2762,[360]5.2728,[361]5.2725,[362]5.2792,[363]5.2900,[364]5.2951,[365]5.2990,[366]5.3007,[367]5.3094,[368]5.3074,[369]5.3089,[370]5.3109,[371]5.3069,[372]5.3116,[373]5.3154,[374]5.3134,[375]5.3129,[376]5.3187,[377]5.3151,[378]5.3176,[379]5.3211,[380]5.3144,[381]5.3114,[382]5.3077,[383]5.3059,[384]5.3061,[385]5.3048,[386]5.3036,[387]5.3034,[388]5.3007,[389]5.2969,[390]5.2918,[391]5.2859,[392]5.2825,[393]5.2821,[394]5.2854,[395]5.2847,[396]5.2796,[397]5.2859,[398]5.2901,[399]5.2971,[400]5.2966,[401]5.2974,[402]5.2986,[403]5.3011,[404]5.3066,[405]5.2918,[406]5.2877,[407]5.2867,[408]5.2875,[409]5.2985,[410]5.3076,[411]5.3169,[412]5.3308,[413]5.3409,[414]5.3470,[415]5.3528,[416]5.3598,[417]5.3696,[418]5.3721,[419]5.3769,[420]5.3844,[421]5.3942,[422]5.3975,[423]5.4033,[424]5.4122,[425]5.4199,[426]5.4259,[427]5.4301,[428]5.4373,[429]5.4410,[430]5.4472,[431]5.4596,[432]5.4627,[433]5.4620,[434]5.4587,[435]5.4601,[436]5.4629,[437]5.4710,[438]5.4782,[439]5.4755,[440]5.4748,[441]5.4704,[442]5.4692,[443]5.4702,[444]5.4721,[445]5.4712,[446]5.4733,[447]5.4756,[448]5.4788,[449]5.4773,[450]5.4784,[451]5.4755,[452]5.4599,[453]5.4503,[454]5.4450,[455]5.4453,[456]5.4495,[457]5.4508,[458]5.4491,[459]5.4490,[460]5.4563,[461]5.4523,[462]5.4489,[463]5.4468,[464]5.4465,[465]5.4443,[466]5.4369,[467]5.4360,[468]5.4340,[469]5.4352,[470]5.4341,[471]5.4292,[472]5.4299,[473]5.4251,[474]5.4240,[475]5.4171,[476]5.4148,[477]5.4065,[478]5.4035,[479]5.4036,[480]5.4061,[481]5.4062,[482]5.4015,[483]5.3973,[484]5.3980,[485]5.3913,[486]5.3849,[487]5.3837,[488]5.3815,[489]5.3761,[490]5.3730,[491]5.3698,[492]5.3630,[493]5.3604,[494]5.3585,[495]5.3561,[496]5.3521,[497]5.3457,[498]5.3431,[499]5.3395,[500]5.3314,[501]5.3245,[502]5.3236,[503]5.3225,[504]5.3149,[505]5.3145,[506]5.3150,[507]5.3097,[508]5.3060,[509]5.3066,[510]5.3088,[511]5.3130,[512]5.3170,[513]5.3194,[514]5.3248,[515]5.3208,[516]5.3198,[517]5.3197,[518]5.3198,[519]5.3219,[520]5.3233,[521]5.3245,[522]5.3258,[523]5.3265,[524]5.3319,[525]5.3347,[526]5.3353,[527]5.3369,[528]5.3314,[529]5.3323,[530]5.3287,[531]5.3282,[532]5.3329,[533]5.3356,[534]5.3337,[535]5.3357,[536]5.3316,[537]5.3298,[538]5.3347,[539]5.3355,[540]5.3371,[541]5.3369,[542]5.3382,[543]5.3404,[544]5.3416,[545]5.3406,[546]5.3408,[547]5.3375,[548]5.3334,[549]5.3334,[550]5.3313,[551]5.3286,[552]5.3266,[553]5.3238,[554]5.3216,[555]5.3197,[556]5.3189,[557]5.3208,[558]5.3175,[559]5.3178,[560]5.3164,[561]5.3166,[562]5.3141,[563]5.3140,[564]5.3182,[565]5.3194,[566]5.3201,[567]5.3182,[568]5.3192,[569]5.3177,[570]5.3204,[571]5.3216,[572]5.3224,[573]5.3228,[574]5.3200,[575]5.3184,[576]5.3177,[577]5.3163,[578]5.3144,[579]5.3144,[580]5.3091,[581]5.3061,[582]5.3061,[583]5.3069,[584]5.3074,[585]5.3016,[586]5.2962,[587]5.2965,[588]5.3008,[589]5.3058,[590]5.3088,[591]5.3105,[592]5.3094,[593]5.3054,[594]5.3068,[595]5.3052,[596]5.3091,[597]5.3071,[598]5.3039,[599]5.3065,[600]5.3056,[601]5.3045,[602]5
.3046,[603]5.3073,[604]5.3079,[605]5.3106,[606]5.3120,[607]5.3105,[608]5.3077,[609]5.3086,[610]5.3126,[611]5.3116,[612]5.3138,[613]5.3109,[614]5.3070,[615]5.3011,[616]5.3036,[617]5.2986,[618]5.2942,[619]5.2898,[620]5.2789,[621]5.2739,[622]5.2721,[623]5.2734,[624]5.2737,[625]5.2745,[626]5.2741,[627]5.2768,[628]5.2776,[629]5.2780,[630]5.2811,[631]5.2854,[632]5.2902,[633]5.2891,[634]5.2920,[635]5.2918,[636]5.2884,[637]5.2848,[638]5.2868,[639]5.2838,[640]5.2845,[641]5.2849,[642]5.2899,[643]5.2915,[644]5.2933,[645]5.2919,[646]5.2953,[647]5.2902,[648]5.2913,[649]5.2914,[650]5.2943,[651]5.2983,[652]5.2987,[653]5.3024,[654]5.2970,[655]5.2962,
llama_print_timings: load time = 6171.79 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: prompt eval time = 1846949.17 ms / 335360 tokens ( 5.51 ms per token)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: total time = 1878169.61 ms
@ggerganov Are these results with or without the changes you made to Q4_3 after I opened this PR (and reported the results)?
It includes all changes from today related to Q4_3 quantization. Maybe this is the source of the difference, although it's still strange, since the Q4_3 changes should only improve the performance. Of course, we cannot expect exactly the same results, but the difference is rather big, so I'm not 100% sure. I can do one extra Q4_3 13B run with the build from yesterday to make sure.
@ggerganov Rebased this branch on the latest master, re-quantized, and re-ran the perplexity. Now I get the lower result with OpenBLAS as well (5.2961, so actually 0.0001 lower than cuBLAS). So, something else has happened that positively impacts the results. Another observation is that the OpenBLAS and cuBLAS results are not identical, as they are for the fp16 model. They are very close, but not exactly the same. See details.
Is it possible this affects this comment you made in #729?
llama_print_timings: load time = 31077.61 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: prompt eval time = 9573081.81 ms / 335360 tokens ( 28.55 ms per token)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: total time = 9604412.75 ms
I think we cannot expect cuBLAS and OpenBLAS to be exactly the same, because cuBLAS dequantizes x to F16, casts y to F16, and performs an F16 mat mul, while OpenBLAS dequantizes x to F32 and performs an F32 mat mul (if I'm not mistaken).
That's not exactly the case: when multiplying q x f32, cuBLAS dequantizes to f32 and does an f32 x f32 mat mul. The only difference with OpenBLAS is when performing an f16 x f32 mat mul (ggml_compute_forward_mul_mat_f16_f32). In this case, src1 is converted to f16 instead of converting src0 to f32, and an f16 x f16 mat mul is done.
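For intuition, a small standalone example of why an extra F16 round-trip alone is enough to shift the last digits (this assumes ggml.h exposes the ggml_fp32_to_fp16 / ggml_fp16_to_fp32 conversion helpers; any F32→F16→F32 round-trip would show the same effect):

```c
#include <stdio.h>
#include "ggml.h"   // assumed to declare ggml_fp32_to_fp16 / ggml_fp16_to_fp32

int main(void) {
    const float x[4] = { 0.1234567f, -1.9876543f,  3.1415927f, 0.0001234f };
    const float y[4] = { 0.7654321f,  0.3333333f, -2.7182818f, 1.0000001f };

    float dot_f32 = 0.0f, dot_f16 = 0.0f;
    for (int i = 0; i < 4; ++i) {
        // keep y in F32 on one path, round it to F16 and back on the other
        dot_f32 += x[i] * y[i];
        dot_f16 += x[i] * ggml_fp16_to_fp32(ggml_fp32_to_fp16(y[i]));
    }
    // the two sums agree only to a few decimal places
    printf("f32 path: %.9f\nf16 path: %.9f\n", dot_f32, dot_f16);
    return 0;
}
```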
@ggerganov I propose we close this PR. Although there is some benefit from RMSE minimization for QX_1 and QX_3 quantization of the 7B model, the benefit mostly goes away for 13B (and Q5_1 is actually worse with RMSE minimization than without at 13B).
You are minimizing error - why should it be worse? It may be worse for one case but better for another, no?
By that I mean that the perplexity for a wide range of other files (other than en-wikitext or whatever) may be better. And not for one model but for another...
Quantization is here to compress the data as much as possible without affecting the model's quality much.