Variable bit rate quantization
Variable bit rate is commonly used in audio and video compression, so why not try it on LLMs?
My guess is that a locally adaptive variable bit rate would require a major change to ggml. So the least one can try is to see whether using a different number of bits in different network layers would be beneficial.
As a first step, I simply changed llama.cpp to not quantize one of the tensor types in addition to output.weight (which is already known to have a significant impact on generation quality) and calculated perplexity for Q2_4 quantization (see issue #1240). I picked 2-bit quantization because there the difference between a quantized and a non-quantized tensor is largest, so the effect is easiest to see. The following table summarizes the results (PPL improvement is the perplexity with only output.weight in fp16 minus the perplexity with output.weight plus the indicated tensor in fp16; the table is sorted in decreasing order of impact):
| Tensor type | PPL improvement |
|---|---|
| feed_forward.w2 | 1.0832 |
| attention.wv | 0.7819 |
| feed_forward.w3 | 0.5658 |
| feed_forward.w1 | 0.3917 |
| attention.wo | 0.3902 |
| attention.wq | 0.1250 |
| attention.wk | 0.1090 |
| tok_embeddings | < 0.05 |
Interesting to note that the tok_embeddings tensor, which has been considered worthy of a dedicated quantization type LLAMA_FTYPE_MOSTLY_Q4_1_SOME_F16 where it is kept as fp16, has basically no influence on generation quality even when quantized with 2 bits.
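For reference, keeping one extra tensor type in fp16 for this experiment amounts to a one-line condition in the quantization loop. A minimal sketch, assuming a helper along these lines (the function and variable names are mine, not the actual llama.cpp code):

```cpp
#include <string>

// Sketch only: decide whether a tensor should be left in fp16 while the rest
// of the model is quantized to Q2_4. "extra" would be e.g. "feed_forward.w2.weight".
static bool keep_as_fp16(const std::string & tensor_name, const std::string & extra) {
    return tensor_name == "output.weight" ||
           tensor_name.find(extra) != std::string::npos;
}
```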
Based on these findings, I ran 2- and 3-bit perplexity calculations where the top 2 tensors feed_forward.w2 and attention.wv are quantized using Q5_1 instead of Q2_4 or Q3_4. Here are the results:
| Model | Quantization | File size | Perplexity |
|---|---|---|---|
| 7B | Q2_4 + Q5_1 | 3.16G | 6.8160 |
| 7B | Q3_4 + Q5_1 | 3.7G | 6.0996 |
| 13B | Q2_4 + Q5_1 | 6.1G | 5.7880 |
| 13B | Q3_4 + Q5_1 | 7.1G | 5.3715 |
Interesting to note that the mixed Q3_4 + Q5_1 quantization has a lower perplexity than any 4-bit quantization listed on the main page for the 7B model despite the smaller quantized model size.
I have not explored only using Q5_1 for a subset of the feed_forward.w2 and attention.wv tensors. The quantization RMSE for these two tensor types increases with layer depth, so perhaps it would be sufficient to use a higher bit rate for only the last few layers, thus reducing the quantized model size compared to what is given in the above table.
Here are the complete runs for the above table. There is no new quantization type, just a quick hack where I added
if (tensor.name == "output.weight" ||
tensor.name.find("attention.wv.weight") != std::string::npos ||
tensor.name.find("feed_forward.w2.weight") != std::string::npos) {
new_type = GGML_TYPE_Q5_1;
}
just after this line in llama.cpp.
7B, Q2_4 + Q5_1
main: seed = 1682863345 llama.cpp: loading model from ../build/junk.bin llama_model_load_internal: format = ggjt v1 (latest) llama_model_load_internal: n_vocab = 32000 llama_model_load_internal: n_ctx = 512 llama_model_load_internal: n_embd = 4096 llama_model_load_internal: n_mult = 256 llama_model_load_internal: n_head = 32 llama_model_load_internal: n_layer = 32 llama_model_load_internal: n_rot = 128 llama_model_load_internal: ftype = 15 (mostly Q2_4) llama_model_load_internal: n_ff = 11008 llama_model_load_internal: n_parts = 1 llama_model_load_internal: model size = 7B llama_model_load_internal: ggml ctx size = 59.11 KB llama_model_load_internal: mem required = 5026.65 MB (+ 1026.00 MB per state) llama_init_from_file: kv self size = 256.00 MBsystem_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | perplexity : calculating perplexity over 655 chunks, batch_size=512 1.61 seconds per pass - ETA 17 minutes [1]4.8689,[2]5.5380,[3]6.3512,[4]7.0241,[5]7.0748,[6]7.0472,[7]7.2764,[8]7.3672,[9]7.7450,[10]8.0356,[11]8.2671,[12]8.3551,[13]8.2873,[14]8.4518,[15]8.7368,[16]8.2773,[17]8.1054,[18]8.0547,[19]7.6308,[20]7.5978,[21]7.4992,[22]7.3066,[23]7.2867,[24]7.1873,[25]7.1983,[26]7.0220,[27]6.8122,[28]6.7195,[29]6.6220,[30]6.4602,[31]6.4277,[32]6.4450,[33]6.3845,[34]6.4217,[35]6.4431,[36]6.4962,[37]6.4978,[38]6.5093,[39]6.5542,[40]6.6065,[41]6.6159,[42]6.6562,[43]6.6069,[44]6.6684,[45]6.6746,[46]6.6483,[47]6.6718,[48]6.6386,[49]6.6381,[50]6.5934,[51]6.5892,[52]6.5701,[53]6.6233,[54]6.6083,[55]6.5750,[56]6.6126,[57]6.6321,[58]6.6554,[59]6.6693,[60]6.7166,[61]6.7055,[62]6.7665,[63]6.8012,[64]6.8124,[65]6.8597,[66]6.8646,[67]6.8819,[68]6.8975,[69]6.9258,[70]6.9585,[71]6.9829,[72]7.0148,[73]7.0743,[74]7.0740,[75]7.0859,[76]7.0966,[77]7.1070,[78]7.0913,[79]7.1232,[80]7.1149,[81]7.1345,[82]7.1442,[83]7.0841,[84]7.0665,[85]7.0551,[86]7.0270,[87]6.9683,[88]6.9379,[89]6.9188,[90]6.9039,[91]6.9317,[92]6.9235,[93]6.9269,[94]6.9251,[95]6.9596,[96]6.9609,[97]6.9612,[98]6.9529,[99]6.9338,[100]6.9302,[101]6.9593,[102]6.9523,[103]6.9737,[104]6.9851,[105]6.9858,[106]7.0021,[107]6.9985,[108]7.0140,[109]7.0090,[110]7.0046,[111]7.0250,[112]7.0456,[113]7.0524,[114]7.0490,[115]7.0560,[116]7.0459,[117]7.0517,[118]7.0811,[119]7.1028,[120]7.1413,[121]7.1622,[122]7.1874,[123]7.2290,[124]7.2475,[125]7.2360,[126]7.2746,[127]7.3097,[128]7.3423,[129]7.3207,[130]7.3302,[131]7.3224,[132]7.3129,[133]7.3003,[134]7.3145,[135]7.3089,[136]7.2968,[137]7.2890,[138]7.2756,[139]7.2651,[140]7.2619,[141]7.2379,[142]7.2336,[143]7.2057,[144]7.1847,[145]7.1773,[146]7.1654,[147]7.1718,[148]7.1704,[149]7.1673,[150]7.1636,[151]7.1664,[152]7.1540,[153]7.1358,[154]7.1244,[155]7.1304,[156]7.1259,[157]7.1421,[158]7.1439,[159]7.1515,[160]7.1542,[161]7.1698,[162]7.1374,[163]7.1246,[164]7.0971,[165]7.0609,[166]7.0288,[167]6.9864,[168]6.9518,[169]6.9381,[170]6.9242,[171]6.8931,[172]6.8715,[173]6.8532,[174]6.8210,[175]6.7957,[176]6.7799,[177]6.7599,[178]6.7353,[179]6.7163,[180]6.7042,[181]6.6793,[182]6.6581,[183]6.6411,[184]6.6389,[185]6.6304,[186]6.6321,[187]6.6376,[188]6.6348,[189]6.6532,[190]6.6555,[191]6.6786,[192]6.6938,[193]6.7137,[194]6.7257,[195]6.7503,[196]6.7668,[197]6.7887,[198]6.8067,[199]6.8101,[200]6.8141,[201]6.8096,[202]6.8342,[203]6.8424,[204]6.8470,[205]6.8581,[206]6.8657,[207]6.8619,[208]6.8726,[209]6.8780,[210]6.8816,[211]6.8934,[212]6.9011,[213]6.9121,[214]6.9185,[215]6.9231,[216]6.
9367,[217]6.9563,[218]6.9708,[219]6.9702,[220]6.9650,[221]6.9601,[222]6.9569,[223]6.9457,[224]6.9389,[225]6.9346,[226]6.9554,[227]6.9661,[228]6.9727,[229]6.9772,[230]6.9739,[231]6.9931,[232]6.9831,[233]6.9635,[234]6.9467,[235]6.9311,[236]6.9226,[237]6.9113,[238]6.9156,[239]6.8994,[240]6.8880,[241]6.8921,[242]6.8965,[243]6.8932,[244]6.8800,[245]6.8776,[246]6.8658,[247]6.8522,[248]6.8445,[249]6.8409,[250]6.8471,[251]6.8399,[252]6.8348,[253]6.8244,[254]6.8182,[255]6.8047,[256]6.7833,[257]6.7706,[258]6.7620,[259]6.7591,[260]6.7503,[261]6.7464,[262]6.7397,[263]6.7344,[264]6.7181,[265]6.7174,[266]6.7143,[267]6.7061,[268]6.7162,[269]6.7142,[270]6.7137,[271]6.7213,[272]6.7260,[273]6.7251,[274]6.7284,[275]6.7389,[276]6.7449,[277]6.7640,[278]6.7748,[279]6.7843,[280]6.7867,[281]6.7962,[282]6.8019,[283]6.8184,[284]6.8264,[285]6.8361,[286]6.8503,[287]6.8506,[288]6.8577,[289]6.8476,[290]6.8306,[291]6.8143,[292]6.7983,[293]6.7837,[294]6.7858,[295]6.7836,[296]6.7889,[297]6.7877,[298]6.7916,[299]6.7881,[300]6.7765,[301]6.7751,[302]6.7663,[303]6.7569,[304]6.7478,[305]6.7439,[306]6.7292,[307]6.7309,[308]6.7356,[309]6.7174,[310]6.7112,[311]6.7044,[312]6.7065,[313]6.6999,[314]6.6979,[315]6.6804,[316]6.6768,[317]6.6583,[318]6.6363,[319]6.6513,[320]6.6646,[321]6.6703,[322]6.6661,[323]6.6598,[324]6.6586,[325]6.6711,[326]6.6704,[327]6.6734,[328]6.6757,[329]6.6834,[330]6.6872,[331]6.7003,[332]6.6962,[333]6.7041,[334]6.6979,[335]6.6901,[336]6.6922,[337]6.6888,[338]6.6896,[339]6.6829,[340]6.6784,[341]6.6871,[342]6.6901,[343]6.6962,[344]6.6964,[345]6.6953,[346]6.6908,[347]6.6946,[348]6.6986,[349]6.6998,[350]6.6957,[351]6.6962,[352]6.6968,[353]6.6902,[354]6.6924,[355]6.6989,[356]6.7021,[357]6.6984,[358]6.7090,[359]6.7119,[360]6.7066,[361]6.7052,[362]6.7144,[363]6.7251,[364]6.7318,[365]6.7372,[366]6.7380,[367]6.7469,[368]6.7429,[369]6.7430,[370]6.7444,[371]6.7381,[372]6.7424,[373]6.7475,[374]6.7446,[375]6.7432,[376]6.7509,[377]6.7460,[378]6.7483,[379]6.7546,[380]6.7455,[381]6.7411,[382]6.7370,[383]6.7360,[384]6.7360,[385]6.7342,[386]6.7340,[387]6.7332,[388]6.7288,[389]6.7223,[390]6.7152,[391]6.7065,[392]6.7026,[393]6.7020,[394]6.7049,[395]6.7036,[396]6.6953,[397]6.7020,[398]6.7058,[399]6.7139,[400]6.7138,[401]6.7148,[402]6.7157,[403]6.7171,[404]6.7237,[405]6.7152,[406]6.7121,[407]6.7122,[408]6.7134,[409]6.7250,[410]6.7371,[411]6.7497,[412]6.7668,[413]6.7790,[414]6.7873,[415]6.7931,[416]6.8026,[417]6.8166,[418]6.8216,[419]6.8294,[420]6.8394,[421]6.8512,[422]6.8556,[423]6.8647,[424]6.8765,[425]6.8857,[426]6.8922,[427]6.8970,[428]6.9061,[429]6.9121,[430]6.9217,[431]6.9381,[432]6.9418,[433]6.9406,[434]6.9354,[435]6.9354,[436]6.9371,[437]6.9470,[438]6.9553,[439]6.9516,[440]6.9503,[441]6.9448,[442]6.9418,[443]6.9432,[444]6.9441,[445]6.9416,[446]6.9440,[447]6.9471,[448]6.9513,[449]6.9487,[450]6.9482,[451]6.9434,[452]6.9351,[453]6.9262,[454]6.9219,[455]6.9226,[456]6.9276,[457]6.9302,[458]6.9276,[459]6.9285,[460]6.9373,[461]6.9347,[462]6.9338,[463]6.9396,[464]6.9390,[465]6.9360,[466]6.9289,[467]6.9300,[468]6.9310,[469]6.9335,[470]6.9342,[471]6.9290,[472]6.9336,[473]6.9272,[474]6.9292,[475]6.9246,[476]6.9274,[477]6.9200,[478]6.9199,[479]6.9293,[480]6.9351,[481]6.9373,[482]6.9325,[483]6.9273,[484]6.9295,[485]6.9287,[486]6.9227,[487]6.9233,[488]6.9219,[489]6.9161,[490]6.9132,[491]6.9109,[492]6.9048,[493]6.9019,[494]6.9008,[495]6.9013,[496]6.8970,[497]6.8923,[498]6.8908,[499]6.8846,[500]6.8749,[501]6.8686,[502]6.8689,[503]6.8673,[504]6.8577,[505]6.8599,[506]6.8605,[507]6.8559,[508]6.8514,[509]6.8506,[510]6.8548,[511]6.8601,[512]6.863
2,[513]6.8650,[514]6.8722,[515]6.8664,[516]6.8648,[517]6.8661,[518]6.8655,[519]6.8687,[520]6.8713,[521]6.8724,[522]6.8757,[523]6.8762,[524]6.8826,[525]6.8860,[526]6.8879,[527]6.8899,[528]6.8851,[529]6.8861,[530]6.8799,[531]6.8775,[532]6.8826,[533]6.8846,[534]6.8821,[535]6.8854,[536]6.8789,[537]6.8761,[538]6.8816,[539]6.8823,[540]6.8880,[541]6.8893,[542]6.8899,[543]6.8916,[544]6.8923,[545]6.8906,[546]6.8919,[547]6.8868,[548]6.8803,[549]6.8806,[550]6.8770,[551]6.8732,[552]6.8708,[553]6.8665,[554]6.8638,[555]6.8593,[556]6.8589,[557]6.8627,[558]6.8587,[559]6.8583,[560]6.8578,[561]6.8580,[562]6.8555,[563]6.8559,[564]6.8605,[565]6.8629,[566]6.8624,[567]6.8599,[568]6.8593,[569]6.8572,[570]6.8601,[571]6.8603,[572]6.8609,[573]6.8598,[574]6.8565,[575]6.8570,[576]6.8572,[577]6.8550,[578]6.8537,[579]6.8547,[580]6.8475,[581]6.8430,[582]6.8412,[583]6.8415,[584]6.8412,[585]6.8333,[586]6.8264,[587]6.8268,[588]6.8321,[589]6.8388,[590]6.8420,[591]6.8438,[592]6.8424,[593]6.8377,[594]6.8379,[595]6.8352,[596]6.8387,[597]6.8353,[598]6.8323,[599]6.8340,[600]6.8332,[601]6.8313,[602]6.8341,[603]6.8370,[604]6.8381,[605]6.8409,[606]6.8429,[607]6.8417,[608]6.8373,[609]6.8375,[610]6.8410,[611]6.8390,[612]6.8421,[613]6.8380,[614]6.8332,[615]6.8241,[616]6.8281,[617]6.8211,[618]6.8153,[619]6.8084,[620]6.7923,[621]6.7843,[622]6.7822,[623]6.7841,[624]6.7839,[625]6.7838,[626]6.7822,[627]6.7851,[628]6.7848,[629]6.7844,[630]6.7879,[631]6.7937,[632]6.7994,[633]6.7975,[634]6.8018,[635]6.8029,[636]6.8007,[637]6.7977,[638]6.8010,[639]6.7979,[640]6.7991,[641]6.7990,[642]6.8064,[643]6.8086,[644]6.8094,[645]6.8069,[646]6.8121,[647]6.8090,[648]6.8099,[649]6.8094,[650]6.8144,[651]6.8197,[652]6.8211,[653]6.8252,[654]6.8175,[655]6.8160,
llama_print_timings: load time = 2658.59 ms llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run) llama_print_timings: prompt eval time = 976009.34 ms / 335360 tokens ( 2.91 ms per token) llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per run) llama_print_timings: total time = 1004906.69 ms
7B, Q3_4 + Q5_1
main: seed = 1682864367 llama.cpp: loading model from ../build/junk1.bin llama_model_load_internal: format = ggjt v1 (latest) llama_model_load_internal: n_vocab = 32000 llama_model_load_internal: n_ctx = 512 llama_model_load_internal: n_embd = 4096 llama_model_load_internal: n_mult = 256 llama_model_load_internal: n_head = 32 llama_model_load_internal: n_layer = 32 llama_model_load_internal: n_rot = 128 llama_model_load_internal: ftype = 10 (mostly Q3_4) llama_model_load_internal: n_ff = 11008 llama_model_load_internal: n_parts = 1 llama_model_load_internal: model size = 7B llama_model_load_internal: ggml ctx size = 59.11 KB llama_model_load_internal: mem required = 5578.28 MB (+ 1026.00 MB per state) llama_init_from_file: kv self size = 256.00 MBsystem_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | perplexity : calculating perplexity over 655 chunks, batch_size=512 1.69 seconds per pass - ETA 18 minutes [1]4.3772,[2]4.7902,[3]5.6644,[4]6.2926,[5]6.4094,[6]6.3682,[7]6.5670,[8]6.6689,[9]7.0093,[10]7.2537,[11]7.4667,[12]7.5133,[13]7.4469,[14]7.5018,[15]7.7581,[16]7.3727,[17]7.2507,[18]7.1917,[19]6.8302,[20]6.8186,[21]6.7251,[22]6.5571,[23]6.5252,[24]6.4336,[25]6.4301,[26]6.2650,[27]6.0943,[28]5.9983,[29]5.9072,[30]5.7545,[31]5.7284,[32]5.7503,[33]5.6977,[34]5.7261,[35]5.7495,[36]5.7812,[37]5.7814,[38]5.7930,[39]5.8255,[40]5.8782,[41]5.8900,[42]5.9274,[43]5.8861,[44]5.9426,[45]5.9455,[46]5.9212,[47]5.9424,[48]5.9169,[49]5.9180,[50]5.8766,[51]5.8735,[52]5.8628,[53]5.9067,[54]5.8907,[55]5.8671,[56]5.8970,[57]5.9166,[58]5.9371,[59]5.9534,[60]5.9951,[61]5.9885,[62]6.0481,[63]6.0787,[64]6.0918,[65]6.1345,[66]6.1431,[67]6.1623,[68]6.1757,[69]6.1978,[70]6.2297,[71]6.2521,[72]6.2830,[73]6.3437,[74]6.3470,[75]6.3602,[76]6.3750,[77]6.3875,[78]6.3715,[79]6.4006,[80]6.3955,[81]6.4061,[82]6.4121,[83]6.3619,[84]6.3428,[85]6.3290,[86]6.3068,[87]6.2470,[88]6.2224,[89]6.2023,[90]6.1887,[91]6.2120,[92]6.2071,[93]6.2058,[94]6.2024,[95]6.2297,[96]6.2300,[97]6.2245,[98]6.2201,[99]6.2056,[100]6.2044,[101]6.2303,[102]6.2271,[103]6.2465,[104]6.2540,[105]6.2524,[106]6.2693,[107]6.2680,[108]6.2817,[109]6.2775,[110]6.2736,[111]6.2939,[112]6.3155,[113]6.3180,[114]6.3130,[115]6.3189,[116]6.3090,[117]6.3133,[118]6.3412,[119]6.3632,[120]6.3973,[121]6.4124,[122]6.4367,[123]6.4732,[124]6.4919,[125]6.4823,[126]6.5207,[127]6.5572,[128]6.5871,[129]6.5713,[130]6.5796,[131]6.5760,[132]6.5691,[133]6.5555,[134]6.5644,[135]6.5596,[136]6.5491,[137]6.5408,[138]6.5234,[139]6.5126,[140]6.5101,[141]6.4830,[142]6.4794,[143]6.4498,[144]6.4283,[145]6.4197,[146]6.4070,[147]6.4104,[148]6.4101,[149]6.4047,[150]6.4007,[151]6.4037,[152]6.3940,[153]6.3789,[154]6.3707,[155]6.3774,[156]6.3727,[157]6.3896,[158]6.3943,[159]6.3981,[160]6.4011,[161]6.4143,[162]6.3863,[163]6.3744,[164]6.3508,[165]6.3203,[166]6.2935,[167]6.2558,[168]6.2253,[169]6.2117,[170]6.2011,[171]6.1750,[172]6.1585,[173]6.1421,[174]6.1119,[175]6.0898,[176]6.0779,[177]6.0581,[178]6.0349,[179]6.0179,[180]6.0087,[181]5.9869,[182]5.9695,[183]5.9546,[184]5.9524,[185]5.9448,[186]5.9454,[187]5.9516,[188]5.9485,[189]5.9654,[190]5.9665,[191]5.9878,[192]6.0035,[193]6.0195,[194]6.0306,[195]6.0520,[196]6.0674,[197]6.0883,[198]6.1032,[199]6.1060,[200]6.1102,[201]6.1052,[202]6.1251,[203]6.1329,[204]6.1330,[205]6.1432,[206]6.1498,[207]6.1461,[208]6.1546,[209]6.1593,[210]6.1638,[211]6.1749,[212]6.1826,[213]6.1933,[214]6.1958,[215]6.1980,[216]6
.2120,[217]6.2295,[218]6.2431,[219]6.2435,[220]6.2400,[221]6.2343,[222]6.2324,[223]6.2227,[224]6.2156,[225]6.2119,[226]6.2319,[227]6.2393,[228]6.2448,[229]6.2501,[230]6.2464,[231]6.2637,[232]6.2520,[233]6.2350,[234]6.2204,[235]6.2017,[236]6.1944,[237]6.1850,[238]6.1878,[239]6.1730,[240]6.1623,[241]6.1645,[242]6.1681,[243]6.1662,[244]6.1551,[245]6.1519,[246]6.1415,[247]6.1301,[248]6.1230,[249]6.1208,[250]6.1253,[251]6.1191,[252]6.1150,[253]6.1047,[254]6.0999,[255]6.0884,[256]6.0706,[257]6.0587,[258]6.0507,[259]6.0489,[260]6.0406,[261]6.0367,[262]6.0311,[263]6.0259,[264]6.0068,[265]6.0064,[266]6.0045,[267]5.9985,[268]6.0072,[269]6.0053,[270]6.0054,[271]6.0129,[272]6.0167,[273]6.0167,[274]6.0187,[275]6.0269,[276]6.0326,[277]6.0485,[278]6.0587,[279]6.0675,[280]6.0697,[281]6.0794,[282]6.0850,[283]6.0996,[284]6.1077,[285]6.1164,[286]6.1293,[287]6.1286,[288]6.1348,[289]6.1263,[290]6.1109,[291]6.0960,[292]6.0814,[293]6.0681,[294]6.0702,[295]6.0689,[296]6.0739,[297]6.0731,[298]6.0757,[299]6.0734,[300]6.0626,[301]6.0623,[302]6.0550,[303]6.0461,[304]6.0376,[305]6.0340,[306]6.0213,[307]6.0235,[308]6.0259,[309]6.0104,[310]6.0047,[311]5.9983,[312]6.0006,[313]5.9954,[314]5.9937,[315]5.9781,[316]5.9735,[317]5.9570,[318]5.9369,[319]5.9493,[320]5.9613,[321]5.9657,[322]5.9617,[323]5.9553,[324]5.9524,[325]5.9627,[326]5.9630,[327]5.9652,[328]5.9687,[329]5.9742,[330]5.9772,[331]5.9890,[332]5.9866,[333]5.9933,[334]5.9878,[335]5.9821,[336]5.9857,[337]5.9832,[338]5.9830,[339]5.9774,[340]5.9734,[341]5.9818,[342]5.9841,[343]5.9892,[344]5.9893,[345]5.9891,[346]5.9866,[347]5.9903,[348]5.9941,[349]5.9966,[350]5.9933,[351]5.9943,[352]5.9944,[353]5.9882,[354]5.9888,[355]5.9939,[356]5.9971,[357]5.9933,[358]6.0026,[359]6.0051,[360]6.0016,[361]6.0011,[362]6.0076,[363]6.0186,[364]6.0244,[365]6.0288,[366]6.0300,[367]6.0383,[368]6.0357,[369]6.0366,[370]6.0381,[371]6.0325,[372]6.0374,[373]6.0423,[374]6.0407,[375]6.0407,[376]6.0479,[377]6.0431,[378]6.0457,[379]6.0517,[380]6.0436,[381]6.0401,[382]6.0350,[383]6.0341,[384]6.0336,[385]6.0324,[386]6.0319,[387]6.0321,[388]6.0285,[389]6.0233,[390]6.0163,[391]6.0085,[392]6.0043,[393]6.0027,[394]6.0054,[395]6.0045,[396]5.9971,[397]6.0038,[398]6.0070,[399]6.0144,[400]6.0143,[401]6.0162,[402]6.0175,[403]6.0195,[404]6.0260,[405]6.0171,[406]6.0138,[407]6.0137,[408]6.0151,[409]6.0260,[410]6.0373,[411]6.0488,[412]6.0648,[413]6.0758,[414]6.0837,[415]6.0893,[416]6.0976,[417]6.1097,[418]6.1132,[419]6.1196,[420]6.1286,[421]6.1404,[422]6.1437,[423]6.1507,[424]6.1612,[425]6.1700,[426]6.1762,[427]6.1805,[428]6.1889,[429]6.1939,[430]6.2021,[431]6.2158,[432]6.2192,[433]6.2186,[434]6.2143,[435]6.2147,[436]6.2173,[437]6.2269,[438]6.2344,[439]6.2311,[440]6.2303,[441]6.2254,[442]6.2239,[443]6.2248,[444]6.2251,[445]6.2235,[446]6.2256,[447]6.2287,[448]6.2331,[449]6.2309,[450]6.2317,[451]6.2277,[452]6.2160,[453]6.2077,[454]6.2022,[455]6.2032,[456]6.2077,[457]6.2097,[458]6.2074,[459]6.2078,[460]6.2164,[461]6.2138,[462]6.2124,[463]6.2159,[464]6.2147,[465]6.2119,[466]6.2042,[467]6.2045,[468]6.2042,[469]6.2063,[470]6.2065,[471]6.2018,[472]6.2062,[473]6.2008,[474]6.2022,[475]6.1968,[476]6.1979,[477]6.1909,[478]6.1900,[479]6.1963,[480]6.2009,[481]6.2029,[482]6.1985,[483]6.1947,[484]6.1965,[485]6.1950,[486]6.1894,[487]6.1893,[488]6.1870,[489]6.1822,[490]6.1798,[491]6.1769,[492]6.1713,[493]6.1685,[494]6.1666,[495]6.1657,[496]6.1621,[497]6.1566,[498]6.1552,[499]6.1510,[500]6.1418,[501]6.1353,[502]6.1357,[503]6.1348,[504]6.1260,[505]6.1281,[506]6.1291,[507]6.1237,[508]6.1196,[509]6.1191,[510]6.1228,[511]6.1273,[512]6.13
07,[513]6.1326,[514]6.1387,[515]6.1333,[516]6.1324,[517]6.1335,[518]6.1329,[519]6.1359,[520]6.1383,[521]6.1395,[522]6.1422,[523]6.1429,[524]6.1484,[525]6.1514,[526]6.1522,[527]6.1539,[528]6.1487,[529]6.1497,[530]6.1448,[531]6.1435,[532]6.1480,[533]6.1501,[534]6.1479,[535]6.1502,[536]6.1447,[537]6.1427,[538]6.1480,[539]6.1493,[540]6.1532,[541]6.1536,[542]6.1547,[543]6.1562,[544]6.1572,[545]6.1552,[546]6.1558,[547]6.1516,[548]6.1467,[549]6.1467,[550]6.1435,[551]6.1400,[552]6.1379,[553]6.1341,[554]6.1320,[555]6.1288,[556]6.1284,[557]6.1311,[558]6.1276,[559]6.1275,[560]6.1274,[561]6.1276,[562]6.1254,[563]6.1250,[564]6.1294,[565]6.1314,[566]6.1311,[567]6.1289,[568]6.1295,[569]6.1282,[570]6.1314,[571]6.1318,[572]6.1329,[573]6.1328,[574]6.1292,[575]6.1285,[576]6.1285,[577]6.1270,[578]6.1253,[579]6.1258,[580]6.1193,[581]6.1157,[582]6.1148,[583]6.1157,[584]6.1159,[585]6.1085,[586]6.1019,[587]6.1027,[588]6.1075,[589]6.1127,[590]6.1157,[591]6.1177,[592]6.1163,[593]6.1129,[594]6.1139,[595]6.1116,[596]6.1149,[597]6.1126,[598]6.1096,[599]6.1119,[600]6.1112,[601]6.1097,[602]6.1113,[603]6.1138,[604]6.1148,[605]6.1181,[606]6.1204,[607]6.1190,[608]6.1156,[609]6.1163,[610]6.1198,[611]6.1182,[612]6.1210,[613]6.1173,[614]6.1121,[615]6.1047,[616]6.1072,[617]6.1011,[618]6.0962,[619]6.0906,[620]6.0769,[621]6.0702,[622]6.0685,[623]6.0700,[624]6.0706,[625]6.0708,[626]6.0697,[627]6.0722,[628]6.0721,[629]6.0717,[630]6.0749,[631]6.0804,[632]6.0860,[633]6.0846,[634]6.0879,[635]6.0883,[636]6.0857,[637]6.0821,[638]6.0845,[639]6.0814,[640]6.0823,[641]6.0823,[642]6.0888,[643]6.0912,[644]6.0926,[645]6.0911,[646]6.0953,[647]6.0917,[648]6.0926,[649]6.0928,[650]6.0968,[651]6.1019,[652]6.1030,[653]6.1068,[654]6.1003,[655]6.0996,
llama_print_timings: load time = 2722.23 ms llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run) llama_print_timings: prompt eval time = 1024717.17 ms / 335360 tokens ( 3.06 ms per token) llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per run) llama_print_timings: total time = 1052144.49 ms
13B, Q2_4 + Q5_1
main: seed = 1682869318 llama.cpp: loading model from ../build/junk3.bin llama_model_load_internal: format = ggjt v1 (latest) llama_model_load_internal: n_vocab = 32000 llama_model_load_internal: n_ctx = 512 llama_model_load_internal: n_embd = 5120 llama_model_load_internal: n_mult = 256 llama_model_load_internal: n_head = 40 llama_model_load_internal: n_layer = 40 llama_model_load_internal: n_rot = 128 llama_model_load_internal: ftype = 15 (mostly Q2_4) llama_model_load_internal: n_ff = 13824 llama_model_load_internal: n_parts = 1 llama_model_load_internal: model size = 13B llama_model_load_internal: ggml ctx size = 73.73 KB llama_model_load_internal: mem required = 8284.13 MB (+ 1608.00 MB per state) llama_init_from_file: kv self size = 400.00 MBsystem_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | perplexity : calculating perplexity over 655 chunks, batch_size=512 2.66 seconds per pass - ETA 29 minutes [1]3.9721,[2]4.4565,[3]5.2597,[4]5.9033,[5]6.0338,[6]5.9622,[7]6.1601,[8]6.2656,[9]6.5236,[10]6.7829,[11]7.0013,[12]7.0990,[13]7.0476,[14]7.1859,[15]7.4172,[16]7.0334,[17]6.9135,[18]6.8786,[19]6.5400,[20]6.5070,[21]6.4154,[22]6.2361,[23]6.1969,[24]6.0978,[25]6.1050,[26]5.9363,[27]5.7476,[28]5.6483,[29]5.5673,[30]5.4279,[31]5.3936,[32]5.4050,[33]5.3676,[34]5.4138,[35]5.4320,[36]5.4581,[37]5.4515,[38]5.4411,[39]5.4719,[40]5.5259,[41]5.5538,[42]5.5976,[43]5.5570,[44]5.6013,[45]5.6034,[46]5.5709,[47]5.5914,[48]5.5665,[49]5.5732,[50]5.5425,[51]5.5525,[52]5.5445,[53]5.5919,[54]5.5826,[55]5.5603,[56]5.5833,[57]5.6051,[58]5.6318,[59]5.6509,[60]5.6875,[61]5.6795,[62]5.7390,[63]5.7665,[64]5.7749,[65]5.8128,[66]5.8121,[67]5.8288,[68]5.8395,[69]5.8698,[70]5.9008,[71]5.9260,[72]5.9637,[73]6.0136,[74]6.0186,[75]6.0287,[76]6.0452,[77]6.0630,[78]6.0533,[79]6.0811,[80]6.0773,[81]6.0940,[82]6.0916,[83]6.0430,[84]6.0330,[85]6.0263,[86]6.0075,[87]5.9484,[88]5.9136,[89]5.8907,[90]5.8817,[91]5.9066,[92]5.9045,[93]5.9073,[94]5.9044,[95]5.9318,[96]5.9287,[97]5.9266,[98]5.9230,[99]5.9158,[100]5.9112,[101]5.9382,[102]5.9320,[103]5.9477,[104]5.9492,[105]5.9504,[106]5.9664,[107]5.9629,[108]5.9788,[109]5.9759,[110]5.9701,[111]5.9873,[112]6.0048,[113]6.0053,[114]6.0034,[115]6.0066,[116]5.9956,[117]5.9974,[118]6.0221,[119]6.0409,[120]6.0689,[121]6.0859,[122]6.1071,[123]6.1462,[124]6.1624,[125]6.1543,[126]6.1905,[127]6.2246,[128]6.2549,[129]6.2404,[130]6.2488,[131]6.2446,[132]6.2396,[133]6.2282,[134]6.2393,[135]6.2381,[136]6.2292,[137]6.2257,[138]6.2107,[139]6.2011,[140]6.2007,[141]6.1759,[142]6.1720,[143]6.1485,[144]6.1310,[145]6.1232,[146]6.1096,[147]6.1134,[148]6.1171,[149]6.1134,[150]6.1137,[151]6.1188,[152]6.1085,[153]6.0971,[154]6.0912,[155]6.0971,[156]6.0959,[157]6.1130,[158]6.1163,[159]6.1179,[160]6.1222,[161]6.1343,[162]6.1058,[163]6.0948,[164]6.0712,[165]6.0433,[166]6.0157,[167]5.9800,[168]5.9506,[169]5.9372,[170]5.9281,[171]5.9059,[172]5.8906,[173]5.8760,[174]5.8466,[175]5.8240,[176]5.8097,[177]5.7904,[178]5.7676,[179]5.7527,[180]5.7448,[181]5.7262,[182]5.7087,[183]5.6951,[184]5.6934,[185]5.6880,[186]5.6896,[187]5.6955,[188]5.6950,[189]5.7125,[190]5.7137,[191]5.7318,[192]5.7454,[193]5.7611,[194]5.7725,[195]5.7933,[196]5.8061,[197]5.8252,[198]5.8379,[199]5.8412,[200]5.8429,[201]5.8373,[202]5.8553,[203]5.8624,[204]5.8614,[205]5.8712,[206]5.8764,[207]5.8730,[208]5.8796,[209]5.8831,[210]5.8888,[211]5.8998,[212]5.9059,[213]5.9151,[214]5.9187,[215]5.9211,[216]
5.9327,[217]5.9480,[218]5.9628,[219]5.9616,[220]5.9575,[221]5.9532,[222]5.9522,[223]5.9451,[224]5.9380,[225]5.9342,[226]5.9546,[227]5.9632,[228]5.9704,[229]5.9768,[230]5.9741,[231]5.9897,[232]5.9793,[233]5.9619,[234]5.9460,[235]5.9269,[236]5.9199,[237]5.9101,[238]5.9136,[239]5.9005,[240]5.8905,[241]5.8933,[242]5.8951,[243]5.8940,[244]5.8841,[245]5.8807,[246]5.8702,[247]5.8600,[248]5.8524,[249]5.8490,[250]5.8530,[251]5.8444,[252]5.8398,[253]5.8301,[254]5.8258,[255]5.8145,[256]5.7966,[257]5.7849,[258]5.7773,[259]5.7751,[260]5.7663,[261]5.7617,[262]5.7569,[263]5.7506,[264]5.7336,[265]5.7324,[266]5.7288,[267]5.7218,[268]5.7296,[269]5.7292,[270]5.7288,[271]5.7352,[272]5.7392,[273]5.7404,[274]5.7412,[275]5.7480,[276]5.7552,[277]5.7693,[278]5.7775,[279]5.7851,[280]5.7886,[281]5.7987,[282]5.8037,[283]5.8164,[284]5.8253,[285]5.8337,[286]5.8472,[287]5.8436,[288]5.8492,[289]5.8424,[290]5.8276,[291]5.8144,[292]5.8000,[293]5.7865,[294]5.7876,[295]5.7878,[296]5.7926,[297]5.7915,[298]5.7945,[299]5.7927,[300]5.7833,[301]5.7830,[302]5.7765,[303]5.7683,[304]5.7593,[305]5.7559,[306]5.7445,[307]5.7467,[308]5.7477,[309]5.7330,[310]5.7289,[311]5.7237,[312]5.7254,[313]5.7193,[314]5.7175,[315]5.7028,[316]5.7002,[317]5.6864,[318]5.6684,[319]5.6795,[320]5.6913,[321]5.6957,[322]5.6918,[323]5.6866,[324]5.6844,[325]5.6958,[326]5.6973,[327]5.6980,[328]5.7006,[329]5.7052,[330]5.7076,[331]5.7186,[332]5.7145,[333]5.7218,[334]5.7162,[335]5.7112,[336]5.7145,[337]5.7124,[338]5.7122,[339]5.7075,[340]5.7049,[341]5.7114,[342]5.7142,[343]5.7184,[344]5.7186,[345]5.7190,[346]5.7163,[347]5.7199,[348]5.7239,[349]5.7266,[350]5.7247,[351]5.7257,[352]5.7258,[353]5.7197,[354]5.7213,[355]5.7262,[356]5.7298,[357]5.7267,[358]5.7352,[359]5.7368,[360]5.7326,[361]5.7319,[362]5.7393,[363]5.7502,[364]5.7557,[365]5.7603,[366]5.7622,[367]5.7718,[368]5.7690,[369]5.7704,[370]5.7726,[371]5.7684,[372]5.7738,[373]5.7777,[374]5.7762,[375]5.7754,[376]5.7821,[377]5.7776,[378]5.7799,[379]5.7845,[380]5.7770,[381]5.7744,[382]5.7703,[383]5.7691,[384]5.7695,[385]5.7677,[386]5.7658,[387]5.7652,[388]5.7614,[389]5.7571,[390]5.7516,[391]5.7449,[392]5.7416,[393]5.7411,[394]5.7436,[395]5.7422,[396]5.7362,[397]5.7439,[398]5.7481,[399]5.7549,[400]5.7535,[401]5.7535,[402]5.7548,[403]5.7576,[404]5.7632,[405]5.7513,[406]5.7473,[407]5.7472,[408]5.7478,[409]5.7592,[410]5.7692,[411]5.7790,[412]5.7942,[413]5.8061,[414]5.8126,[415]5.8184,[416]5.8260,[417]5.8361,[418]5.8395,[419]5.8448,[420]5.8530,[421]5.8638,[422]5.8671,[423]5.8735,[424]5.8834,[425]5.8915,[426]5.8979,[427]5.9019,[428]5.9094,[429]5.9138,[430]5.9201,[431]5.9333,[432]5.9357,[433]5.9345,[434]5.9307,[435]5.9313,[436]5.9338,[437]5.9427,[438]5.9507,[439]5.9471,[440]5.9469,[441]5.9426,[442]5.9404,[443]5.9406,[444]5.9423,[445]5.9409,[446]5.9426,[447]5.9445,[448]5.9481,[449]5.9461,[450]5.9467,[451]5.9429,[452]5.9315,[453]5.9228,[454]5.9169,[455]5.9165,[456]5.9207,[457]5.9221,[458]5.9201,[459]5.9199,[460]5.9272,[461]5.9232,[462]5.9207,[463]5.9215,[464]5.9202,[465]5.9184,[466]5.9109,[467]5.9116,[468]5.9107,[469]5.9121,[470]5.9116,[471]5.9063,[472]5.9090,[473]5.9043,[474]5.9051,[475]5.8991,[476]5.8992,[477]5.8916,[478]5.8895,[479]5.8934,[480]5.8974,[481]5.8984,[482]5.8939,[483]5.8903,[484]5.8919,[485]5.8881,[486]5.8823,[487]5.8820,[488]5.8798,[489]5.8753,[490]5.8725,[491]5.8696,[492]5.8638,[493]5.8606,[494]5.8583,[495]5.8567,[496]5.8530,[497]5.8476,[498]5.8459,[499]5.8416,[500]5.8336,[501]5.8267,[502]5.8268,[503]5.8254,[504]5.8169,[505]5.8170,[506]5.8175,[507]5.8126,[508]5.8087,[509]5.8087,[510]5.8112,[511]5.8161,[512]5.8
199,[513]5.8225,[514]5.8285,[515]5.8238,[516]5.8229,[517]5.8235,[518]5.8235,[519]5.8256,[520]5.8277,[521]5.8293,[522]5.8309,[523]5.8315,[524]5.8376,[525]5.8404,[526]5.8412,[527]5.8427,[528]5.8372,[529]5.8381,[530]5.8332,[531]5.8322,[532]5.8378,[533]5.8404,[534]5.8386,[535]5.8415,[536]5.8367,[537]5.8347,[538]5.8397,[539]5.8405,[540]5.8437,[541]5.8446,[542]5.8459,[543]5.8479,[544]5.8487,[545]5.8477,[546]5.8480,[547]5.8440,[548]5.8391,[549]5.8387,[550]5.8373,[551]5.8341,[552]5.8323,[553]5.8286,[554]5.8264,[555]5.8236,[556]5.8227,[557]5.8249,[558]5.8214,[559]5.8214,[560]5.8201,[561]5.8207,[562]5.8182,[563]5.8181,[564]5.8226,[565]5.8238,[566]5.8239,[567]5.8217,[568]5.8222,[569]5.8203,[570]5.8232,[571]5.8238,[572]5.8245,[573]5.8244,[574]5.8218,[575]5.8204,[576]5.8199,[577]5.8177,[578]5.8157,[579]5.8151,[580]5.8094,[581]5.8061,[582]5.8056,[583]5.8061,[584]5.8063,[585]5.8000,[586]5.7940,[587]5.7944,[588]5.7986,[589]5.8039,[590]5.8066,[591]5.8080,[592]5.8069,[593]5.8037,[594]5.8046,[595]5.8026,[596]5.8064,[597]5.8039,[598]5.8005,[599]5.8030,[600]5.8020,[601]5.8005,[602]5.8020,[603]5.8043,[604]5.8051,[605]5.8076,[606]5.8089,[607]5.8077,[608]5.8045,[609]5.8050,[610]5.8095,[611]5.8080,[612]5.8100,[613]5.8066,[614]5.8025,[615]5.7953,[616]5.7980,[617]5.7922,[618]5.7870,[619]5.7819,[620]5.7695,[621]5.7639,[622]5.7617,[623]5.7633,[624]5.7634,[625]5.7641,[626]5.7634,[627]5.7661,[628]5.7669,[629]5.7676,[630]5.7704,[631]5.7748,[632]5.7803,[633]5.7787,[634]5.7821,[635]5.7818,[636]5.7784,[637]5.7751,[638]5.7773,[639]5.7740,[640]5.7749,[641]5.7755,[642]5.7812,[643]5.7830,[644]5.7850,[645]5.7839,[646]5.7875,[647]5.7830,[648]5.7841,[649]5.7844,[650]5.7874,[651]5.7912,[652]5.7914,[653]5.7952,[654]5.7891,[655]5.7880,
llama_print_timings: load time = 3757.20 ms llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run) llama_print_timings: prompt eval time = 1660958.51 ms / 335360 tokens ( 4.95 ms per token) llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per run) llama_print_timings: total time = 1690237.58 ms
13B, Q3_4 + Q5_1
main: seed = 1682871034 llama.cpp: loading model from ../build/junk2.bin llama_model_load_internal: format = ggjt v1 (latest) llama_model_load_internal: n_vocab = 32000 llama_model_load_internal: n_ctx = 512 llama_model_load_internal: n_embd = 5120 llama_model_load_internal: n_mult = 256 llama_model_load_internal: n_head = 40 llama_model_load_internal: n_layer = 40 llama_model_load_internal: n_rot = 128 llama_model_load_internal: ftype = 10 (mostly Q3_4) llama_model_load_internal: n_ff = 13824 llama_model_load_internal: n_parts = 1 llama_model_load_internal: model size = 13B llama_model_load_internal: ggml ctx size = 73.73 KB llama_model_load_internal: mem required = 9353.66 MB (+ 1608.00 MB per state) llama_init_from_file: kv self size = 400.00 MBsystem_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | perplexity : calculating perplexity over 655 chunks, batch_size=512 2.78 seconds per pass - ETA 30 minutes [1]3.8638,[2]4.2438,[3]5.0285,[4]5.4691,[5]5.6500,[6]5.5749,[7]5.7279,[8]5.8483,[9]6.0969,[10]6.3309,[11]6.5275,[12]6.5851,[13]6.5289,[14]6.6307,[15]6.8287,[16]6.5065,[17]6.4160,[18]6.3973,[19]6.1030,[20]6.0757,[21]6.0022,[22]5.8315,[23]5.8015,[24]5.7102,[25]5.7213,[26]5.5727,[27]5.3968,[28]5.2983,[29]5.2181,[30]5.0845,[31]5.0477,[32]5.0637,[33]5.0127,[34]5.0546,[35]5.0779,[36]5.0973,[37]5.0901,[38]5.0859,[39]5.1136,[40]5.1553,[41]5.1784,[42]5.2145,[43]5.1772,[44]5.2185,[45]5.2210,[46]5.1948,[47]5.2217,[48]5.2018,[49]5.2068,[50]5.1752,[51]5.1815,[52]5.1747,[53]5.2203,[54]5.2098,[55]5.1910,[56]5.2134,[57]5.2298,[58]5.2522,[59]5.2691,[60]5.3031,[61]5.2955,[62]5.3497,[63]5.3745,[64]5.3859,[65]5.4235,[66]5.4215,[67]5.4387,[68]5.4507,[69]5.4795,[70]5.5091,[71]5.5318,[72]5.5665,[73]5.6155,[74]5.6227,[75]5.6330,[76]5.6482,[77]5.6602,[78]5.6456,[79]5.6718,[80]5.6661,[81]5.6766,[82]5.6737,[83]5.6292,[84]5.6184,[85]5.6138,[86]5.5979,[87]5.5348,[88]5.4920,[89]5.4704,[90]5.4603,[91]5.4818,[92]5.4778,[93]5.4791,[94]5.4792,[95]5.5064,[96]5.5025,[97]5.4993,[98]5.4960,[99]5.4891,[100]5.4861,[101]5.5101,[102]5.5053,[103]5.5205,[104]5.5248,[105]5.5265,[106]5.5405,[107]5.5396,[108]5.5541,[109]5.5528,[110]5.5473,[111]5.5658,[112]5.5820,[113]5.5806,[114]5.5798,[115]5.5831,[116]5.5710,[117]5.5707,[118]5.5948,[119]5.6120,[120]5.6409,[121]5.6560,[122]5.6767,[123]5.7139,[124]5.7325,[125]5.7269,[126]5.7621,[127]5.7945,[128]5.8224,[129]5.8098,[130]5.8178,[131]5.8138,[132]5.8098,[133]5.7973,[134]5.8061,[135]5.8063,[136]5.7966,[137]5.7923,[138]5.7782,[139]5.7695,[140]5.7679,[141]5.7406,[142]5.7374,[143]5.7121,[144]5.6965,[145]5.6883,[146]5.6761,[147]5.6809,[148]5.6838,[149]5.6807,[150]5.6801,[151]5.6846,[152]5.6780,[153]5.6682,[154]5.6629,[155]5.6696,[156]5.6674,[157]5.6831,[158]5.6844,[159]5.6853,[160]5.6893,[161]5.7012,[162]5.6757,[163]5.6658,[164]5.6443,[165]5.6185,[166]5.5949,[167]5.5631,[168]5.5362,[169]5.5232,[170]5.5140,[171]5.4926,[172]5.4804,[173]5.4673,[174]5.4398,[175]5.4193,[176]5.4058,[177]5.3890,[178]5.3683,[179]5.3554,[180]5.3479,[181]5.3314,[182]5.3149,[183]5.3022,[184]5.3012,[185]5.2933,[186]5.2949,[187]5.3005,[188]5.2977,[189]5.3145,[190]5.3146,[191]5.3317,[192]5.3451,[193]5.3599,[194]5.3711,[195]5.3902,[196]5.4024,[197]5.4218,[198]5.4356,[199]5.4370,[200]5.4385,[201]5.4321,[202]5.4455,[203]5.4518,[204]5.4465,[205]5.4555,[206]5.4606,[207]5.4563,[208]5.4615,[209]5.4655,[210]5.4714,[211]5.4821,[212]5.4887,[213]5.4977,[214]5.5005,[215]5.5040,[216]
5.5156,[217]5.5323,[218]5.5463,[219]5.5465,[220]5.5432,[221]5.5386,[222]5.5383,[223]5.5318,[224]5.5251,[225]5.5218,[226]5.5412,[227]5.5471,[228]5.5546,[229]5.5616,[230]5.5581,[231]5.5732,[232]5.5628,[233]5.5476,[234]5.5337,[235]5.5123,[236]5.5070,[237]5.4979,[238]5.5009,[239]5.4895,[240]5.4804,[241]5.4834,[242]5.4855,[243]5.4848,[244]5.4748,[245]5.4715,[246]5.4611,[247]5.4513,[248]5.4448,[249]5.4415,[250]5.4449,[251]5.4366,[252]5.4321,[253]5.4228,[254]5.4190,[255]5.4096,[256]5.3932,[257]5.3831,[258]5.3762,[259]5.3759,[260]5.3675,[261]5.3624,[262]5.3579,[263]5.3528,[264]5.3296,[265]5.3294,[266]5.3264,[267]5.3203,[268]5.3272,[269]5.3268,[270]5.3278,[271]5.3341,[272]5.3373,[273]5.3387,[274]5.3400,[275]5.3461,[276]5.3526,[277]5.3652,[278]5.3739,[279]5.3822,[280]5.3861,[281]5.3953,[282]5.4006,[283]5.4136,[284]5.4226,[285]5.4297,[286]5.4423,[287]5.4388,[288]5.4440,[289]5.4380,[290]5.4237,[291]5.4105,[292]5.3974,[293]5.3855,[294]5.3862,[295]5.3865,[296]5.3913,[297]5.3903,[298]5.3920,[299]5.3896,[300]5.3805,[301]5.3810,[302]5.3746,[303]5.3659,[304]5.3587,[305]5.3564,[306]5.3456,[307]5.3488,[308]5.3494,[309]5.3358,[310]5.3325,[311]5.3280,[312]5.3295,[313]5.3237,[314]5.3223,[315]5.3092,[316]5.3057,[317]5.2930,[318]5.2766,[319]5.2871,[320]5.2986,[321]5.3035,[322]5.3005,[323]5.2960,[324]5.2941,[325]5.3036,[326]5.3052,[327]5.3061,[328]5.3092,[329]5.3144,[330]5.3167,[331]5.3270,[332]5.3231,[333]5.3307,[334]5.3259,[335]5.3205,[336]5.3228,[337]5.3216,[338]5.3212,[339]5.3170,[340]5.3142,[341]5.3207,[342]5.3238,[343]5.3283,[344]5.3288,[345]5.3303,[346]5.3287,[347]5.3323,[348]5.3360,[349]5.3380,[350]5.3362,[351]5.3372,[352]5.3372,[353]5.3320,[354]5.3332,[355]5.3378,[356]5.3408,[357]5.3376,[358]5.3456,[359]5.3477,[360]5.3443,[361]5.3442,[362]5.3513,[363]5.3619,[364]5.3670,[365]5.3709,[366]5.3726,[367]5.3815,[368]5.3790,[369]5.3802,[370]5.3821,[371]5.3781,[372]5.3829,[373]5.3872,[374]5.3851,[375]5.3848,[376]5.3909,[377]5.3874,[378]5.3900,[379]5.3941,[380]5.3870,[381]5.3839,[382]5.3800,[383]5.3783,[384]5.3780,[385]5.3769,[386]5.3759,[387]5.3759,[388]5.3730,[389]5.3694,[390]5.3640,[391]5.3582,[392]5.3547,[393]5.3541,[394]5.3571,[395]5.3565,[396]5.3512,[397]5.3575,[398]5.3618,[399]5.3687,[400]5.3677,[401]5.3684,[402]5.3694,[403]5.3719,[404]5.3775,[405]5.3626,[406]5.3584,[407]5.3576,[408]5.3585,[409]5.3694,[410]5.3786,[411]5.3883,[412]5.4022,[413]5.4125,[414]5.4185,[415]5.4245,[416]5.4314,[417]5.4411,[418]5.4435,[419]5.4482,[420]5.4558,[421]5.4656,[422]5.4689,[423]5.4741,[424]5.4831,[425]5.4906,[426]5.4966,[427]5.5008,[428]5.5078,[429]5.5114,[430]5.5176,[431]5.5305,[432]5.5337,[433]5.5328,[434]5.5292,[435]5.5305,[436]5.5332,[437]5.5416,[438]5.5489,[439]5.5459,[440]5.5453,[441]5.5407,[442]5.5392,[443]5.5403,[444]5.5420,[445]5.5411,[446]5.5430,[447]5.5452,[448]5.5483,[449]5.5468,[450]5.5479,[451]5.5449,[452]5.5295,[453]5.5197,[454]5.5145,[455]5.5148,[456]5.5189,[457]5.5200,[458]5.5180,[459]5.5180,[460]5.5252,[461]5.5215,[462]5.5181,[463]5.5165,[464]5.5162,[465]5.5143,[466]5.5069,[467]5.5058,[468]5.5037,[469]5.5049,[470]5.5041,[471]5.4993,[472]5.5002,[473]5.4956,[474]5.4943,[475]5.4876,[476]5.4859,[477]5.4775,[478]5.4748,[479]5.4757,[480]5.4784,[481]5.4786,[482]5.4740,[483]5.4698,[484]5.4706,[485]5.4645,[486]5.4580,[487]5.4569,[488]5.4541,[489]5.4487,[490]5.4458,[491]5.4424,[492]5.4359,[493]5.4332,[494]5.4314,[495]5.4290,[496]5.4250,[497]5.4188,[498]5.4162,[499]5.4126,[500]5.4045,[501]5.3978,[502]5.3970,[503]5.3960,[504]5.3887,[505]5.3887,[506]5.3894,[507]5.3838,[508]5.3802,[509]5.3806,[510]5.3827,[511]5.3868,[512]5.3
907,[513]5.3932,[514]5.3987,[515]5.3948,[516]5.3937,[517]5.3936,[518]5.3936,[519]5.3960,[520]5.3972,[521]5.3983,[522]5.4000,[523]5.4006,[524]5.4058,[525]5.4084,[526]5.4089,[527]5.4106,[528]5.4052,[529]5.4058,[530]5.4021,[531]5.4018,[532]5.4066,[533]5.4091,[534]5.4073,[535]5.4099,[536]5.4055,[537]5.4036,[538]5.4085,[539]5.4093,[540]5.4112,[541]5.4111,[542]5.4125,[543]5.4145,[544]5.4157,[545]5.4145,[546]5.4147,[547]5.4114,[548]5.4073,[549]5.4074,[550]5.4054,[551]5.4028,[552]5.4010,[553]5.3980,[554]5.3957,[555]5.3938,[556]5.3928,[557]5.3943,[558]5.3909,[559]5.3915,[560]5.3903,[561]5.3906,[562]5.3881,[563]5.3879,[564]5.3921,[565]5.3932,[566]5.3936,[567]5.3919,[568]5.3927,[569]5.3912,[570]5.3937,[571]5.3951,[572]5.3961,[573]5.3968,[574]5.3937,[575]5.3920,[576]5.3915,[577]5.3898,[578]5.3881,[579]5.3883,[580]5.3830,[581]5.3801,[582]5.3801,[583]5.3810,[584]5.3814,[585]5.3756,[586]5.3698,[587]5.3703,[588]5.3746,[589]5.3796,[590]5.3827,[591]5.3843,[592]5.3831,[593]5.3792,[594]5.3805,[595]5.3790,[596]5.3829,[597]5.3809,[598]5.3778,[599]5.3806,[600]5.3795,[601]5.3784,[602]5.3789,[603]5.3818,[604]5.3826,[605]5.3855,[606]5.3871,[607]5.3854,[608]5.3826,[609]5.3834,[610]5.3875,[611]5.3863,[612]5.3887,[613]5.3859,[614]5.3819,[615]5.3760,[616]5.3784,[617]5.3734,[618]5.3688,[619]5.3644,[620]5.3533,[621]5.3482,[622]5.3463,[623]5.3477,[624]5.3481,[625]5.3489,[626]5.3484,[627]5.3511,[628]5.3519,[629]5.3525,[630]5.3553,[631]5.3598,[632]5.3644,[633]5.3633,[634]5.3663,[635]5.3660,[636]5.3623,[637]5.3585,[638]5.3606,[639]5.3574,[640]5.3579,[641]5.3584,[642]5.3637,[643]5.3654,[644]5.3676,[645]5.3662,[646]5.3699,[647]5.3651,[648]5.3664,[649]5.3667,[650]5.3695,[651]5.3737,[652]5.3742,[653]5.3779,[654]5.3724,[655]5.3715,
llama_print_timings: load time = 3902.02 ms llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run) llama_print_timings: prompt eval time = 1737619.18 ms / 335360 tokens ( 5.18 ms per token) llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per run) llama_print_timings: total time = 1764928.09 ms
Great idea! Do you think the RMSE metric can somehow be used to determine which quantization mode is best for a given layer (in order to avoid recomputing perplexity)? Or maybe some other metric / heuristic? Entropy?
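For concreteness, the metric in question could be computed per tensor as a quantize/dequantize round trip; a rough sketch (the quantize/dequantize callables are placeholders, not actual ggml API names):

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Round-trip quantization RMSE for one tensor. QuantFn and DequantFn stand in
// for whatever quantize/dequantize routines one would call.
template <typename QuantFn, typename DequantFn>
double quantization_rmse(const std::vector<float> & weights,
                         QuantFn quantize, DequantFn dequantize) {
    const std::vector<uint8_t> packed   = quantize(weights);
    const std::vector<float>   restored = dequantize(packed, weights.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        const double diff = restored[i] - weights[i];
        sum += diff * diff;
    }
    return std::sqrt(sum / weights.size());
}
```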
I'd not compare it to audio; variable bitrate there is something totally different. But I also considered trying this. What would be needed is an automated benchmarking tool that can permute the quantization per layer on the fly (by using the full model on a box with enough RAM) and return a quality score, performance, and VRAM usage for each variant. That could be kept running, and in the end we'd have a nice table that shows which combinations are most successful.
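A very rough sketch of what such a sweep could look like; build_mixed_model(), measure_perplexity() and measure_vram_gb() are placeholders for tooling that does not exist yet, the point is only the shape of the loop:

```cpp
#include <string>
#include <vector>

// Sketch only. One hypothetical sweep: start from a baseline type everywhere,
// bump a single layer at a time to a candidate type, and record quality/VRAM.
struct variant_result {
    std::vector<std::string> per_layer_type; // e.g. {"Q2_4", "Q5_1", ...}
    double ppl     = 0.0;
    double vram_gb = 0.0;
};

std::vector<variant_result> sweep_layer_quantizations(
        const std::vector<std::string> & candidate_types,
        const std::string & baseline_type, int n_layer) {
    std::vector<variant_result> results;
    for (int layer = 0; layer < n_layer; ++layer) {
        for (const auto & type : candidate_types) {
            variant_result r;
            r.per_layer_type.assign(n_layer, baseline_type);
            r.per_layer_type[layer] = type;
            // r.ppl     = measure_perplexity(build_mixed_model(r.per_layer_type));
            // r.vram_gb = measure_vram_gb(r.per_layer_type);
            results.push_back(r);
        }
    }
    return results;
}
```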
There would also be merit in testing keeping the encoder block at higher resolution and the decoder at lower resolution, as well as the early layers at high resolution and the late layers at lower resolution. My reasoning is that the error accumulates with each layer, so a small error introduced in layer 1 might multiply into a big error in layer 40; but if we start with a higher quality calculation that gradually lowers, we might get better results. (That's why the encoder and decoder should also be treated differently.)
> My guess is that a locally adaptive variable bit rate would require a major change to ggml.
The easiest way I can think of would be to vary the allocation within a row, since we always quantize/dequantize/compute dot products row by row. You would have to allocate a "worst case" length since the row size is fixed, but you could still save memory accesses on more compressible rows. I have yet to come up with an encoding that would be close to performant, though.
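Purely as an illustration (hypothetical, not an existing ggml format), each row could carry a small header saying how many bits it actually uses, while storage is still padded to the worst case so row offsets stay fixed:

```cpp
#include <cstdint>

// Hypothetical per-row header for a variable bit rate row encoding. The row is
// always allocated its worst-case size; "bits" tells the kernel how much of it
// actually needs to be read and decoded.
struct var_bit_row_header {
    uint8_t  bits;      // bits per weight used for this row, e.g. 2..5
    uint8_t  reserved;
    uint16_t n_blocks;  // number of quantization blocks that follow
    // block scales and packed quants follow, padded to the worst-case length
};
```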
Suppose we want to quantize only half of the feed_forward.w2 and attention.wv tensors with a higher number of bits to reduce the model size compared to what I posted above. What would be the best strategy? My initial intuition was to pick the layers with the highest RMSE to quantize with more bits. That strategy turned out to be the worst. @cmp-nct suggests that it should be the first half of the layers, as errors made there would magnify down the network. This is better, but still not the best. Here is a list of strategies in decreasing order of performance (a sketch of how the winning rule could be expressed in code follows the list):
- Quantize first 1/4, then every 3rd layer with more bits
- Quantize every second layer with more bits
- Quantize the first half with more bits
- Quantize the highest-RMSE half with more bits (which basically turns out to be the second half of the layers).
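As a sketch (not the actual implementation), the winning "first 1/4, then every 3rd layer" rule could be expressed like this:

```cpp
// Sketch only, assuming layers are numbered 0..n_layer-1: use more bits for
// the first quarter of the layers and then for every 3rd layer after that.
static bool use_more_bits_sketch(int i_layer, int n_layer) {
    return i_layer < n_layer/4 || (i_layer - n_layer/4) % 3 == 0;
}
```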
I have added a bunch of new quantizations to play with in a quest to find the smallest model that gives a "reasonable" performance:
- `Q6_0`: Same as `Q4_0` and `Q5_0`, but uses 6 bits. This results in a perplexity of 5.9662 (7B) or 5.2540 (13B).
- `Q5_K`: "Super-blocks" of 256 weights. Quants are 5 bits. There is a single `fp16` scale per super-block. Each set (block) of 16 weights within a "super-block" has its own scale quantized to 8 bits. This results in 5.5625 bits per weight in total, so almost the same as `Q5_0`. It gives a perplexity of 5.9881 (7B) or 5.2719 (13B), so better than `Q5_0` at the same model size.
- `Q3_K`: As `Q5_K`, but using 3 bits per quant, so 3.5625 bits per weight.
- `Q4_K`: As `Q5_K`, but using 4 bits per quant, so 4.5625 bits per weight.
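To see where the 5.5625 bits per weight come from, here is the bit accounting for one `Q5_K` super-block, written as a hypothetical struct (the field layout is illustrative, not the actual ggml definition):

```cpp
#include <cstdint>

// Illustrative layout for one super-block of 256 weights (not the real struct):
struct q5_k_superblock_sketch {
    uint16_t d;          // one fp16 super-block scale
    uint8_t  scales[16]; // one 8-bit scale per block of 16 weights
    uint8_t  qh[32];     // high bit of each 5-bit quant: 256 x 1 bit
    uint8_t  qs[128];    // low 4 bits of each quant:     256 x 4 bits
};
// 2 + 16 + 32 + 128 = 178 bytes = 1424 bits for 256 weights
// 1424 / 256 = 5.5625 bits per weight, matching the number above.
```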
Here are some model sizes and perplexities where
- `output.weight` is always quantized with `Q6_0`
- `feed_forward.w2` and `attention.wv` are quantized with a mix of `Q3_K` and `Q5_K` as follows:
  - "Tiny": quantize first quarter, then every 3rd, of `feed_forward.w2` and `attention.wv` with `Q5_K`
  - "Small": quantize first quarter, then every 3rd, of `attention.wv` with `Q5_K`, all layers of `feed_forward.w2` with `Q5_K`
  - "Normal": quantize all layers of `feed_forward.w2` and `attention.wv` with `Q5_K`
- All other tensors are quantized with `Q3_K`
- A "Large" model, which uses the same strategy as "Tiny", but `Q3_K` is replaced with `Q4_K` quantization.
| Model | File size | 7B Perplexity |
|---|---|---|
| Tiny | 3.07G | 6.3063 |
| Small | 3.24G | 6.2322 |
| Normal | 3.30G | 6.1688 |
| Large | 3.72G | 6.0347 |
They all satisfy the requirement to be usable on a phone or within a web browser by a comfortable margin. Considering that the llama.cpp 7B quantized perplexity given on the main page was 6.59 just a few weeks ago, I think their performance should be seen as "reasonable". The "Large" model is ~10% smaller than any of the 4-bit quantizations currently listed on the main page, while outperforming them by a comfortable margin.
If I arbitrarily define a perplexity of 6.59 as the threshold for "reasonable" performance, I'm not able to find a 2-bit quantization scheme that has a "reasonable" performance and is smaller than the 3-bit models in the above table.
One thing that came to my mind during my latest tests: when running multiplications on Nvidia GPUs we are forced into 16-bit multiplications. Only the next generation of GPUs will likely support the native FP8 format (currently it is only available on Hopper GPUs). Also, running lower precision in custom CUDA kernels does not seem to work well (extreme performance loss); I suppose a high-effort project could close the performance gap (whatever magic clBlast is doing to get so close to cuBLAS).
When thinking about mixing the precision of layers, we should keep in mind that calculations on the GPU are already done in 16 bits. Any lower precision does not gain performance in those cases; it loses performance from the conversions.
Update: that's not relevant anymore
I encountered some other interesting methods and benchmarks here, just as background https://github.com/megvii-research/Sparsebit/blob/main/large_language_models/llama/quantization/README.md
Closed via #1684
> One thing that came to my mind during my latest tests: when running multiplications on Nvidia GPUs we are forced into 16-bit multiplications. Only the next generation of GPUs will likely support the native FP8 format (currently it is only available on Hopper GPUs). Also, running lower precision in custom CUDA kernels does not seem to work well (extreme performance loss); I suppose a high-effort project could close the performance gap (whatever magic clBlast is doing to get so close to cuBLAS).
>
> When thinking about mixing the precision of layers, we should keep in mind that calculations on the GPU are already done in 16 bits. Any lower precision does not gain performance in those cases; it loses performance from the conversions.
>
> Update: that's not relevant anymore
How is it not relevant anymore?
What does clBlast do? Is it better than cuBLAS when using quantized models?