compilade

108 comments of compilade

Note that MoE tensors *are* handled by this PR in the same way FFN tensors are handled:

```cpp
name.find("ffn_up") != std::string::npos
```

also matches the stacked expert tensors https://github.com/ggerganov/llama.cpp/blob/afd27f01fe832ece3d07ef03b7d34a9e80c4a895/src/llama.cpp#L586 And...
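For a concrete illustration (a sketch; the tensor names here assume llama.cpp's usual `blk.<i>.ffn_up*.weight` naming):

```cpp
#include <cassert>
#include <string>

int main() {
    const std::string dense = "blk.0.ffn_up.weight";      // regular FFN tensor
    const std::string moe   = "blk.0.ffn_up_exps.weight"; // stacked experts tensor
    // A substring check matches both, so the same quantization rule applies to MoE.
    assert(dense.find("ffn_up") != std::string::npos);
    assert(moe.find("ffn_up")   != std::string::npos);
}
```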

Upgrading to `transformers 4.45` likely isn't enough; `gguf.SpecialVocab(dir_model, load_merges=True)` only works with the old format while silently ignoring everything else: https://github.com/ggerganov/llama.cpp/blob/8277a817f18967581b02b2248989d773e8e99998/gguf-py/gguf/vocab.py#L123-L126

> I wonder, should we try to find a...

> I wonder if Q2_2 could be made faster if we used a block size of say 256 like the K-quants

Can't go with bigger blocks than 64 elements or...

Whew, it has been a month since I last touched this; I got distracted for a bit. (tl;dr at the end) Now that new ternary models like [TriLMs](https://huggingface.co/SpectraSuite/TriLM_3.9B_Unpacked) exist, ...

> we already have a workaround via padding for such kind of models

@ggerganov While it mostly works, padding like in isn't correct with `ggml_rms_norm`, because the row size is...
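To illustrate why (a sketch using the standard RMS-norm definition, not code from the PR): zero-padding a row from its true length $n$ to $n'$ leaves the sum of squares unchanged, but the mean is then taken over $n'$ elements:

$$
y_i = \frac{x_i}{\sqrt{\tfrac{1}{n}\sum_{j=1}^{n} x_j^2 + \epsilon}}
\quad\longrightarrow\quad
y_i' = \frac{x_i}{\sqrt{\tfrac{1}{n'}\sum_{j=1}^{n} x_j^2 + \epsilon}}
\approx \sqrt{\tfrac{n'}{n}}\, y_i
$$

so the output on the padded row is scaled up by roughly $\sqrt{n'/n}$.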

> Is there a consistent way to extract the bit pattern structure from the source code? It's a bit hard to grok the superblock, blocks and how bits are being...
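For illustration, the block layouts are declared as plain C structs in ggml; here is a simplified sketch of the simplest (non-K) case, with field names modeled on `ggml-common.h` (treat the details as approximate, not authoritative):

```cpp
#include <cstdint>

typedef uint16_t ggml_half; // f16 stored as raw bits (stand-in for ggml's typedef)

#define QK8_0 32            // elements per (non-K) block

// One Q8_0 block: a per-block f16 scale followed by 32 int8 quants.
// Dequantization is x[i] = d * qs[i].
typedef struct {
    ggml_half d;         // scale (delta)
    int8_t    qs[QK8_0]; // quants
} block_q8_0;

// K-quants instead use 256-element superblocks subdivided into sub-blocks,
// each with its own packed scale (and min), which is what makes their bit
// patterns harder to read straight from the structs.
```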

I've made some preliminary performance (speed) tests with `TQ1_0` and `TQ2_0`, and `TQ1_0` is faster than `Q1_3`, now around the speed of `Q8_0`, while `TQ2_0` got a ***very big*** perf...

I've tested that a round-trip quantization between `TQ1_0` and `TQ2_0` is lossless, which means one can always be made from the other.

```console
$ ./build/bin/llama-quantize models/trilm-390M-f16.gguf models/trilm-390M-tq1_0.gguf tq1_0
$ ./build/bin/llama-quantize...
```
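This works because both types store the exact same ternary digits (plus per-block scales), only packed differently. A minimal sketch of why such re-packing round-trips losslessly, using an illustrative base-3 packing (not the actual `TQ1_0`/`TQ2_0` bit layouts):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Pack five ternary digits {0,1,2} per byte (3^5 = 243 <= 256), ~1.6 bits/trit,
// similar in spirit to TQ1_0's density; TQ2_0 would store 2 bits per trit.
static std::vector<uint8_t> pack_base3(const std::vector<uint8_t> & trits) {
    std::vector<uint8_t> out;
    for (size_t i = 0; i < trits.size(); i += 5) {
        uint8_t b = 0;
        for (size_t j = std::min(trits.size(), i + 5); j-- > i; ) {
            b = b*3 + trits[j];
        }
        out.push_back(b);
    }
    return out;
}

static std::vector<uint8_t> unpack_base3(const std::vector<uint8_t> & bytes, size_t n) {
    std::vector<uint8_t> out;
    for (uint8_t b : bytes) {
        for (int j = 0; j < 5 && out.size() < n; ++j) {
            out.push_back(b % 3);
            b /= 3;
        }
    }
    return out;
}

int main() {
    const std::vector<uint8_t> trits = {0,2,1,1,0,2,2,0,1,0,1};
    // Both encodings keep every trit exactly, so converting one packed form to
    // the other and back always reproduces the original values.
    assert(unpack_base3(pack_base3(trits), trits.size()) == trits);
}
```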

> I see perplexity looks too good for tq1_0 and tq2_0 .... too good to be true ;)

Keep in mind these types were only tested on models which were...

I'm setting this to "draft", because of concerns by @ikawrakow in and (mostly related to the fact that GGUF is harder to parse than `imatrix.dat` files). More details near the...