Kawrakow

Results: 18 issues by Kawrakow

### Update

After seeing PR #835, I pushed some more changes that only affect the `Q4_0` results. I now get `rmse = 0.00185228` for the 7B model. Perplexity...
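For context, here is how an `rmse` figure of that kind can be measured: quantize the weights, dequantize them back, and take the root mean squared error against the originals. This is a minimal sketch in the spirit of the `quantize-stats` tool, not code from the PR:

```c
#include <math.h>
#include <stddef.h>

/* Sketch: rmse between original weights and their quantize->dequantize
 * round trip; accumulate in double for numerical safety. */
float rmse(const float *x, const float *x_dequantized, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double e = (double)x[i] - (double)x_dequantized[i];
        sum += e * e;
    }
    return (float)sqrt(sum / (double)n);
}
```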

research 🔬

I was surprised by the belief that, for the dot product `x * y`, where `x` holds quantized model weights and `y` contains floating point values, it is faster to quantize...
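To make the comparison concrete, here is a minimal scalar sketch of the quantize-then-integer-dot approach (my illustration, not llama.cpp's actual kernels; `x` is simplified to int8 with a single scale, whereas the real weights are stored in 4-bit blocks):

```c
#include <stdint.h>
#include <stddef.h>
#include <math.h>

/* Quantize a float row to int8 with a single per-row scale. */
void quantize_row_i8(const float *y, int8_t *q, float *scale, size_t n) {
    float amax = 0.0f;
    for (size_t i = 0; i < n; ++i) amax = fmaxf(amax, fabsf(y[i]));
    *scale = amax / 127.0f;
    const float inv = amax > 0.0f ? 127.0f / amax : 0.0f;
    for (size_t i = 0; i < n; ++i) q[i] = (int8_t)roundf(y[i] * inv);
}

/* Integer dot product of two quantized rows; the float scales are
 * applied once at the end instead of per element. */
float dot_q(const int8_t *xq, float xs, const int8_t *yq, float ys, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i) acc += (int32_t)xq[i] * yq[i];
    return xs * ys * (float)acc;
}
```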

For `quantize-stats` we get `q4_2: rmse 0.00159301, maxerr 0.17480469, 95pct`

The idea being that `Q8_0`-quantized values get used many times in the matrix multiplications in which they are involved. In the current implementations, when we are evaluating the dot...
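To make the amortization concrete, here is a standalone scalar sketch (hypothetical API, not ggml's): in a matrix-vector product the same activation vector `y` is dotted with every row of the weight matrix, so quantizing `y` to 8 bits once is paid for nrows times over.

```c
#include <stdint.h>
#include <stdlib.h>
#include <math.h>

/* Sketch: Wq holds int8 weights with one scale per row; y is quantized
 * once and then reused for every row's dot product. */
void matvec_q8(const int8_t *Wq, const float *row_scale,
               const float *y, float *out, size_t nrows, size_t ncols) {
    int8_t *yq = malloc(ncols);
    float amax = 0.0f;
    for (size_t i = 0; i < ncols; ++i) amax = fmaxf(amax, fabsf(y[i]));
    const float ys  = amax / 127.0f;
    const float inv = amax > 0.0f ? 127.0f / amax : 0.0f;
    for (size_t i = 0; i < ncols; ++i)              /* quantize y once... */
        yq[i] = (int8_t)roundf(y[i] * inv);
    for (size_t r = 0; r < nrows; ++r) {            /* ...reuse nrows times */
        int32_t acc = 0;
        for (size_t i = 0; i < ncols; ++i)
            acc += (int32_t)Wq[r * ncols + i] * yq[i];
        out[r] = row_scale[r] * ys * (float)acc;
    }
    free(yq);
}
```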

performance

The PR adds a new build option (`LLAMA_NO_RMSE`), which is off by default. When it is off, quantization for all current types (`Q4_0`, `Q4_1`, `Q4_2`, `Q4_3`) is performed with RMSE minimization (on master...
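A hedged sketch of what RMSE-minimizing scale selection can look like for a `Q4_0`-style block (integers in `[-8, 7]` times a scale `d`): rather than fixing `d` from the block's max value alone, try a small grid of candidate scales and keep the one with the lowest squared error. This is an illustrative grid search; the PR's actual method may differ.

```c
#include <float.h>
#include <math.h>

float best_scale_q4(const float *x, int n /* block size, e.g. 32 */) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) amax = fmaxf(amax, fabsf(x[i]));
    float best_d = amax / 8.0f, best_err = FLT_MAX;
    for (int k = 0; k < 16; ++k) {
        const float d = (amax / 8.0f) * (1.0f - 0.02f * k); /* candidate */
        if (d == 0.0f) break;
        float err = 0.0f;
        for (int i = 0; i < n; ++i) {
            int q = (int)roundf(x[i] / d);
            if (q < -8) q = -8;
            if (q >  7) q =  7;
            const float e = x[i] - d * (float)q;
            err += e * e;
        }
        if (err < best_err) { best_err = err; best_d = d; }
    }
    return best_d;
}
```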

high priority
generation quality

Variable bit rate is commonly used in audio and video compression, so why not try it on LLMs? My guess is that a locally adaptive variable bit rate would require a...

enhancement
generation quality
Less than 4 bits

Implemented mostly following the `Q4_0` Metal implementation. Slightly slower than `Q4_0`: on my 30-core M2 Max GPU with `256` tokens it takes `28.1` ms/token, compared to `27.0` ms/token for `Q4_0`.

27.1 ms/token on a 30-core M2 Max GPU, so about the same speed as `Q4_0`. Memory throughput is ~156 GB/s. The access pattern used in the `Q2_K` CUDA implementation...
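As a rough consistency check (my arithmetic, not a figure from the PR): 156 GB/s × 27.1 ms ≈ 4.2 GB of memory traffic per generated token, i.e. generation at this speed is moving on the order of the whole model through memory once per token.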

As discussed [elsewhere](https://github.com/ggerganov/llama.cpp/pull/6840#issuecomment-2079823076), here is a PR that improves AVX2 prompt processing for k-quants and `IQ4_XS` by a large margin. I did not manage to get the speed gains via...
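For background, this is the kind of AVX2 int8 building block such matrix-multiplication kernels lean on (a generic sketch, not the PR's code): `_mm256_maddubs_epi16` requires one unsigned operand, so the usual trick is to take `|a|` and move `a`'s sign onto `b`.

```c
#include <immintrin.h>
#include <stdint.h>

/* Sum of a[i]*b[i] for 32 int8 values. The int16 partial sums cannot
 * overflow for quant values in [-127, 127]. */
static inline int dot_i8_32_avx2(const int8_t *a, const int8_t *b) {
    const __m256i va    = _mm256_loadu_si256((const __m256i *)a);
    const __m256i vb    = _mm256_loadu_si256((const __m256i *)b);
    const __m256i abs_a = _mm256_sign_epi8(va, va);         /* |a|       */
    const __m256i sgn_b = _mm256_sign_epi8(vb, va);         /* b*sign(a) */
    const __m256i p16   = _mm256_maddubs_epi16(abs_a, sgn_b);          /* 16 x int16 */
    const __m256i p32   = _mm256_madd_epi16(p16, _mm256_set1_epi16(1)); /* 8 x int32 */
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(p32),
                              _mm256_extracti128_si256(p32, 1));
    s = _mm_hadd_epi32(s, s);
    s = _mm_hadd_epi32(s, s);
    return _mm_cvtsi128_si32(s);
}
```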

It seems some people still use the `ggml` legacy quants `Q4_0`, `Q4_1`, `Q5_0` and `Q5_1`, so here is a PR that improves matrix multiplication performance for these quants on AVX2....
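For reference, a sketch of the legacy `Q4_0` block layout (field names follow ggml's conventions, but the nibble ordering and exact header are illustrative): 32 weights per block, stored as a single fp16 scale plus 16 bytes of packed 4-bit quants, where a quant `q` dequantizes to `d * (q - 8)`.

```c
#include <stdint.h>

#define QK4_0 32
typedef struct {
    uint16_t d;              /* scale, stored as fp16 */
    uint8_t  qs[QK4_0 / 2];  /* two 4-bit quants per byte */
} block_q4_0;

/* Dequantize one block, given the scale already converted to float.
 * Nibble order here is illustrative; ggml's in-block ordering differs. */
static void dequant_q4_0(const block_q4_0 *b, float d, float *out) {
    for (int i = 0; i < QK4_0 / 2; ++i) {
        out[2*i + 0] = d * (float)((b->qs[i] & 0x0F) - 8);
        out[2*i + 1] = d * (float)((b->qs[i] >> 4)   - 8);
    }
}
```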