### Update

After seeing PR #835, I pushed some more changes that only affect the `Q4_0` results. I now get

```
rmse = 0.00185228
```

for the 7B model. Perplexity...
I was surprised by the belief that, for the dot product `x * y`, where `x` holds quantized model weights and `y` contains floating point values, it is faster to quantize...
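Since the excerpt is truncated, here is a minimal sketch of what quantizing `y` to 8 bits can look like, using a hypothetical block layout that mirrors ggml's `Q8_0` (one float scale per 32 signed 8-bit values); the names are illustrative, not the PR's actual code:

```c
#include <math.h>
#include <stdint.h>

#define QK8_0 32

// Hypothetical Q8_0 block: one float scale per 32 values.
typedef struct {
    float  d;          // scale
    int8_t qs[QK8_0];  // quantized values in [-127, 127]
} block_q8_0;

// Quantize one block of 32 floats to 8 bits.
static void quantize_block_q8_0(const float *y, block_q8_0 *b) {
    float amax = 0.0f;
    for (int j = 0; j < QK8_0; ++j) {
        const float a = fabsf(y[j]);
        if (a > amax) amax = a;
    }
    b->d = amax / 127.0f;
    const float id = b->d ? 1.0f / b->d : 0.0f;
    for (int j = 0; j < QK8_0; ++j) {
        b->qs[j] = (int8_t)roundf(y[j] * id);
    }
}
```

With both operands in integer form, the dot product reduces to int8 multiply-adds plus a single multiply by `d_x * d_y` per block.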
For `quantize-stats` we get

```
q4_2: rmse 0.00159301, maxerr 0.17480469, 95pct
```
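For reference, a hedged sketch of how such error statistics can be computed: round-trip the weights through quantize/dequantize and accumulate the errors. The `quant_error` helper and the `qerr` type are hypothetical stand-ins, not the actual `quantize-stats` code:

```c
#include <math.h>
#include <stddef.h>

typedef struct { double rmse, maxerr; } qerr;

// w:            original weights
// w_roundtrip:  the same weights after quantize + dequantize
static qerr quant_error(const float *w, const float *w_roundtrip, size_t n) {
    double sum2 = 0.0, maxerr = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double e = fabs((double)w[i] - (double)w_roundtrip[i]);
        sum2 += e * e;
        if (e > maxerr) maxerr = e;
    }
    return (qerr){ sqrt(sum2 / n), maxerr };
}
```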
The idea being that `Q8_0`-quantized values get used many times in the matrix multiplications in which they are involved. In the current implementation, when we are evaluating the dot...
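To illustrate the amortization argument, here is a sketch of a matrix-vector product in which `y` is quantized once and then reused by every row's dot product. The block layouts match the earlier sketch, and `vec_dot_q4_0_q8_0` is a hypothetical `Q4_0` × `Q8_0` integer dot product (an AVX2 sketch of one appears further below):

```c
#include <stdint.h>

#define QK8_0 32

typedef struct { float d; uint8_t qs[QK8_0/2]; } block_q4_0; // 32 x 4-bit + scale
typedef struct { float d; int8_t  qs[QK8_0];   } block_q8_0; // 32 x 8-bit + scale

void  quantize_block_q8_0(const float *y, block_q8_0 *b);                        // sketched above
float vec_dot_q4_0_q8_0(int nblocks, const block_q4_0 *x, const block_q8_0 *y);  // sketched below

// Matrix-vector product: y is quantized once, then reused by every row,
// so the one-time quantization cost is amortized over nrows dot products.
void matvec_q4_0(int nrows, int ncols, const block_q4_0 *x, const float *y, float *out) {
    const int nblocks = ncols / QK8_0;
    block_q8_0 yq[nblocks];                        // quantized once ...
    for (int i = 0; i < nblocks; ++i) {
        quantize_block_q8_0(y + i * QK8_0, &yq[i]);
    }
    for (int r = 0; r < nrows; ++r) {              // ... used nrows times
        out[r] = vec_dot_q4_0_q8_0(nblocks, x + r * nblocks, yq);
    }
}
```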
The PR adds a new build option (`LLAMA_NO_RMSE`), which is off by default. When it is off, quantization for all current types (`Q4_0`, `Q4_1`, `Q4_2`, `Q4_3`) is performed with RMSE minimization (on master...
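As a rough illustration of what per-block RMSE minimization can look like (this is a sketch of the general idea, not the PR's actual algorithm; the candidate-scale count and scan range are made up):

```c
#include <math.h>

#define NSCALE 16  // number of candidate scales to try (illustrative)

// Find a scale for one block of 4-bit quantization (values in [-8, 7]).
// Instead of fixing the scale from the max, try candidate scales around
// it and keep the one with the lowest squared reconstruction error.
static float best_scale_q4(const float *x, int n /* block size, e.g. 32 */) {
    float amax = 0.0f;
    for (int j = 0; j < n; ++j) if (fabsf(x[j]) > amax) amax = fabsf(x[j]);
    if (amax == 0.0f) return 0.0f;
    float best_d = amax / 8.0f, best_err = INFINITY;
    for (int k = 0; k < NSCALE; ++k) {
        // Candidate scales from ~0.6x to ~1.1x of the max-based scale.
        const float d  = amax / 8.0f * (0.6f + 0.5f * k / (NSCALE - 1));
        const float id = 1.0f / d;
        float err = 0.0f;
        for (int j = 0; j < n; ++j) {
            int q = (int)roundf(x[j] * id);
            if (q < -8) q = -8; else if (q > 7) q = 7;
            const float diff = x[j] - d * q;
            err += diff * diff;
        }
        if (err < best_err) { best_err = err; best_d = d; }
    }
    return best_d;
}
```

Scanning a handful of candidate scales per 32-weight block is cheap, and it happens only once, at model quantization time.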
Variable bit rate is commonly used in audio and video compression, so why not try it on LLMs? My guess is that a locally adaptive variable bit rate would require a...
Implemented mostly following the `Q4_0` Metal implementation. It is slightly slower than `Q4_0`: on my 30-core M2 Max GPU with `256` tokens it takes `28.1` ms/token, compared to `27.0` ms/token for `Q4_0`.
27.1 ms/token on a 30-core M2 Max GPU, so about the same speed as `Q4_0`. Memory throughput is ~156 GB/s. The access pattern used in the `Q2_K` CUDA implementation...
As discussed [elsewhere](https://github.com/ggerganov/llama.cpp/pull/6840#issuecomment-2079823076), here is a PR that improves AVX2 prompt processing for k-quants and `IQ4_XS` by a large margin. I did not manage to get the speed gains via...
It seems some people still use the `ggml` legacy quants `Q4_0`, `Q4_1`, `Q5_0` and `Q5_1`, so here is a PR that improves matrix multiplication performance for these quants on AVX2...
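For flavor, here is a hedged AVX2 sketch of the kind of fused `Q4_0` × `Q8_0` dot product such kernels build on. The block layouts are the hypothetical ones from the sketches above, and this is illustrative rather than the PR's actual code (requires AVX2 + FMA, e.g. `-mavx2 -mfma`):

```c
#include <immintrin.h>
#include <stdint.h>

#define QK 32

typedef struct { float d; uint8_t qs[QK/2]; } block_q4_0; // 32 x 4-bit + scale
typedef struct { float d; int8_t  qs[QK];   } block_q8_0; // 32 x 8-bit + scale

// Dot product of a Q4_0 row segment with a Q8_0 row segment.
// Assumed nibble layout: qs[j] holds element j (low nibble) and
// element j+16 (high nibble), both stored with a +8 offset.
static float vec_dot_q4_0_q8_0(int nblocks, const block_q4_0 *x, const block_q8_0 *y) {
    __m256 acc = _mm256_setzero_ps();
    const __m256i low4 = _mm256_set1_epi8(0x0F);
    const __m256i off8 = _mm256_set1_epi8(8);
    for (int i = 0; i < nblocks; ++i) {
        // Expand 16 bytes of nibbles to 32 bytes: elements 0..15, then 16..31.
        const __m128i packed = _mm_loadu_si128((const __m128i *)x[i].qs);
        __m256i q4 = _mm256_set_m128i(_mm_srli_epi16(packed, 4), packed);
        q4 = _mm256_sub_epi8(_mm256_and_si256(q4, low4), off8); // -> [-8, 7]
        const __m256i q8 = _mm256_loadu_si256((const __m256i *)y[i].qs);
        // maddubs needs an unsigned first operand: use |q4| and fold q4's sign into q8.
        const __m256i p16 = _mm256_maddubs_epi16(_mm256_sign_epi8(q4, q4),
                                                 _mm256_sign_epi8(q8, q4));
        const __m256i p32 = _mm256_madd_epi16(p16, _mm256_set1_epi16(1));
        acc = _mm256_fmadd_ps(_mm256_set1_ps(x[i].d * y[i].d),
                              _mm256_cvtepi32_ps(p32), acc);
    }
    // Horizontal sum of the 8 float lanes.
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}
```

The `_mm256_maddubs_epi16` trick (unsigned × signed byte multiply-add) is the standard way to get 8-bit dot products on AVX2, since the instruction set has no signed × signed byte multiply.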