Results: 99 comments of Stephan Walter

@ikawrakow did that in #896, see `kQuantizeQ4` in `ggml_extra.cpp`, but that's for a new quantization scheme. https://github.com/ggerganov/llama.cpp/blob/6bfb00a53b1a06e209f1b814356dd79ee96b89af/ggml_extra.cpp#L287-L291 It did indeed speed things up. This could probably be integrated into `llama_model_quantize_internal`...
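
Roughly, the speed-up comes from quantizing independent row ranges on separate threads. A minimal sketch of that pattern follows; `quantize_chunk`/`quantize_parallel` are placeholder names (this is not the actual `kQuantizeQ4` code), and a trivial int8 truncation stands in for the real kernel:

```c
#include <pthread.h>
#include <stdint.h>

// Hypothetical per-range worker arguments.
typedef struct {
    const float *src;
    int8_t      *dst;
    int          first, last;  // row range [first, last)
    int          row_size;     // elements per row
} chunk_args;

static void *quantize_chunk(void *p) {
    chunk_args *a = (chunk_args *) p;
    for (int i = a->first * a->row_size; i < a->last * a->row_size; i++) {
        a->dst[i] = (int8_t) a->src[i];  // placeholder for the real kernel
    }
    return NULL;
}

// Split the rows into contiguous, non-overlapping chunks, one worker per
// chunk; no locking is needed because the output ranges are disjoint.
static void quantize_parallel(const float *src, int8_t *dst,
                              int nrows, int row_size, int nthread) {
    enum { MAX_THREADS = 16 };
    pthread_t  tid[MAX_THREADS];
    chunk_args args[MAX_THREADS];
    if (nthread > MAX_THREADS) nthread = MAX_THREADS;
    const int per_thread = (nrows + nthread - 1) / nthread;
    int n = 0;
    for (int t = 0; t < nthread; t++) {
        const int first = t * per_thread;
        int last = first + per_thread;
        if (last > nrows) last = nrows;
        if (first >= last) break;
        args[n] = (chunk_args) { src, dst, first, last, row_size };
        pthread_create(&tid[n], NULL, quantize_chunk, &args[n]);
        n++;
    }
    for (int i = 0; i < n; i++) pthread_join(tid[i], NULL);
}
```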

There is now an assert that checks `mem_buffer`, even in non-debug builds: https://github.com/ggerganov/llama.cpp/blob/173d0e6419e8f8f3c1f4f13201b777f4c60629f3/ggml.c#L4571 Closing this as it's quite old; please re-open if you still encounter the problem with a recent...
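
For reference, the usual pattern for an assert that survives release builds; ggml's `GGML_ASSERT` works along these lines (a sketch, `MY_ASSERT` is a placeholder name):

```c
#include <stdio.h>
#include <stdlib.h>

// Unlike <assert.h> asserts, which vanish when NDEBUG is defined,
// this check also runs in release builds and aborts with a location.
#define MY_ASSERT(x)                                                   \
    do {                                                               \
        if (!(x)) {                                                    \
            fprintf(stderr, "ASSERT: %s:%d: %s\n",                     \
                    __FILE__, __LINE__, #x);                           \
            abort();                                                   \
        }                                                              \
    } while (0)
```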

It's also missing from the description for `--outtype`. According to the README, you would use `quantize` if you wanted q4_0 or q4_1, right?

Well, #1083 was a bit rushed IMO, but I tried to address the loose ends. For the horizontal sum of ints, I could not see a difference in speed between...
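
For context, the two common ways to horizontally sum eight 32-bit ints in a `__m256i` are sketched below; these are likely the kind of variants in question (function names are mine, AVX2 assumed):

```c
#include <immintrin.h>

// Variant A: reduce 256 -> 128 bits, then two pairwise hadds.
static inline int hsum_i32_hadd(__m256i v) {
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(v),
                              _mm256_extracti128_si256(v, 1));
    s = _mm_hadd_epi32(s, s);  // {a+b, c+d, a+b, c+d}
    s = _mm_hadd_epi32(s, s);  // every lane now holds the total
    return _mm_cvtsi128_si32(s);
}

// Variant B: same 256 -> 128 reduction, then shuffle + add twice.
static inline int hsum_i32_shuffle(__m256i v) {
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(v),
                              _mm256_extracti128_si256(v, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);
}
```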

Finally, I don't think there is a speed difference in the horizontal sums. I have now finished the AVX optimization for `quantize_row_q8_0`, but I'm not sure I can trust the...
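
For checking an AVX path like this, a scalar reference is handy. A minimal sketch of Q8_0 quantization as I understand the format here, assuming the block layout is one float scale plus 32 int8 values:

```c
#include <math.h>
#include <stdint.h>

#define QK8_0 32

// Assumed block layout: one scale plus QK8_0 quantized values.
typedef struct {
    float  d;           // scale
    int8_t qs[QK8_0];   // quantized values
} block_q8_0;

// Scalar reference: per block, d = max|x| / 127, q = round(x / d).
static void quantize_row_q8_0_ref(const float *x, block_q8_0 *y, int k) {
    const int nb = k / QK8_0;
    for (int i = 0; i < nb; i++) {
        float amax = 0.0f;  // absolute max over the block
        for (int j = 0; j < QK8_0; j++) {
            const float v = fabsf(x[i*QK8_0 + j]);
            if (v > amax) amax = v;
        }
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        y[i].d = d;
        for (int j = 0; j < QK8_0; j++) {
            y[i].qs[j] = (int8_t) roundf(x[i*QK8_0 + j] * id);
        }
    }
}
```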

Did it work for you with commit 2a2e63c, and can you narrow down the commit that broke it? In #1237, I changed some `size_t` parameters to `int`; I'm now worried...
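
The worry is the usual narrowing hazard: a `size_t` larger than `INT_MAX` has an implementation-defined result (typically a negative value) when converted to `int`. A standalone illustration, with a hypothetical `process` callee standing in for the narrowed signatures:

```c
#include <limits.h>
#include <stdio.h>

// Hypothetical callee taking the narrowed parameter.
static void process(int n) {
    // On typical 64-bit targets this prints a negative size.
    printf("n = %d\n", n);
}

int main(void) {
    size_t big = (size_t) INT_MAX + 2;  // 2147483649 on 64-bit
    process((int) big);                 // implementation-defined result,
                                        // commonly wraps to -2147483647
    return 0;
}
```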

No complaints after three weeks; let's assume this is fixed, possibly by #252.

The Python dependencies in `.devops/full.Dockerfile` should also be updated; this will conflict with my PR #293.

Presumably fixed by #563; please re-open if it's still an issue with a recent revision.