llama.cpp
perf: parallelize quantization
https://github.com/ggerganov/llama.cpp/blob/8b679987cdce292ff36bd741f6715e4927e26f9b/llama.cpp#L1558
The quantization loop linked above is currently single-threaded, and quantization is quite slow (vicuna 7B: 65156.31 ms, vicuna 13B: 129902.48 ms).
@ikawrakow parallelized quantization in #896 (see kQuantizeQ4 in ggml_extra.cpp), but that's for a new quantization scheme. https://github.com/ggerganov/llama.cpp/blob/6bfb00a53b1a06e209f1b814356dd79ee96b89af/ggml_extra.cpp#L287-L291
It did indeed speed things up. The threading could probably be integrated into llama_model_quantize_internal so that a new .cpp module isn't necessary.
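For illustration, here is a minimal sketch of how the per-tensor work could be split across threads inside llama_model_quantize_internal. The chunking and the `quantize_chunk_fn` signature are assumptions modeled on the existing ggml_quantize_q4_0-style interface, not the actual #896 or #1075 code:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Placeholder for a chunk quantizer with a ggml_quantize_q4_0-like signature:
// n = number of floats in the chunk, k = row length, hist = histogram of quant values.
using quantize_chunk_fn = size_t (*)(const float * src, void * dst, int n, int k, int64_t * hist);

// Split the rows of one tensor across nthread workers; each worker quantizes
// its own contiguous slice, so no locking is needed until the final join.
static size_t quantize_tensor_parallel(
        const float * src, void * dst,
        int nrows, int row_len, size_t row_size_q,
        quantize_chunk_fn fn, int nthread) {
    std::vector<std::thread> workers;
    std::vector<size_t> sizes(nthread, 0);
    std::vector<std::vector<int64_t>> hists(nthread, std::vector<int64_t>(16, 0));

    const int rows_per_thread = (nrows + nthread - 1) / nthread;

    for (int t = 0; t < nthread; ++t) {
        const int first = t * rows_per_thread;
        const int count = std::min(rows_per_thread, nrows - first);
        if (count <= 0) break;
        workers.emplace_back([=, &sizes, &hists]() {
            sizes[t] = fn(src + (size_t) first * row_len,
                          (char *) dst + (size_t) first * row_size_q,
                          count * row_len, row_len,
                          hists[t].data());
        });
    }
    for (auto & w : workers) w.join();

    size_t total = 0;
    for (auto s : sizes) total += s;   // per-thread hists would be summed similarly
    return total;
}
```

Since each thread writes only to its own slice of the output buffer and its own histogram, the only synchronization needed is the final join plus summing the per-thread sizes and histograms.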
Is the new quantization scheme the one that minimizes MSE against the original weights?
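For context, "minimizes MSE against the original weights" typically means the per-block scale is chosen so that the round-trip quantization error is smallest, rather than being derived directly from the max absolute value. A simplified sketch of that idea (not the actual kQuantizeQ4 code) might be:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>

// Return the scale d that minimizes sum_i (x[i] - d * round(x[i]/d))^2,
// searching a small grid around the max-abs based scale.
static float best_scale_mse(const float * x, size_t n, int qmax /* e.g. 7 for 4-bit */) {
    float amax = 0.0f;
    for (size_t i = 0; i < n; ++i) amax = std::max(amax, std::fabs(x[i]));
    if (amax == 0.0f) return 0.0f;

    float best_d   = amax / qmax;
    float best_err = std::numeric_limits<float>::max();

    for (int step = -8; step <= 8; ++step) {
        const float d = amax / qmax * (1.0f + 0.02f * step);
        float err = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            float q = std::round(x[i] / d);
            if (q >  qmax) q =  qmax;
            if (q < -qmax) q = -qmax;
            const float diff = x[i] - d * q;
            err += diff * diff;
        }
        if (err < best_err) { best_err = err; best_d = d; }
    }
    return best_d;
}
```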
Resolved by #1075