
perf: parallelize quantization

Open jon-chuang opened this issue 2 years ago • 2 comments

https://github.com/ggerganov/llama.cpp/blob/8b679987cdce292ff36bd741f6715e4927e26f9b/llama.cpp#L1558

The quantization loop linked above is currently single-threaded, and quantization is quite slow (vicuna 7B: 65156.31 ms, vicuna 13B: 129902.48 ms).

jon-chuang · Apr 12 '23 03:04
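
For context, here is a minimal sketch of the kind of parallelization being asked for: splitting row-wise quantization across `std::thread` workers. `quantize_parallel`, `quantize_row_fn`, and all parameter names are hypothetical illustrations, not llama.cpp's actual API.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical per-row quantizer: reads n floats from src and writes the
// quantized block to dst.
using quantize_row_fn = std::function<void(const float * src, void * dst, int n)>;

// Split rows evenly across nthreads workers; each thread owns a disjoint
// slice of rows, so no synchronization is needed beyond the final join.
static void quantize_parallel(const float * src, uint8_t * dst,
                              int64_t nrows, int64_t ncols,
                              size_t row_bytes, // quantized size of one row
                              const quantize_row_fn & qrow,
                              int nthreads) {
    std::vector<std::thread> workers;
    const int64_t chunk = (nrows + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; ++t) {
        const int64_t r0 = t * chunk;
        const int64_t r1 = std::min(nrows, r0 + chunk);
        if (r0 >= r1) break;
        workers.emplace_back([=, &qrow] {
            for (int64_t r = r0; r < r1; ++r) {
                qrow(src + r * ncols, dst + r * row_bytes, (int) ncols);
            }
        });
    }
    for (auto & w : workers) {
        w.join();
    }
}
```

This kind of row partitioning is straightforward here because each row (and each block within a row) is quantized independently of the others.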

@ikawrakow did that in #896, see kQuantizeQ4 in ggml_extra.cpp, but that's for a new quantization scheme. https://github.com/ggerganov/llama.cpp/blob/6bfb00a53b1a06e209f1b814356dd79ee96b89af/ggml_extra.cpp#L287-L291

It did indeed speed things up. This could probably be integrated into llama_model_quantize_internal so that a separate .cpp module isn't necessary.

sw · Apr 12 '23 12:04

Is the new quantization scheme the one that minimizes MSE against the original weights?

jon-chuang · Apr 12 '23 14:04
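
For readers landing here: the question refers to a quantizer that picks the block scale by minimizing the round-trip squared error against the original weights, rather than deriving it directly from the block's maximum magnitude. A rough sketch of that idea for signed 4-bit blocks follows; this is only an illustration of the concept, not the code from #896, and `quantize_block_mse` is a made-up name.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Quantize a block of n floats to signed 4-bit levels in [-8, 7], picking
// the scale that minimizes the round-trip squared error instead of using
// max(|x|)/7 directly. Returns the chosen scale; codes are written to q.
static float quantize_block_mse(const float * x, int8_t * q, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) amax = std::max(amax, std::fabs(x[i]));
    if (amax == 0.0f) {
        for (int i = 0; i < n; ++i) q[i] = 0;
        return 0.0f;
    }
    float best_scale = amax / 7.0f;
    float best_err = INFINITY;
    // scan a small grid of candidate scales around the naive max-based choice
    for (int k = 0; k < 20; ++k) {
        const float scale = amax / 7.0f * (0.80f + 0.02f * k);
        float err = 0.0f;
        for (int i = 0; i < n; ++i) {
            int v = (int) std::lround(x[i] / scale);
            v = std::max(-8, std::min(7, v));
            const float d = x[i] - scale * (float) v;
            err += d * d;
        }
        if (err < best_err) { best_err = err; best_scale = scale; }
    }
    for (int i = 0; i < n; ++i) {
        int v = (int) std::lround(x[i] / best_scale);
        q[i] = (int8_t) std::max(-8, std::min(7, v));
    }
    return best_scale;
}
```

Note that each block is still quantized independently, so a scheme like this parallelizes across rows in exactly the same way as the sketch above; the scale search just makes each block more expensive, which increases the payoff from threading.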

Resolved by #1075

sw · Apr 22 '23 17:04