
2-bit integer quantization

Open · ggerganov opened this issue on Mar 24, 2023 · 12 comments

Add Q2_0 and Q2_1 quantization support to ggml:

  • Follow the existing Q4_0 and Q4_1 implementations
  • Implement reference scalar quantization and dequantization routines (a rough sketch follows below this list)
  • I suspect we might have to use QK == 16 in this case to compensate for further accuracy losses
  • Add SIMD support for a specific architecture - investigate the best strategy to perform the ggml_vec_dot_q2() computation (a scalar baseline is sketched below)
  • No need to implement ggml_vec_mad_q2() - these will be deprecated soon
  • Compute perplexity scores
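As a starting point, here is a minimal sketch of what the reference routines could look like, carrying the Q4_0 block layout over to 2 bits. The `block_q2_0` struct, `QK2_0`, and the function names are placeholders for illustration, not the actual ggml definitions:

```c
// A minimal sketch, assuming a Q4_0-style block layout carried over to 2 bits.
// block_q2_0, QK2_0 and the function names are placeholders, not ggml's definitions.
#include <stdint.h>
#include <math.h>
#include <assert.h>

#define QK2_0 16                  // smaller block size to limit accuracy loss

typedef struct {
    float   d;                    // per-block scale
    uint8_t qs[QK2_0 / 4];        // 2-bit quants, 4 per byte
} block_q2_0;

// Reference scalar quantization, mirroring quantize_row_q4_0_reference.
// Note: this naive analogue of the Q4_0 scheme only uses 3 of the 4 levels
// ({-1, 0, 1}), which is part of why 2-bit accuracy is expected to suffer.
static void quantize_row_q2_0_reference(const float * x, block_q2_0 * y, int k) {
    assert(k % QK2_0 == 0);
    const int nb = k / QK2_0;

    for (int i = 0; i < nb; i++) {
        float amax = 0.0f;        // absolute max in this block
        for (int l = 0; l < QK2_0; l++) {
            const float v = fabsf(x[i*QK2_0 + l]);
            if (v > amax) amax = v;
        }

        const float d  = amax / ((1 << 1) - 1);   // Q4_0 formula, with 1 value bit
        const float id = d != 0.0f ? 1.0f/d : 0.0f;

        y[i].d = d;

        for (int l = 0; l < QK2_0; l += 4) {
            uint8_t b = 0;
            for (int j = 0; j < 4; j++) {
                const int q = (int)roundf(x[i*QK2_0 + l + j]*id) + 2;  // in {1, 2, 3}
                assert(q >= 1 && q <= 3);
                b |= (uint8_t)q << (2*j);
            }
            y[i].qs[l/4] = b;
        }
    }
}

// Reference scalar dequantization.
static void dequantize_row_q2_0(const block_q2_0 * x, float * y, int k) {
    assert(k % QK2_0 == 0);
    const int nb = k / QK2_0;

    for (int i = 0; i < nb; i++) {
        for (int l = 0; l < QK2_0; l++) {
            const int q = (x[i].qs[l/4] >> (2*(l % 4))) & 0x3;
            y[i*QK2_0 + l] = (q - 2)*x[i].d;
        }
    }
}
```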
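For the dot product, a plain scalar version would serve as the baseline that any SIMD ggml_vec_dot_q2() implementation has to match. The signature below is an assumption based on the sketch above, not the actual ggml interface:

```c
// Scalar sketch of the Q2_0 dot product; signature and types are assumptions
// based on the block_q2_0 layout sketched above.
static float vec_dot_q2_0_scalar(int n, const block_q2_0 * x, const block_q2_0 * y) {
    assert(n % QK2_0 == 0);
    const int nb = n / QK2_0;

    float sum = 0.0f;
    for (int i = 0; i < nb; i++) {
        int isum = 0;                           // integer dot product within the block
        for (int l = 0; l < QK2_0; l++) {
            const int qx = ((x[i].qs[l/4] >> (2*(l % 4))) & 0x3) - 2;
            const int qy = ((y[i].qs[l/4] >> (2*(l % 4))) & 0x3) - 2;
            isum += qx*qy;
        }
        sum += x[i].d * y[i].d * (float)isum;   // scale the per-block integer sum
    }
    return sum;
}
```

A SIMD version would presumably unpack the 2-bit values into bytes with shifts and masks and accumulate the products in integer registers, scaling by the block scales only once per block.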

The expected model sizes for 7B and QK == 16 are:

  • Q2_0 - 3.2 GB

For QK == 32 we have:

  • Q2_0 - 2.4 GB
  • Q2_1 - 3.2 GB
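These numbers roughly follow from the per-block storage cost. A back-of-the-envelope check, assuming ~7e9 quantized weights and an fp32 scale (plus an fp32 min for Q2_1) per block:

```c
// Illustrative size check only; assumes ~7e9 weights, fp32 scale, fp32 min for Q2_1.
#include <stdio.h>

int main(void) {
    const double n_weights = 7e9;
    const double GiB = 1024.0*1024.0*1024.0;

    // Q2_0, QK == 32: 32*2 bits of quants + 4-byte scale = 12 bytes per 32 weights
    printf("Q2_0, QK=32: %.1f GiB\n", n_weights/32.0*12.0/GiB);   // ~2.4
    // Q2_1, QK == 32: 12 bytes + 4-byte min = 16 bytes per 32 weights
    printf("Q2_1, QK=32: %.1f GiB\n", n_weights/32.0*16.0/GiB);   // ~3.3
    // Q2_0, QK == 16: 16*2 bits of quants + 4-byte scale = 8 bytes per 16 weights
    printf("Q2_0, QK=16: %.1f GiB\n", n_weights/16.0*8.0/GiB);    // ~3.3
    return 0;
}
```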

Before you send me papers showing that 2-bit quantization does not work - no need. I want to have this supported anyway. I have something in mind. The effort needed to add this support is so small that there is no reason not to do it.

ggerganov · Mar 24 '23 06:03