2-bit integer quantization
Add `Q2_0` and `Q2_1` quantization support to `ggml`:

- Follow the existing `Q4_0` and `Q4_1` implementations
- Implement reference scalar quantization and dequantization routines (a sketch of what these could look like follows this list)
- I suspect we might have to use `QK == 16` in this case to compensate for further accuracy losses
- Add SIMD support for a specific architecture - investigate the best strategy to perform the `ggml_vec_dot_q2()` computation
- No need to implement `ggml_vec_mad_q2()` - these will be deprecated soon
- Compute perplexity scores
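To make the reference-routine and dot-product items concrete, here is a minimal scalar sketch, assuming a `block_q2_0` layout that mirrors `block_q4_0` (one fp32 scale followed by the 2-bit quants, four per byte) and a symmetric absmax mapping to {-1, 0, 1}. The struct, function names, and mapping are illustrative, not final `ggml` definitions:

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

#define QK 16 // block size; 16 is suggested above to limit accuracy loss

// hypothetical block layout, mirroring block_q4_0
typedef struct {
    float   d;          // scaling factor
    uint8_t qs[QK / 4]; // 2-bit quants, four per byte
} block_q2_0;

// reference scalar quantization: absmax scaling as in Q4_0,
// q = round(x/d) in [-1, 1], stored biased by +2 in 2 bits
static void quantize_row_q2_0_ref(const float * x, block_q2_0 * y, int k) {
    assert(k % QK == 0);

    for (int i = 0; i < k/QK; i++) {
        float amax = 0.0f; // absolute max over the block
        for (int l = 0; l < QK; l++) {
            const float v = fabsf(x[i*QK + l]);
            if (v > amax) amax = v;
        }

        const float d  = amax; // max representable |q| is 1
        const float id = d ? 1.0f/d : 0.0f;

        y[i].d = d;

        for (int l = 0; l < QK; l += 4) {
            uint8_t b = 0;
            for (int j = 0; j < 4; j++) {
                const int q = (int)roundf(x[i*QK + l + j]*id); // -1, 0 or 1
                b |= (uint8_t)(q + 2) << (2*j);
            }
            y[i].qs[l/4] = b;
        }
    }
}

// reference scalar dequantization
static void dequantize_row_q2_0_ref(const block_q2_0 * x, float * y, int k) {
    assert(k % QK == 0);

    for (int i = 0; i < k/QK; i++) {
        for (int l = 0; l < QK; l++) {
            const int q = ((x[i].qs[l/4] >> (2*(l%4))) & 0x03) - 2;
            y[i*QK + l] = q*x[i].d;
        }
    }
}

// scalar baseline for the ggml_vec_dot_q2() strategy: accumulate the
// integer products per block, then apply both scales once
static float vec_dot_q2_0_ref(int n, const block_q2_0 * x, const block_q2_0 * y) {
    assert(n % QK == 0);

    float sum = 0.0f;
    for (int i = 0; i < n/QK; i++) {
        int isum = 0;
        for (int l = 0; l < QK; l++) {
            const int qx = ((x[i].qs[l/4] >> (2*(l%4))) & 0x03) - 2;
            const int qy = ((y[i].qs[l/4] >> (2*(l%4))) & 0x03) - 2;
            isum += qx*qy;
        }
        sum += x[i].d*y[i].d*(float)isum;
    }
    return sum;
}
```

The per-block integer accumulation keeps the float multiplies out of the inner loop; a SIMD version would unpack the 2-bit quants in registers instead of byte by byte.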
The expected model sizes for 7B and `QK == 16` are:

- `Q2_0` - 3.2 GB

For `QK == 32` we have:

- `Q2_0` - 2.4 GB
- `Q2_1` - 3.2 GB
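For context, these figures follow from the bits per weight, assuming each `Q2_0` block carries one fp32 scale and each `Q2_1` block carries an fp32 scale plus an fp32 minimum, analogous to `Q4_0`/`Q4_1`:

- `Q2_0`, `QK == 32`: (32 × 2 + 32) bits / 32 weights = 3 bits per weight → 7B × 3 / 8 ≈ 2.6 GB (≈ 2.4 GiB)
- `Q2_1`, `QK == 32`: (32 × 2 + 2 × 32) bits / 32 weights = 4 bits per weight → 7B × 4 / 8 ≈ 3.5 GB (≈ 3.2 GiB)
- `Q2_0`, `QK == 16`: (16 × 2 + 32) bits / 16 weights = 4 bits per weight → likewise ≈ 3.2 GiB

This is also why halving `QK` costs `Q2_0` a full bit per weight: the fp32 scale is amortized over half as many quants.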
Before you send me papers that show 2-bit quantization does not work - no need. I want to have this supported anyway. I have something in mind. The effort needed to add this support is so small that there is no reason not to do it.