
2-bit integer quantization

Open · ggerganov opened this issue on Mar 24, 2023 · 12 comments

Add Q2_0 and Q2_1 quantization support to ggml:

  • Follow the existing Q4_0 and Q4_1 implementations
  • Implement reference scalar quantization and dequantization routines (a rough sketch follows below this list)
  • I suspect we might have to use QK == 16 in this case to compensate for further accuracy losses
  • Add SIMD support for a specific architecture - investigate the best strategy to perform the ggml_vec_dot_q2() computation (a scalar baseline is sketched below)
  • No need to implement ggml_vec_mad_q2() - these will be deprecated soon
  • Compute perplexity scores
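As a starting point, here is a minimal sketch of what the reference routines could look like, carrying the Q4_0 block layout over to 2 bits. The `block_q2_0` struct, `QK2_0`, and the function names are placeholders for illustration, not the actual ggml definitions:

```c
// A minimal sketch, assuming a Q4_0-style block layout carried over to 2 bits.
// block_q2_0, QK2_0 and the function names are placeholders, not ggml's definitions.
#include <stdint.h>
#include <math.h>
#include <assert.h>

#define QK2_0 16                  // smaller block size to limit accuracy loss

typedef struct {
    float   d;                    // per-block scale
    uint8_t qs[QK2_0 / 4];        // 2-bit quants, 4 per byte
} block_q2_0;

// Reference scalar quantization, mirroring quantize_row_q4_0_reference.
// Note: this naive analogue of the Q4_0 scheme only uses 3 of the 4 levels
// ({-1, 0, 1}), which is part of why 2-bit accuracy is expected to suffer.
static void quantize_row_q2_0_reference(const float * x, block_q2_0 * y, int k) {
    assert(k % QK2_0 == 0);
    const int nb = k / QK2_0;

    for (int i = 0; i < nb; i++) {
        float amax = 0.0f;        // absolute max in this block
        for (int l = 0; l < QK2_0; l++) {
            const float v = fabsf(x[i*QK2_0 + l]);
            if (v > amax) amax = v;
        }

        const float d  = amax / ((1 << 1) - 1);   // Q4_0 formula, with 1 value bit
        const float id = d != 0.0f ? 1.0f/d : 0.0f;

        y[i].d = d;

        for (int l = 0; l < QK2_0; l += 4) {
            uint8_t b = 0;
            for (int j = 0; j < 4; j++) {
                const int q = (int)roundf(x[i*QK2_0 + l + j]*id) + 2;  // in {1, 2, 3}
                assert(q >= 1 && q <= 3);
                b |= (uint8_t)q << (2*j);
            }
            y[i].qs[l/4] = b;
        }
    }
}

// Reference scalar dequantization.
static void dequantize_row_q2_0(const block_q2_0 * x, float * y, int k) {
    assert(k % QK2_0 == 0);
    const int nb = k / QK2_0;

    for (int i = 0; i < nb; i++) {
        for (int l = 0; l < QK2_0; l++) {
            const int q = (x[i].qs[l/4] >> (2*(l % 4))) & 0x3;
            y[i*QK2_0 + l] = (q - 2)*x[i].d;
        }
    }
}
```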
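For the dot product, a plain scalar version would serve as the baseline that any SIMD ggml_vec_dot_q2() implementation has to match. The signature below is an assumption based on the sketch above, not the actual ggml interface:

```c
// Scalar sketch of the Q2_0 dot product; signature and types are assumptions
// based on the block_q2_0 layout sketched above.
static float vec_dot_q2_0_scalar(int n, const block_q2_0 * x, const block_q2_0 * y) {
    assert(n % QK2_0 == 0);
    const int nb = n / QK2_0;

    float sum = 0.0f;
    for (int i = 0; i < nb; i++) {
        int isum = 0;                           // integer dot product within the block
        for (int l = 0; l < QK2_0; l++) {
            const int qx = ((x[i].qs[l/4] >> (2*(l % 4))) & 0x3) - 2;
            const int qy = ((y[i].qs[l/4] >> (2*(l % 4))) & 0x3) - 2;
            isum += qx*qy;
        }
        sum += x[i].d * y[i].d * (float)isum;   // scale the per-block integer sum
    }
    return sum;
}
```

A SIMD version would presumably unpack the 2-bit values into bytes with shifts and masks and accumulate the products in integer registers, scaling by the block scales only once per block.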

The expected model sizes for 7B and QK == 16 are:

  • Q2_0 - 3.2 GB

For QK == 32 we have:

  • Q2_0 - 2.4 GB
  • Q2_1 - 3.2 GB
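These numbers roughly follow from the per-block storage cost. A back-of-the-envelope check, assuming ~7e9 quantized weights and an fp32 scale (plus an fp32 min for Q2_1) per block:

```c
// Illustrative size check only; assumes ~7e9 weights, fp32 scale, fp32 min for Q2_1.
#include <stdio.h>

int main(void) {
    const double n_weights = 7e9;
    const double GiB = 1024.0*1024.0*1024.0;

    // Q2_0, QK == 32: 32*2 bits of quants + 4-byte scale = 12 bytes per 32 weights
    printf("Q2_0, QK=32: %.1f GiB\n", n_weights/32.0*12.0/GiB);   // ~2.4
    // Q2_1, QK == 32: 12 bytes + 4-byte min = 16 bytes per 32 weights
    printf("Q2_1, QK=32: %.1f GiB\n", n_weights/32.0*16.0/GiB);   // ~3.3
    // Q2_0, QK == 16: 16*2 bits of quants + 4-byte scale = 8 bytes per 16 weights
    printf("Q2_0, QK=16: %.1f GiB\n", n_weights/16.0*8.0/GiB);    // ~3.3
    return 0;
}
```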

Before you send me papers showing that 2-bit quantization does not work - no need. I want to have this supported anyway. I have something in mind. The effort needed to add this support is so small that there is no reason not to do it.

ggerganov · Mar 24 '23 06:03