
Investigate the performance (speed and perplexity) of Q4_0 with 2x F16 factors


The current Q4_0 format uses a single F32 scaling factor per block of quantized values.
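For context, the existing block layout in ggml.c looks roughly like this (reproduced from memory of the upstream code, so field names and the exact definition may differ slightly between versions):

```c
#include <stdint.h>

#define QK 32  // number of weights per quantization block

// current Q4_0: one F32 scale per block of 32 weights
typedef struct {
    float   d;          // scaling factor (delta)
    uint8_t qs[QK / 2]; // 32 x 4-bit quants, two per byte
} block_q4_0;           // 4 + 16 = 20 bytes per 32 weights
```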

An idea was proposed by @ikawrakow to use 2x F16 factors instead of the single F32: https://github.com/ggerganov/llama.cpp/commit/679e1cb6c01b16abe4f3ee3c849813b98970df93

Initial results indicate that this might be as accurate as Q4_1 while hopefully remaining as fast as the current Q4_0.
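The exact meaning of the two factors is defined in the linked commit. Purely as an illustration of the storage budget, one possible layout (an assumption made for this sketch, not necessarily the committed one) keeps the same 20-byte block by splitting it into two 16-value halves, each with its own F16 scale; the struct name `block_q4_0_f16x2` is hypothetical:

```c
#include <stdint.h>

typedef uint16_t ggml_fp16_t; // half-precision storage type (ggml uses a uint16_t typedef)

// hypothetical 2x F16 variant: same 20-byte footprint as Q4_0,
// here assumed to use one scale per 16-value half of the block
typedef struct {
    ggml_fp16_t d[2];   // 2x F16 scaling factors
    uint8_t     qs[16]; // 32 x 4-bit quants, two per byte
} block_q4_0_f16x2;     // 2*2 + 16 = 20 bytes per 32 weights
```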

The goal of this task is to implement this data format efficiently (quantization, dequantization and dot product), measure the speed and perplexity, and decide whether it is viable. Depending on the results, we can consider updating the current Q4_0 data format and potentially dropping support for Q4_1.
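A scalar reference sketch of quantization and dequantization for such a two-scale block might look like the following. It follows the same rounding scheme as the current reference Q4_0 quantizer (scale = amax/7, quants shifted to [0, 15]) and keeps the scales as plain float so the example stays self-contained; the real format would round them to F16 via ggml's FP16 conversion helpers. This is only a sketch under the half-block assumption above, not the proposed implementation:

```c
#include <math.h>
#include <stdint.h>

// quantize one 32-value block into two 16-value halves, each with its own scale
void quantize_block_2scale(const float * x, float d[2], uint8_t qs[16]) {
    for (int h = 0; h < 2; ++h) {
        // find the largest magnitude in this half
        float amax = 0.0f;
        for (int i = 0; i < 16; ++i) {
            const float v = fabsf(x[16*h + i]);
            if (v > amax) amax = v;
        }
        d[h] = amax / 7.0f;                       // map [-amax, amax] to [-7, 7]
        const float id = d[h] ? 1.0f/d[h] : 0.0f;
        for (int i = 0; i < 16; i += 2) {
            // round, shift to [0, 15] and pack two 4-bit quants per byte
            const uint8_t q0 = (uint8_t)(roundf(x[16*h + i + 0]*id) + 8);
            const uint8_t q1 = (uint8_t)(roundf(x[16*h + i + 1]*id) + 8);
            qs[8*h + i/2] = q0 | (q1 << 4);
        }
    }
}

// reconstruct the 32 values from the packed quants and the two scales
void dequantize_block_2scale(const float d[2], const uint8_t qs[16], float * y) {
    for (int h = 0; h < 2; ++h) {
        for (int i = 0; i < 8; ++i) {
            const uint8_t b = qs[8*h + i];
            y[16*h + 2*i + 0] = ((int)(b & 0x0F) - 8) * d[h];
            y[16*h + 2*i + 1] = ((int)(b >>   4) - 8) * d[h];
        }
    }
}
```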

SIMD implementation progress

  • [x] ARM NEON
  • [x] AVX
  • [ ] WASM

I plan to work on the ARM NEON implementation. If you want to help with any of the other implementations, propose an implementation + results in a PR, summarizing the inference speed and the perplexity obtained with your implementation.

Related

  • #397
  • #896

ggerganov, Apr 15 '23 12:04