Investigate the performance (speed and perplexity) of Q4_0 with 2x F16 factors
The current Q4_0 uses a single F32 floating-point scaling factor per block.

An idea was proposed by @ikawrakow to change this to 2x F16 factors instead of 1x F32: https://github.com/ggerganov/llama.cpp/commit/679e1cb6c01b16abe4f3ee3c849813b98970df93
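For context, here is a rough sketch of how the two block layouts compare. The struct name block_q4_0_f16x2 and the per-half-block grouping of the two scales are assumptions for illustration; the exact grouping is the one in the linked commit.

```c
#include <stdint.h>

typedef uint16_t ggml_fp16_t;   // F16 storage type, as in ggml.h

#define QK 32                   // values per quantization block

// Current Q4_0: one F32 scale per block of 32 values.
typedef struct {
    float   d;                  // F32 scaling factor
    uint8_t qs[QK / 2];         // 32 x 4-bit quants packed into 16 bytes
} block_q4_0;                   // 4 + 16 = 20 bytes per 32 weights

// Hypothetical 2x F16 variant: two half-precision scales per block,
// e.g. one per group of 16 values, so the block stays the same size.
typedef struct {
    ggml_fp16_t d0;             // F16 scale, assumed to cover values 0..15
    ggml_fp16_t d1;             // F16 scale, assumed to cover values 16..31
    uint8_t     qs[QK / 2];     // same 16 bytes of 4-bit quants
} block_q4_0_f16x2;             // 2 + 2 + 16 = 20 bytes, same as Q4_0
```

The appeal is that the two F16 scales occupy the same 4 bytes as the single F32 scale, so the bits-per-weight stay the same while the scale granularity doubles.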
Initial results indicate that this might be as accurate as Q4_1 and hopefully as fast as the current Q4_0.
The goal of this task is to implement this data format efficiently (quantization, dequantization and dot product), measure the speed and perplexity, and decide whether it is viable. Depending on the results, we can think about updating the current Q4_0 data format and potentially dropping support for Q4_1.
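A scalar reference for the dequantization step might look as follows. This is only a sketch that reuses the hypothetical block_q4_0_f16x2 layout above and ggml's ggml_fp16_to_fp32 conversion helper; the nibble packing order (low nibbles = first half of the block) is an assumption.

```c
// Scalar dequantization sketch for the hypothetical 2x F16 layout:
// each 4-bit quant is mapped back to [-8, 7] and scaled by the F16
// factor of its half-block.
static void dequantize_row_q4_0_f16x2(const block_q4_0_f16x2 * x, float * y, int k) {
    const int nb = k / QK;

    for (int i = 0; i < nb; i++) {
        const float d0 = ggml_fp16_to_fp32(x[i].d0);
        const float d1 = ggml_fp16_to_fp32(x[i].d1);

        for (int j = 0; j < QK / 2; j++) {
            const uint8_t b = x[i].qs[j];

            // assumed packing: low nibbles hold the first half of the
            // block, high nibbles the second half
            y[i*QK + j]        = ((int8_t)(b & 0x0F) - 8) * d0;
            y[i*QK + QK/2 + j] = ((int8_t)(b >> 4)   - 8) * d1;
        }
    }
}
```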
SIMD implementation progress
- [x] ARM NEON
- [x] AVX
- [ ] WASM
I plan to work on the ARM NEON implementation. If you want to help with any of the others, propose an implementation + results in a PR, summarizing the inference speed and the perplexity you obtained.
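As a baseline for the SIMD versions, here is a scalar sketch of the dot product a kernel would need to reproduce, computed against a plain F32 vector for simplicity. The real ggml kernels quantize both operands and vectorize this loop, but the per-block arithmetic (one F16 scale applied per half-block) is the same.

```c
// Simplified scalar dot product of one q4_0_f16x2 row against an F32 row.
static float vec_dot_q4_0_f16x2_f32(int k, const block_q4_0_f16x2 * x, const float * y) {
    const int nb = k / QK;
    float sum = 0.0f;

    for (int i = 0; i < nb; i++) {
        const float d0 = ggml_fp16_to_fp32(x[i].d0);
        const float d1 = ggml_fp16_to_fp32(x[i].d1);

        float s0 = 0.0f, s1 = 0.0f;
        for (int j = 0; j < QK / 2; j++) {
            const uint8_t b = x[i].qs[j];
            s0 += ((int8_t)(b & 0x0F) - 8) * y[i*QK + j];
            s1 += ((int8_t)(b >> 4)   - 8) * y[i*QK + QK/2 + j];
        }

        // apply each half-block's scale once, outside the inner loop
        sum += d0*s0 + d1*s1;
    }

    return sum;
}
```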
Related
- #397
- #896