Quantization does not write the quantization version to `ftype`
Expected Behavior
When quantizing with llama.cpp, the quantization version should be written to the `ftype` in the hyperparameters.
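For context, the ggml convention (the `GGML_QNT_VERSION` and `GGML_QNT_VERSION_FACTOR` constants in `ggml.h`) packs the quantization version and the file type into the same `u32`; a minimal sketch of the encoding:

```cpp
#include <cstdint>

// Constants from ggml.h (GGML_QNT_VERSION was 2 at the time of this issue).
#define GGML_QNT_VERSION        2
#define GGML_QNT_VERSION_FACTOR 1000

// The value written to the ftype field should carry both the file type
// and the quantization version.
uint32_t encode_ftype(uint32_t ftype) {
    return ftype + GGML_QNT_VERSION * GGML_QNT_VERSION_FACTOR;
}
```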
Current Behavior
An `ftype` is produced by `llama_model_quantize_internal` and passed through as-is to `llama_file_saver`, which writes it to disk without encoding it using `GGML_QNT_VERSION`:
https://github.com/ggerganov/llama.cpp/blob/ac7876ac20124a15a44fd6317721ff1aa2538806/llama.cpp#L2052-L2068
https://github.com/ggerganov/llama.cpp/blob/ac7876ac20124a15a44fd6317721ff1aa2538806/llama.cpp#L557
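A minimal fix, sketched against the `write_hparams` in the first link above (untested, and the exact patch may differ):

```cpp
// In llama_file_saver::write_hparams, instead of writing the raw value:
//
//     file.write_u32(new_ftype);
//
// encode the quantization version per the ggml convention:
file.write_u32(new_ftype + GGML_QNT_VERSION * GGML_QNT_VERSION_FACTOR);
```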
Loaders that expect the quantization version, such as `llm`, detect a quantization version of 0:
Running `target/release/llm llama info -m models/llama/7B/koala-7B.ggmlv3.q5_1.bin`:

```
[2023-05-25T00:10:05Z INFO llm] Container type: Ggjt(3)
[2023-05-25T00:10:05Z INFO llm] Hyperparameters: Hyperparameters { n_vocab: 32000, n_embd: 4096, n_mult: 256, n_head: 32, n_layer: 32, n_rot: 128, file_type: FileType { format: MostlyQ5_1, quantization_version: 0 } }
[2023-05-25T00:10:05Z INFO llm] Vocabulary size: 32000
```
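The zero follows from how version-aware loaders decode the field: the version is recovered as `ftype / GGML_QNT_VERSION_FACTOR`, and the raw `llama_ftype` enum values are far below 1000 (`LLAMA_FTYPE_MOSTLY_Q5_1` is 9 at this commit), so an unencoded value always decodes to version 0. A minimal illustration:

```cpp
#include <cstdint>
#include <cstdio>

#define GGML_QNT_VERSION_FACTOR 1000

int main() {
    // The raw value llama.cpp currently writes for a q5_1 model
    // (LLAMA_FTYPE_MOSTLY_Q5_1 == 9 at this commit).
    uint32_t raw = 9;
    printf("quantization_version = %u\n", raw / GGML_QNT_VERSION_FACTOR); // 0
    printf("file_type            = %u\n", raw % GGML_QNT_VERSION_FACTOR); // 9 -> MostlyQ5_1
    return 0;
}
```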
Environment and Context
This was reproduced on https://github.com/ggerganov/llama.cpp/commit/ac7876ac20124a15a44fd6317721ff1aa2538806. I initially detected this when testing with one of the models on HuggingFace, then re-quantized a model locally to test it for myself.
Steps to Reproduce
- `make`
- `./quantize ggml-model-f16.bin ggml-model-f16-q4_0.bin q4_0`
- Check the `ftype` in the written hyperparameters (see the sketch below).
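For the last step, a minimal header dump, assuming the GGJT layout that `llama_file_loader::read_hparams` reads at this commit (magic, file version, then seven `u32` hyperparameters ending in `ftype`); the tool itself is hypothetical:

```cpp
#include <cstdint>
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s model.bin\n", argv[0]); return 1; }
    FILE * f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    // Assumed GGJT header layout: magic, file version, then
    // n_vocab, n_embd, n_mult, n_head, n_layer, n_rot, ftype.
    uint32_t hdr[9];
    if (fread(hdr, sizeof(uint32_t), 9, f) != 9) {
        fprintf(stderr, "short read\n");
        fclose(f);
        return 1;
    }
    fclose(f);

    const uint32_t ftype = hdr[8];
    printf("ftype field          = %u\n", ftype);
    printf("quantization_version = %u\n", ftype / 1000); // GGML_QNT_VERSION_FACTOR
    printf("file_type            = %u\n", ftype % 1000);
    return 0;
}
```

With the bug present, the quantization version prints as 0; once the `ftype` is encoded on write, it should print `GGML_QNT_VERSION` instead.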