Quantization does not write the quantization version to `ftype`
Expected Behavior
When quantizing with llama.cpp, the quantization version should be written to the `ftype` in the hyperparameters.
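For context, the ggml convention (the `GGML_QNT_VERSION` and `GGML_QNT_VERSION_FACTOR` constants in `ggml.h`) packs the quantization version and the file type into the same `u32`; a minimal sketch of the encoding:

```cpp
#include <cstdint>

// Constants from ggml.h (GGML_QNT_VERSION was 2 at the time of this issue).
#define GGML_QNT_VERSION        2
#define GGML_QNT_VERSION_FACTOR 1000

// The value written to the ftype field should carry both the file type
// and the quantization version.
uint32_t encode_ftype(uint32_t ftype) {
    return ftype + GGML_QNT_VERSION * GGML_QNT_VERSION_FACTOR;
}
```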
Current Behavior
An `ftype` is produced by `llama_model_quantize_internal` and passed through as-is to `llama_file_saver`, which writes it to disk without encoding it using `GGML_QNT_VERSION`:
https://github.com/ggerganov/llama.cpp/blob/ac7876ac20124a15a44fd6317721ff1aa2538806/llama.cpp#L2052-L2068
https://github.com/ggerganov/llama.cpp/blob/ac7876ac20124a15a44fd6317721ff1aa2538806/llama.cpp#L557
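A minimal fix, sketched against the `write_hparams` in the first link above (untested, and the exact patch may differ):

```cpp
// In llama_file_saver::write_hparams, instead of writing the raw value:
//
//     file.write_u32(new_ftype);
//
// encode the quantization version per the ggml convention:
file.write_u32(new_ftype + GGML_QNT_VERSION * GGML_QNT_VERSION_FACTOR);
```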
Loaders that expect the quantization version, such as `llm`, detect a quantization version of 0:
Running `target/release/llm llama info -m models/llama/7B/koala-7B.ggmlv3.q5_1.bin`:

```
[2023-05-25T00:10:05Z INFO llm] Container type: Ggjt(3)
[2023-05-25T00:10:05Z INFO llm] Hyperparameters: Hyperparameters { n_vocab: 32000, n_embd: 4096, n_mult: 256, n_head: 32, n_layer: 32, n_rot: 128, file_type: FileType { format: MostlyQ5_1, quantization_version: 0 } }
[2023-05-25T00:10:05Z INFO llm] Vocabulary size: 32000
```
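The zero follows from how version-aware loaders decode the field: the version is recovered as `ftype / GGML_QNT_VERSION_FACTOR`, and the raw `llama_ftype` enum values are far below 1000 (`LLAMA_FTYPE_MOSTLY_Q5_1` is 9 at this commit), so an unencoded value always decodes to version 0. A minimal illustration:

```cpp
#include <cstdint>
#include <cstdio>

#define GGML_QNT_VERSION_FACTOR 1000

int main() {
    // The raw value llama.cpp currently writes for a q5_1 model
    // (LLAMA_FTYPE_MOSTLY_Q5_1 == 9 at this commit).
    uint32_t raw = 9;
    printf("quantization_version = %u\n", raw / GGML_QNT_VERSION_FACTOR); // 0
    printf("file_type            = %u\n", raw % GGML_QNT_VERSION_FACTOR); // 9 -> MostlyQ5_1
    return 0;
}
```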
Environment and Context
This was reproduced on https://github.com/ggerganov/llama.cpp/commit/ac7876ac20124a15a44fd6317721ff1aa2538806. I initially detected this when testing with one of the models on HuggingFace, then re-quantized a model locally to test it for myself.
Steps to Reproduce
- `make`
- `./quantize ggml-model-f16.bin ggml-model-f16-q4_0.bin q4_0`
- Check the `ftype` in the written hyperparameters (see the sketch below).
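For the last step, a minimal header dump, assuming the GGJT layout that `llama_file_loader::read_hparams` reads at this commit (magic, file version, then seven `u32` hyperparameters ending in `ftype`); the tool itself is hypothetical:

```cpp
#include <cstdint>
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s model.bin\n", argv[0]); return 1; }
    FILE * f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    // Assumed GGJT header layout: magic, file version, then
    // n_vocab, n_embd, n_mult, n_head, n_layer, n_rot, ftype.
    uint32_t hdr[9];
    if (fread(hdr, sizeof(uint32_t), 9, f) != 9) {
        fprintf(stderr, "short read\n");
        fclose(f);
        return 1;
    }
    fclose(f);

    const uint32_t ftype = hdr[8];
    printf("ftype field          = %u\n", ftype);
    printf("quantization_version = %u\n", ftype / 1000); // GGML_QNT_VERSION_FACTOR
    printf("file_type            = %u\n", ftype % 1000);
    return 0;
}
```

With the bug present, the quantization version prints as 0; once the `ftype` is encoded on write, it should print `GGML_QNT_VERSION` instead.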