llama.cpp
Q4_1 inference appears broken for 13B parameters
I have been experimenting with q4_1 quantization (since some preliminary results suggest it should perform better), and noticed that something in the pipeline for the 13B parameter model is broken (whether it is the quantization itself, the saving, or the loading). This results in all inferred tokens coming out as #. Meanwhile, the 7B model works well.
I know we had a patch a while ago that first made the 13B+ models work for q4_0 - did the fixes it introduced not cover q4_1?
Yes, I use the 13B model and it doesn't work properly for either chatting or tasks. It simply ends without any response, which is very confusing to me.

First - great work!
Most likely the cause is that when I changed the Q4_0 scaling storage, I skipped the Q4_1 routines:
https://github.com/ggerganov/llama.cpp/commit/007a8f6f459c6eb56678fdee4c09219ddb85b640
This change was necessary to make the larger models work: when we merge rows from different shards, the scaling factors have to sit next to the integer quants. Originally, the scaling factors of a row ended up at the start of the memory buffer, before all the int quants, and this caused the merging in main.cpp to fail.
In short: the scaling and offset factors for a chunk need to be located right before its int quants in the memory buffer for the shards to merge correctly.
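
To make the layout concrete, here is a minimal sketch of the per-block format this implies, assuming ggml's conventions at the time (QK = 32 quants per block; the field names d, m, qs follow the style of ggml.c and are illustrative, not the verbatim source):

    #include <stdint.h>

    #define QK 32                  // number of quants per block

    // q4_1 block: scale and offset stored inline, right before the quants
    typedef struct {
        float   d;                 // scaling factor (delta)
        float   m;                 // offset (minimum value)
        uint8_t qs[QK / 2];        // 32 x 4-bit quants packed into 16 bytes
    } block_q4_1;

    // Because each block is self-contained, a row is just a contiguous
    // array of blocks, so concatenating row segments from different
    // shards is a plain memcpy. With the scales grouped at the start of
    // the row buffer instead, the merge would interleave scales and
    // quants incorrectly.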
P.S. Btw, it is strange that you don't see any asserts firing. Are you building with -DNDEBUG while developing? You should see infs or nans at the very start of the inference if you remove -DNDEBUG.
I didn't use -DNDEBUG. I am a newcomer to this; I just converted the 13B model using the latest code from the main branch, and no errors were reported during the conversion.
make -j && ./main -m ./models/13B/ggml-model-q4_0.bin -p "What is the best programming language in the world? Why?" -t 8 -n 512
When using the 7B model, it works fine.
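
(Note that the command above loads the q4_0 file; to exercise the q4_1 path, point -m at a q4_1 model. A hedged example follows, assuming the quantize tool's type argument of that era, where 3 selected q4_1; check quantize.cpp to confirm:)

    ./quantize ./models/13B/ggml-model-f16.bin ./models/13B/ggml-model-q4_1.bin 3
    ./main -m ./models/13B/ggml-model-q4_1.bin -p "What is the best programming language in the world? Why?" -t 8 -n 512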

In 13B Q4_1 mode, it now seems to work, so I might make a PR against the main repository.
Great, thank you!