llama.cpp
Q4_1 inference appears broken for 13B parameters
I have been experimenting with q4_1 quantization (since some preliminary results suggest it should perform better), and noticed that something in the pipeline for the 13B parameter model is broken (whether it is the quantization itself, the saving, or the loading). This results in all inferred tokens coming out as #. Meanwhile, the 7B model works well.
I know we had a patch a while ago that first made the 13B+ models work for q4_0 - did the fixes it introduced not cover q4_1?
Yes, I use the 13B model and it doesn't work properly for either chatting or tasks. It simply ends without any response, which is very confusing to me.

First - great work!
Most likely the cause is that when I changed the Q4_0 scaling storage, I skipped the Q4_1 routines:
https://github.com/ggerganov/llama.cpp/commit/007a8f6f459c6eb56678fdee4c09219ddb85b640
This change was necessary to make the larger models work: when we merge rows from different shards, the scaling factors have to sit next to the integer quants. Originally, the scaling factors of a row ended up at the start of the memory buffer, before all the int quants, and this caused the merging in main.cpp to fail.
In short: the scaling and offset factors for a chunk need to be located right before its int quants in the memory buffer for the shards to merge correctly.
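
To make the layout concrete, here is a minimal sketch of the per-block format this implies, assuming ggml's conventions at the time (QK = 32 quants per block; the field names d, m, qs follow the style of ggml.c and are illustrative, not the verbatim source):

    #include <stdint.h>

    #define QK 32                  // number of quants per block

    // q4_1 block: scale and offset stored inline, right before the quants
    typedef struct {
        float   d;                 // scaling factor (delta)
        float   m;                 // offset (minimum value)
        uint8_t qs[QK / 2];        // 32 x 4-bit quants packed into 16 bytes
    } block_q4_1;

    // Because each block is self-contained, a row is just a contiguous
    // array of blocks, so concatenating row segments from different
    // shards is a plain memcpy. With the scales grouped at the start of
    // the row buffer instead, the merge would interleave scales and
    // quants incorrectly.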
P.S. Btw, it is strange that you don't see any asserts firing. Are you building with -DNDEBUG while developing? You should see infs or nans at the very start of the inference if you remove -DNDEBUG.
I didn't use -DNDEBUG. I am a newcomer to this; I just converted the 13B model using the latest code from the main branch, and no errors were reported during the conversion.
make -j && ./main -m ./models/13B/ggml-model-q4_0.bin -p "What is the best programming language in the world? Why?" -t 8 -n 512
When using the 7B model, it works fine.
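
(Note that the command above loads the q4_0 file; to exercise the q4_1 path, point -m at a q4_1 model. A hedged example follows, assuming the quantize tool's type argument of that era, where 3 selected q4_1; check quantize.cpp to confirm:)

    ./quantize ./models/13B/ggml-model-f16.bin ./models/13B/ggml-model-q4_1.bin 3
    ./main -m ./models/13B/ggml-model-q4_1.bin -p "What is the best programming language in the world? Why?" -t 8 -n 512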

In 13B Q4_1 mode, it now seems to work, so I might make a PR against the main repository.
Great, thank you!