
New IQ1_S somehow much worse than previous version

Open CISC opened this issue 11 months ago • 26 comments

Since #5971 I tried requantizing the IQ1_S of this model using the same imatrix as before. However, where the following previously worked as expected 75% of the time (and the rest of the time it just gave the wrong output):

./main --log-disable --no-display-prompt -t 7 -ngl 35 -m gorilla-openfunctions-v2.IQ1_S.gguf --color -c 16384 --temp 0 -p "You are an AI programming assistant, utilizing the Gorilla LLM model, developed by Gorilla LLM, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer."$'\n''### Instruction: <<function>>[{"name":"get_current_weather","description":"Get the current weather in a given location","parameters":{"type":"object","properties":{"location":{"type":"string","description":"The city and state, e.g. San Francisco, CA",},"unit":{"type":"string","enum":["celsius","fahrenheit"]},},"required":["location"]}}]'$'\n'"<<question>>What's the weather like in Oslo?"$'\n'"### Response: "

The newly quantized version just outputs gibberish like this, every time:

45° CelsiusIEEEeqnarray---classvrtexmalinkmalinkndefinedndefinedndefinedndefined---Título:Taxonomia螃---Título:Taxonomia

This seems like a pretty massive regression; any idea what's going on?

CISC avatar Mar 11 '24 11:03 CISC

using the same imatrix

Try with a new imatrix?

BarfingLemurs avatar Mar 11 '24 13:03 BarfingLemurs

@BarfingLemurs That makes no sense, why would the imatrix need to be changed?

Anyway, I tried again with #5999 and it's no longer spouting gibberish. However, it's still worse than before; it's just outputting this now:

ndefinedError,"message":"No function named 'get_current_weather'."

Giving it variations of the same prompt makes it leap to various wrong conclusions, so it's not doing that great. The closest I can get it to a sensible answer (still very wrong, unlike the old IQ1_S) is with Stockholm instead of Oslo:

5023[{"name":"get_current_weather", "parameters": {"location": ["Stockholm, Sweden"], "unit": "celsius"}}]

CISC avatar Mar 11 '24 19:03 CISC

@ikawrakow Any feedback appreciated, I can provide you with whatever you need to help figure this out.

CISC avatar Mar 11 '24 19:03 CISC

Previous imatrix files are available here: https://github.com/ggerganov/llama.cpp/issues/5856#issuecomment-1986871620

I don't know if there will be issues running imatrix on GPUs, so I use the CPU backend.

BarfingLemurs avatar Mar 11 '24 21:03 BarfingLemurs

@CISC I'm unable to test this model. I cloned the model from [email protected]:gorilla-llm/gorilla-openfunctions-v2. My attempt to convert with the convert.py script was greeted with this message:

Traceback (most recent call last):
  File "/home/iwan/other/llama.cpp/convert.py", line 1466, in <module>
    main()
  File "/home/iwan/other/llama.cpp/convert.py", line 1460, in main
    OutputFile.write_all(outfile, ftype, params, model, vocab, special_vocab,
  File "/home/iwan/other/llama.cpp/convert.py", line 1117, in write_all
    check_vocab_size(params, vocab, pad_vocab=pad_vocab)
  File "/home/iwan/other/llama.cpp/convert.py", line 963, in check_vocab_size
    raise Exception(msg)
Exception: Vocab size mismatch (model has 102400, but ../hf/gorilla-openfunctions-v2 has 100016). Add the --pad-vocab option and try again.

I added the --pad-vocab option as suggested and it converted successfully. But when attempting to run an imatrix calculation, I get

compute_imatrix: tokenizing the input ..
terminate called after throwing an instance of 'std::out_of_range'
  what():  _Map_base::at
Aborted (core dumped)

It looks like something is not quite right with the vocabulary?

ikawrakow avatar Mar 12 '24 05:03 ikawrakow

But when attempting to run an imatrix calculation

Same for me with some DeepSeek-based models (Gorilla is based on DeepSeek as well). Inference with FP16 and Q8 works, but imatrix calculation and some other things result in the mentioned error. It might be related to #5464; the out-of-range error is also mentioned there.

BrickBee avatar Mar 12 '24 08:03 BrickBee

@ikawrakow All DeepSeek models require --pad-vocab, but I had no problems calculating an imatrix. In fact, I just tried again with the latest build and it still works fine, so that's pretty weird...

It seems the issue with IQ1_S is random, though, as today I'm getting gibberish again from the same IQ1_S model that "worked" yesterday. I tried without the GPU just to make sure it wasn't some CUDA issue, but got the same thing; very strange. All other quants work just fine.

I've uploaded my original gguf conversion here in case you want to test with that. Can be downloaded with

curl -L -O https://huggingface.co/CISCai/gorilla-openfunctions-v2-SOTA-GGUF/resolve/main/gorilla-openfunctions-v2.fp16.gguf

CISC avatar Mar 12 '24 10:03 CISC

Just to make sure nothing else is broken I also quickly requantized IQ2_XXS with the latest build and tested it, works perfectly:

 <<function>>get_current_weather(location="Oslo")

CISC avatar Mar 12 '24 10:03 CISC

@CISC

  • Downloaded the fp16 GGUF from the link you provided
  • Ran ./bin/imatrix -m ../models/gorilla/ggml-model-f16.gguf -t 1 -ngl 100 -f ../tests/wiki.train.raw --chunks 100 -o gorilla_imatrix.dat
  • Ran ./bin/quantize --imatrix gorilla_imatrix.dat ../models/gorilla/ggml-model-f16.gguf iq1s.gguf iq1_s

After that, I get the exact same response from the IQ1_S model as from the fp16 model:

./bin/main --log-disable --no-display-prompt -t 1 -ngl 100 -m iq1s.gguf --color --temp 0 -p "You are an AI programming assistant, utilizing the Gorilla LLM model, developed by Gorilla LLM, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer."$'\n''### Instruction: <<function>>[{"name":"get_current_weather","description":"Get the current weather in a given location","parameters":{"type":"object","properties":{"location":{"type":"string","description":"The city and state, e.g. San Francisco, CA",},"unit":{"type":"string","enum":["celsius","fahrenheit"]},},"required":["location"]}}]'$'\n'"<<question>>What's the weather like in Oslo?"$'\n'"### Response: "

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
 <<function>>get_current_weather(location='Oslo')

Also got the same response running on the CPU (AVX2).

WikiText2 PPL is 11.09 for fp16 and 21.09 for IQ1_S, so fully in line with what I would expect for a 7B model (~2X increase in PPL compared to fp16).
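As a quick sanity check on that ~2X claim (computed from the numbers above; this calculation is an editorial addition, not part of the original run):

```python
# Perplexity figures quoted above for gorilla-openfunctions-v2 on WikiText2.
ppl_fp16 = 11.09
ppl_iq1s = 21.09

# Ratio of quantized to full-precision perplexity.
ratio = ppl_iq1s / ppl_fp16
print(f"PPL ratio: {ratio:.2f}x")  # ~1.90x, close to the ~2X mentioned
```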

The model behavior typically does not depend on how the model slept, when it got up in the morning, did it have coffee, etc. Hence, given the random behavior you are observing, something is not quite right in your setup.

ikawrakow avatar Mar 12 '24 11:03 ikawrakow

@ikawrakow That's what's so weird, why is it only affecting IQ1_S? As I said, all other quants are working fine, even after requantizing with latest build. I've even made sure to do a make clean, just in case...

CISC avatar Mar 12 '24 11:03 CISC

I'm getting gibberish with your imatrix too (5023tessa-147890tessa-147890tessa-147890 is the response in my case). I also get gibberish from another WikiText2 imatrix that uses 1000 chunks.

IQ1_S is not really meant for serious use. It is there to satisfy people's curiosity about 1-bit models that are being hyped around the Internet. As far as I can tell, IQ1_S outperforms by a significant margin everything else sub 2-bit that has been published (as measured by PPL). This should give you a concept of what you can expect from other sub 2-bit quantizations.

ikawrakow avatar Mar 12 '24 12:03 ikawrakow

@ikawrakow Now we're getting somewhere. I first tried just regenerating the imatrix the same way I did originally (just to make sure there was nothing wrong with it, as suggested by @BarfingLemurs), but while it generated completely different values in the imatrix (is there some randomness to the generation?), the resulting quantization remained the same. Then, after you mentioned that smaller chunks of data made a difference, I tried again with -c 4096 --chunks 10 and now the resulting quantization is starting to make some sense (but is still a little off). This is with What's the weather like in Oslo and Stockholm?:

5023ikipediaAI(get_current_weather)<<function>>get_current_weather(location='Oslo, Norway')<<function>>get_current_weather(location='Stockholm, Sweden')

I'm wondering if there's something wrong with the imatrix application of IQ1_S, especially when the imatrix has been generated over larger amounts of data?

CISC avatar Mar 12 '24 13:03 CISC

@ikawrakow I've been digging through the IQ1_S quantizing functions and made the following changes, which seem to fix the problem:

diff --git a/ggml-quants.c b/ggml-quants.c
index 06665eb2..936f9122 100644
--- a/ggml-quants.c
+++ b/ggml-quants.c
@@ -11539,6 +11539,7 @@ static void quantize_row_iq1_s_impl(const float * restrict x, void * restrict vy
 
     float  scales[QK_K/IQ1S_BLOCK_SIZE];
     float  weight[IQ1S_BLOCK_SIZE];
+    float  waux[IQ1S_BLOCK_SIZE];
     int8_t L[IQ1S_BLOCK_SIZE];
     float  sumx[IQ1S_BLOCK_SIZE+1];
     float  sumw[IQ1S_BLOCK_SIZE+1];
@@ -11558,12 +11559,13 @@ static void quantize_row_iq1_s_impl(const float * restrict x, void * restrict vy
         const float * xbl = x + QK_K*ibl;
         float sumx2 = 0;
         for (int i = 0; i < QK_K; ++i) sumx2 += xbl[i]*xbl[i];
-        float sigma2 = 2*sumx2/QK_K;
+        float sigma2 = sumx2/QK_K;
 
         for (int ib = 0; ib < QK_K/IQ1S_BLOCK_SIZE; ++ib) {
             const float * xb = xbl + IQ1S_BLOCK_SIZE*ib;
             const float * qw = quant_weights + QK_K*ibl + IQ1S_BLOCK_SIZE*ib;
             for (int i = 0; i < IQ1S_BLOCK_SIZE; ++i) weight[i] = qw[i] * sqrtf(sigma2 + xb[i]*xb[i]);
+            for (int i = 0; i < IQ1S_BLOCK_SIZE; ++i) waux[i] = sqrtf(weight[i]);
             float max = fabsf(xb[0]);
             for (int i = 1; i < IQ1S_BLOCK_SIZE; ++i) max = MAX(max, fabsf(xb[i]));
             if (!max) {
@@ -11625,7 +11627,7 @@ static void quantize_row_iq1_s_impl(const float * restrict x, void * restrict vy
                 if (grid_index < 0) {
                     all_on_grid = false;
                     const uint16_t * neighbours = kneighbors_q2xs - kmap_q2xs[u] - 1;
-                    grid_index = iq1_find_best_neighbour2(neighbours, kgrid_q2xs, xb + 8*k, weight + 8*k, scale, xx, L + 8*k, NGRID_IQ1S);
+                    grid_index = iq1_find_best_neighbour2(neighbours, kgrid_q2xs, xb + 8*k, waux + 8*k, scale, xx, L + 8*k, NGRID_IQ1S);
                     GGML_ASSERT(grid_index >= 0);
                 }
                 index[k] = grid_index;

If you concur I will submit a PR.
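For readers unfamiliar with the quantization internals, here is a hypothetical standalone Python sketch of the sub-block weight computation this diff modifies (names mirror the C source, but it is an illustration only, not the actual ggml-quants.c code):

```python
import math

# Constants mirroring the C source (QK_K super-block of 256 values,
# IQ1S sub-blocks of 32) -- values assumed for illustration.
QK_K = 256
IQ1S_BLOCK_SIZE = 32

def iq1s_block_weights(xbl, quant_weights):
    """Per-element weights for one QK_K super-block, as in the patched code.

    xbl:            QK_K floats (the values being quantized)
    quant_weights:  QK_K floats (imatrix importance weights)
    Returns a list of (weight, waux) pairs, one per sub-block.
    """
    sumx2 = sum(x * x for x in xbl)
    sigma2 = sumx2 / QK_K  # the patch drops the factor of 2 used on master
    blocks = []
    for ib in range(QK_K // IQ1S_BLOCK_SIZE):
        xb = xbl[IQ1S_BLOCK_SIZE * ib : IQ1S_BLOCK_SIZE * (ib + 1)]
        qw = quant_weights[IQ1S_BLOCK_SIZE * ib : IQ1S_BLOCK_SIZE * (ib + 1)]
        weight = [q * math.sqrt(sigma2 + x * x) for q, x in zip(qw, xb)]
        # the patch passes sqrt(weight) (waux) to the neighbour search
        # instead of weight itself
        waux = [math.sqrt(w) for w in weight]
        blocks.append((weight, waux))
    return blocks
```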

CISC avatar Mar 14 '24 12:03 CISC

@CISC I specifically made the code the way it is because it gives a lower PPL for the 9 models I'm testing. I'm traveling for a few days without access to the computer where I keep my research notes. Let me get back first and see what difference each of these changes makes (unless you want to run PPL for all 7 LLaMAs, Mistral, and Mixtral-8x7B yourself).

ikawrakow avatar Mar 15 '24 09:03 ikawrakow

@CISC I'm unable to test this model. I cloned the model from [email protected]:gorilla-llm/gorilla-openfunctions-v2. My attempt to convert with the convert.py script was greeted with this message:

Traceback (most recent call last):
  File "/home/iwan/other/llama.cpp/convert.py", line 1466, in <module>
    main()
  File "/home/iwan/other/llama.cpp/convert.py", line 1460, in main
    OutputFile.write_all(outfile, ftype, params, model, vocab, special_vocab,
  File "/home/iwan/other/llama.cpp/convert.py", line 1117, in write_all
    check_vocab_size(params, vocab, pad_vocab=pad_vocab)
  File "/home/iwan/other/llama.cpp/convert.py", line 963, in check_vocab_size
    raise Exception(msg)
Exception: Vocab size mismatch (model has 102400, but ../hf/gorilla-openfunctions-v2 has 100016). Add the --pad-vocab option and try again.

I added the --pad-vocab option as suggested and it converted successfully. But when attempting to run an imatrix calculation, I get

compute_imatrix: tokenizing the input ..
terminate called after throwing an instance of 'std::out_of_range'
  what():  _Map_base::at
Aborted (core dumped)

It looks like something is not quite right with the vocabulary?

@ikawrakow Hello, I got the same "Aborted (core dumped)" error with a DeepSeek-Coder model quantized to Q5_0; did you solve it? On another platform the reason was that the CPU does not support AVX2 and similar instructions.

hyperbolic-c avatar Mar 15 '24 09:03 hyperbolic-c

@ikawrakow It's probably best that you run the tests to ensure all the variables are the same (and that I haven't made a mistake). I can wait. :)

CISC avatar Mar 15 '24 11:03 CISC

It looks like llama.cpp support for DeepSeek-Coder is coming soon, see #5981.

hyperbolic-c avatar Mar 15 '24 12:03 hyperbolic-c

But when attempting to run an imatrix calculation

Same for me with some DeepSeek-based models, which Gorilla is based on. Inference for FP16 and Q8 works, but imatrix calculation and some other things result in the mentioned error. It might be related to #5464, the out-of-range error is also mentioned there.

@BrickBee Hello, yes, I hit the same problem with a DeepSeek-Coder model. Have you solved it?

hyperbolic-c avatar Mar 16 '24 05:03 hyperbolic-c

@hyperbolic-c Ah, I remember I also had to use --vocab-type bpe when converting otherwise it would choose the wrong tokenizer.

CISC avatar Mar 17 '24 12:03 CISC

@hyperbolic-c Ah, I remember I also had to use --vocab-type bpe when converting otherwise it would choose the wrong tokenizer.

@CISC Yes, but then another weird error appears, like the one @ikawrakow showed:

compute_imatrix: tokenizing the input ..
terminate called after throwing an instance of 'std::out_of_range'
  what():  _Map_base::at
Aborted (core dumped)

hyperbolic-c avatar Mar 18 '24 11:03 hyperbolic-c

@hyperbolic-c Did you try again after converting with the right tokenizer? It worked for me, and for @ikawrakow when using my converted GGUF.

If it still doesn't work for you, perhaps you should open another issue?

CISC avatar Mar 18 '24 11:03 CISC

@CISC

Here is a table that compares PPL between master and your proposed changes. To not complicate things, values are computed with the default rms_norm_epsilon.

Model          PPL (master)   PPL (proposed)
LLaMA-v1-7B    13.9500        14.0957
LLaMA-v2-7B    13.0038        13.4071
Mistral-7B     10.4249        10.5174
LLaMA-v1-13B   8.7396         8.7304
LLaMA-v2-13B   7.8220         7.9472

Based on this, I think we need more evidence that the proposed change is better.
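Expressed as relative changes (computed here from the table's numbers; an editorial addition, not part of the original comment):

```python
# PPL values from the table above: (master, proposed).
ppl = {
    "LLaMA-v1-7B":  (13.9500, 14.0957),
    "LLaMA-v2-7B":  (13.0038, 13.4071),
    "Mistral-7B":   (10.4249, 10.5174),
    "LLaMA-v1-13B": (8.7396, 8.7304),
    "LLaMA-v2-13B": (7.8220, 7.9472),
}

for model, (master, proposed) in ppl.items():
    delta = 100.0 * (proposed - master) / master
    print(f"{model:13s} {delta:+.2f}%")
```

Only LLaMA-v1-13B improves (about -0.1%); LLaMA-v2-7B regresses the most (about +3.1%).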

ikawrakow avatar Mar 20 '24 07:03 ikawrakow

@hyperbolic-c Did you try again after converting with the right tokenizer? It worked for me, and for @ikawrakow when using my converted GGUF.

If it still doesn't work for you, perhaps you should open another issue?

@CISC Thanks. It did not work for the DeepSeek-Coder model. Maybe llama.cpp does not fully support DeepSeek models yet (see #5981).

hyperbolic-c avatar Mar 20 '24 07:03 hyperbolic-c

@ikawrakow Interesting; apart from LLaMA-v2-7B the difference wasn't that big, though.

However, the difference in actual output with my imatrix on the gorilla model is night and day, from gibberish to completely correct, so something is obviously going on. Given that it seems to be a matter of how much data was used to generate the imatrix, I'm inclined to believe the PPL degradation is coincidental (or rather that the previous PPL may have accidentally been better than it should be), or of course that there's still something not quite right even after my changes. :)

Either way, I agree that it needs to be looked at more closely, but IQ1_S definitely does not work as intended as-is.

CISC avatar Mar 20 '24 09:03 CISC

@ikawrakow I don't know if you've had time to look at this, but I've been trying to determine whether my changes have any real-world adverse impact on various models, and so far everything looks good. However, it's difficult to determine exactly what kind of effect this would have and what to look for, so it's hard to get a definitive answer.

I'm thinking it might make sense to create a draft PR and invite a few of the usual suspects who publish IQ1_S quants on HF? If nothing else, we might be able to start a discussion and organize some testing.

CISC avatar Mar 24 '24 09:03 CISC

Sure, submit a PR and have some other people test it.

ikawrakow avatar Mar 24 '24 17:03 ikawrakow

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar May 09 '24 01:05 github-actions[bot]