New IQ1_S somehow much worse than previous version
Since #5971 I have tried requantizing IQ1_S of this model using the same imatrix as before. Previously, the following worked as expected 75% of the time (and the rest of the time it just gave the wrong output):
./main --log-disable --no-display-prompt -t 7 -ngl 35 -m gorilla-openfunctions-v2.IQ1_S.gguf --color -c 16384 --temp 0 -p "You are an AI programming assistant, utilizing the Gorilla LLM model, developed by Gorilla LLM, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer."$'\n''### Instruction: <<function>>[{"name":"get_current_weather","description":"Get the current weather in a given location","parameters":{"type":"object","properties":{"location":{"type":"string","description":"The city and state, e.g. San Francisco, CA",},"unit":{"type":"string","enum":["celsius","fahrenheit"]},},"required":["location"]}}]'$'\n'"<<question>>What's the weather like in Oslo?"$'\n'"### Response: "
The newly quantized version just outputs gibberish like this, every time:
45° CelsiusIEEEeqnarray---classvrtexmalinkmalinkndefinedndefinedndefinedndefined---Título:Taxonomia螃---Título:Taxonomia
Which seems like a pretty massive regression; any idea what's going on?
@BarfingLemurs That makes no sense; why would the imatrix need to be changed?
Anyway, I tried again with #5999 and it's no longer spouting gibberish; however, it's still worse than before. It's just outputting this now:
ndefinedError,"message":"No function named 'get_current_weather'."
Giving it variations of the same prompt makes it leap to various wrong conclusions, so it's not doing that great. The closest I can get it to something reasonable (but still very wrong, unlike the old IQ1_S) is with Stockholm instead of Oslo:
5023[{"name":"get_current_weather", "parameters": {"location": ["Stockholm, Sweden"], "unit": "celsius"}}]
@ikawrakow Any feedback appreciated, I can provide you with whatever you need to help figure this out.
The previous imatrix files are available here: https://github.com/ggerganov/llama.cpp/issues/5856#issuecomment-1986871620
I don't know if there will be issues running imatrix on GPUs, so I use the CPU backend.
@CISC I'm unable to test this model. I cloned the model from [email protected]:gorilla-llm/gorilla-openfunctions-v2. My attempt to convert with the convert.py script was greeted with this message:
Traceback (most recent call last):
File "/home/iwan/other/llama.cpp/convert.py", line 1466, in <module>
main()
File "/home/iwan/other/llama.cpp/convert.py", line 1460, in main
OutputFile.write_all(outfile, ftype, params, model, vocab, special_vocab,
File "/home/iwan/other/llama.cpp/convert.py", line 1117, in write_all
check_vocab_size(params, vocab, pad_vocab=pad_vocab)
File "/home/iwan/other/llama.cpp/convert.py", line 963, in check_vocab_size
raise Exception(msg)
Exception: Vocab size mismatch (model has 102400, but ../hf/gorilla-openfunctions-v2 has 100016). Add the --pad-vocab option and try again.
I added the --pad-vocab option as suggested and it converted successfully. But when attempting to run an imatrix calculation, I get
compute_imatrix: tokenizing the input ..
terminate called after throwing an instance of 'std::out_of_range'
what(): _Map_base::at
Aborted (core dumped)
It looks like something is not quite right with the vocabulary?
But when attempting to run an imatrix calculation
Same for me with some DeepSeek-based models, which Gorilla is based on. Inference for FP16 and Q8 works, but imatrix calculation and some other things result in the mentioned error. It might be related to #5464, the out-of-range error is also mentioned there.
@ikawrakow All DeepSeek models require --pad-vocab, but I had no problems calculating an imatrix; in fact, I just tried again with the latest build and it still works fine, so that's pretty weird...
It seems the issue with IQ1_S is random, though, as today I'm getting gibberish again from the same IQ1_S model that "worked" yesterday. I tried without the GPU just to make sure it wasn't some CUDA issue, but got the same result; very strange. All other quants work just fine.
I've uploaded my original GGUF conversion here in case you want to test with that. It can be downloaded with
curl -L -O https://huggingface.co/CISCai/gorilla-openfunctions-v2-SOTA-GGUF/resolve/main/gorilla-openfunctions-v2.fp16.gguf
Just to make sure nothing else is broken, I also quickly requantized IQ2_XXS with the latest build and tested it; it works perfectly:
<<function>>get_current_weather(location="Oslo")
@CISC
- Downloaded the fp16 GGUF from the link you provided
- Ran ./bin/imatrix -m ../models/gorilla/ggml-model-f16.gguf -t 1 -ngl 100 -f ../tests/wiki.train.raw --chunks 100 -o gorilla_imatrix.dat
- Ran ./bin/quantize --imatrix gorilla_imatrix.dat ../models/gorilla/ggml-model-f16.gguf iq1s.gguf iq1_s
After that, I get the exact same response from the IQ1_S model as from the fp16 model:
./bin/main --log-disable --no-display-prompt -t 1 -ngl 100 -m iq1s.gguf --color --temp 0 -p "You are an AI programming assistant, utilizing the Gorilla LLM model, developed by Gorilla LLM, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer."$'\n''### Instruction: <<function>>[{"name":"get_current_weather","description":"Get the current weather in a given location","parameters":{"type":"object","properties":{"location":{"type":"string","description":"The city and state, e.g. San Francisco, CA",},"unit":{"type":"string","enum":["celsius","fahrenheit"]},},"required":["location"]}}]'$'\n'"<<question>>What's the weather like in Oslo?"$'\n'"### Response: "
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
<<function>>get_current_weather(location='Oslo')
Also got the same response running on the CPU (AVX2).
WikiText2 PPL is 11.09 for fp16 and 21.09 for IQ1_S, so fully in line with what I would expect for a 7B model (~2X increase in PPL compared to fp16).
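As a quick sanity check of the ~2X figure, a trivial sketch using the numbers above:

```python
# PPL values quoted above for this model on WikiText2
ppl_fp16 = 11.09
ppl_iq1s = 21.09

ratio = ppl_iq1s / ppl_fp16
print(f"PPL ratio: {ratio:.2f}x")  # → PPL ratio: 1.90x
```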
The model behavior typically does not depend on how the model slept, when it got up in the morning, whether it had coffee, etc. Hence, given the random behavior you are observing, something is not quite right in your setup.
@ikawrakow That's what's so weird: why is it only affecting IQ1_S? As I said, all other quants are working fine, even after requantizing with the latest build. I've even made sure to do a make clean, just in case...
I'm getting gibberish with your imatrix too (5023tessa-147890tessa-147890tessa-147890 is the response in my case). I also get gibberish from another WikiText2 imatrix that uses 1000 chunks.
IQ1_S is not really meant for serious use. It is there to satisfy people's curiosity about the 1-bit models that are being hyped around the Internet. As far as I can tell, IQ1_S outperforms by a significant margin everything else sub-2-bit that has been published (as measured by PPL). This should give you an idea of what you can expect from other sub-2-bit quantizations.
@ikawrakow Now we're getting somewhere. I first tried just regenerating the imatrix the same way I did originally (to make sure there was nothing wrong with it, as suggested by @BarfingLemurs), but while it did generate completely different values in the imatrix (is there some randomness to the generation?), the resulting quantization remained the same. Then, after you mentioned that smaller chunks of data made a difference, I tried again with -c 4096 --chunks 10, and now the resulting quantization is starting to make some sense (but is still a little off). This is with What's the weather like in Oslo and Stockholm?:
5023ikipediaAI(get_current_weather)<<function>>get_current_weather(location='Oslo, Norway')<<function>>get_current_weather(location='Stockholm, Sweden')
I'm wondering if there's something wrong with how the imatrix is applied for IQ1_S, especially when the imatrix has been generated over larger amounts of data?
@ikawrakow I've been digging through the IQ1_S quantization functions and made the following changes, which seem to fix the problem:
diff --git a/ggml-quants.c b/ggml-quants.c
index 06665eb2..936f9122 100644
--- a/ggml-quants.c
+++ b/ggml-quants.c
@@ -11539,6 +11539,7 @@ static void quantize_row_iq1_s_impl(const float * restrict x, void * restrict vy
float scales[QK_K/IQ1S_BLOCK_SIZE];
float weight[IQ1S_BLOCK_SIZE];
+ float waux[IQ1S_BLOCK_SIZE];
int8_t L[IQ1S_BLOCK_SIZE];
float sumx[IQ1S_BLOCK_SIZE+1];
float sumw[IQ1S_BLOCK_SIZE+1];
@@ -11558,12 +11559,13 @@ static void quantize_row_iq1_s_impl(const float * restrict x, void * restrict vy
const float * xbl = x + QK_K*ibl;
float sumx2 = 0;
for (int i = 0; i < QK_K; ++i) sumx2 += xbl[i]*xbl[i];
- float sigma2 = 2*sumx2/QK_K;
+ float sigma2 = sumx2/QK_K;
for (int ib = 0; ib < QK_K/IQ1S_BLOCK_SIZE; ++ib) {
const float * xb = xbl + IQ1S_BLOCK_SIZE*ib;
const float * qw = quant_weights + QK_K*ibl + IQ1S_BLOCK_SIZE*ib;
for (int i = 0; i < IQ1S_BLOCK_SIZE; ++i) weight[i] = qw[i] * sqrtf(sigma2 + xb[i]*xb[i]);
+ for (int i = 0; i < IQ1S_BLOCK_SIZE; ++i) waux[i] = sqrtf(weight[i]);
float max = fabsf(xb[0]);
for (int i = 1; i < IQ1S_BLOCK_SIZE; ++i) max = MAX(max, fabsf(xb[i]));
if (!max) {
@@ -11625,7 +11627,7 @@ static void quantize_row_iq1_s_impl(const float * restrict x, void * restrict vy
if (grid_index < 0) {
all_on_grid = false;
const uint16_t * neighbours = kneighbors_q2xs - kmap_q2xs[u] - 1;
- grid_index = iq1_find_best_neighbour2(neighbours, kgrid_q2xs, xb + 8*k, weight + 8*k, scale, xx, L + 8*k, NGRID_IQ1S);
+ grid_index = iq1_find_best_neighbour2(neighbours, kgrid_q2xs, xb + 8*k, waux + 8*k, scale, xx, L + 8*k, NGRID_IQ1S);
GGML_ASSERT(grid_index >= 0);
}
index[k] = grid_index;
If you concur I will submit a PR.
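For readability, here is a rough Python sketch of what the per-block weighting looks like with the proposed changes applied. Variable names follow the diff above; the grid search and scale fitting are omitted, and the helper function itself is hypothetical:

```python
import math

def iq1s_block_weights(xbl, xb, qw):
    """Sketch of the per-block importance weights from the diff above.

    xbl: the full QK_K superblock; xb: one IQ1S_BLOCK_SIZE sub-block of it;
    qw: the matching imatrix (quant_weights) entries for that sub-block.
    """
    sumx2 = sum(x * x for x in xbl)
    sigma2 = sumx2 / len(xbl)  # proposed change: was 2*sumx2/QK_K on master
    weight = [q * math.sqrt(sigma2 + x * x) for q, x in zip(qw, xb)]
    # proposed waux: sqrt of the weights, passed to the neighbour search
    # instead of the weights themselves
    waux = [math.sqrt(w) for w in weight]
    return weight, waux
```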
@CISC I specifically made the code the way it is because it gives a lower PPL for the 9 models I'm testing. I'm traveling for a few days without access to the computer where I keep my research notes. Let me get back first and see what difference each one of these makes (unless you want to run PPL for all 7 LLaMAs, Mistral, and Mixtral-8x7B).
@ikawrakow Hello, I got the same "Aborted (core dumped)" error with the DeepSeek-Coder model at Q5_0 quantization. Did you solve it? On another platform, the reason was that the CPU could not compute with AVX2 etc.
@ikawrakow It's probably best that you run the tests to ensure all the variables are the same (and that I haven't made a mistake). I can wait. :)
It looks like llama.cpp support for DeepSeek-Coder is coming soon, see #5981.
@BrickBee Hello, yes, I went through the same process with the DeepSeek-Coder model. Have you solved it?
@hyperbolic-c Ah, I remember I also had to use --vocab-type bpe when converting, otherwise it would choose the wrong tokenizer.
@CISC Yes, but another error is weird, similar to what @ikawrakow showed:
compute_imatrix: tokenizing the input ..
terminate called after throwing an instance of 'std::out_of_range'
what(): _Map_base::at
Aborted (core dumped)
@hyperbolic-c Did you try again after converting with the right tokenizer? It worked for me, and for @ikawrakow when using my converted GGUF.
If it still doesn't work for you, perhaps you should open another issue?
@CISC Here is a table that compares PPL between master and your proposed changes. To not complicate things, values are computed with the default rms_norm_epsilon.
| Model | PPL (master) | PPL (proposed) |
|---|---|---|
| LLaMA-v1-7B | 13.9500 | 14.0957 |
| LLaMA-v2-7B | 13.0038 | 13.4071 |
| Mistral-7B | 10.4249 | 10.5174 |
| LLaMA-v1-13B | 8.7396 | 8.7304 |
| LLaMA-v2-13B | 7.8220 | 7.9472 |
Based on this, I think we need more evidence that the proposed change is better.
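For reference, the relative change in each row works out as follows (a small sketch using only the numbers from the table above):

```python
# PPL values copied from the table above, as (master, proposed) pairs
ppl = {
    "LLaMA-v1-7B":  (13.9500, 14.0957),
    "LLaMA-v2-7B":  (13.0038, 13.4071),
    "Mistral-7B":   (10.4249, 10.5174),
    "LLaMA-v1-13B": (8.7396, 8.7304),
    "LLaMA-v2-13B": (7.8220, 7.9472),
}

for model, (master, proposed) in ppl.items():
    delta = 100 * (proposed - master) / master
    print(f"{model}: {delta:+.2f}%")  # e.g. LLaMA-v1-7B: +1.04%
```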
@CISC Thanks. It did not work for the DeepSeek-Coder model. Maybe llama.cpp does not fully support DeepSeek models yet (see #5981).
@ikawrakow Interesting; apart from LLaMA-v2-7B it wasn't much of a difference, though.
However, the difference in actual output with my imatrix on the gorilla model is night and day, from gibberish to completely correct, so something is obviously going on. Given that it seems to be a matter of how much data has been used to generate the imatrix, I'm inclined to believe the PPL degradation is coincidental (or rather that the previous PPL might have accidentally been better than it should be), or of course that there's still something not quite right, even after my changes. :)
Either way, I agree that it needs to be looked at more closely, but IQ1_S definitely does not work as intended as-is.
@ikawrakow I don't know if you've had time to look at this or not, but I've been trying to determine whether my changes have any real-world adverse impact with various models, and so far everything looks good. However, it's difficult to determine exactly what kind of effect this would have and what to look for, so it's hard to get a definitive answer.
I'm thinking it might make sense to create a draft PR and invite a few of the usual suspects who publish IQ1_S quants on HF? If nothing else, we might be able to start a discussion and organize some testing.
Sure, submit a PR and have some other people test it.
This issue was closed because it has been inactive for 14 days since being marked as stale.