llama.cpp
GGML_ASSERT: ../llama.cpp/ggml-quants.c:10340: grid_index >= 0
I am converting many models to GGUF and quantizing them afterwards (with an imatrix). Today, all my jobs failed with the error from the title (many output lines, probably one per thread).
This might be coincidental, but I also upgraded from a March 6th checkout to b2405.
Example model:
https://huggingface.co/Sao10K/Solstice-Mixtral-v1
imatrix file: https://huggingface.co/mradermacher/Solstice-Mixtral-v1-i1-GGUF/blob/main/imatrix.dat
Failing invocation:
quantize --allow-requantize --leave-output-tensor --imatrix imatrix.dat Mythical-Destroyer-V2-L2-13B.gguf IQ2_M
The normal Q-quants quantized fine (see https://huggingface.co/mradermacher/Solstice-Mixtral-v1-i1-GGUF); it only fails when it moves on to IQ2_M (that's the first I-quant my script generates).
Actually, by now, only 5 of the 8 jobs failed, so it clearly depends on the model and is not universal.
Using your imatrix and the vanilla Mixtral Instruct 0.1, it also asserts on an M2 Ultra. If I use the imatrix from https://huggingface.co/datasets/ikawrakow/imatrix-from-wiki-train/blob/main/mixtral-8x7b-instruct-v0.1.imatrix, it does not assert. Not sure what this means, but I'm providing this data point to help investigate.
Just noticed this during another imatrix calculation:
[20]17.0770,[21]17.3345,[22]17.2957,[23]17.5153,[24]17.4855,[25]17.4871,[26]17.4550,[27]17.4515,[28]17.3787,[29]17.2030,[30]17.1163,[31]16.9776,[32]16.9705,[33]16.8447,[34]16.7977,[35]16.6126,[36]16.4577,[37]16.3338,[38]16.1736,[39]16.2237, save_imatrix: stored collected data after 40 chunks in miqu-1-120b.imatrix~ [40]16.3695,[41]16.2985,[42]16.2147,[43]16.1276,[44]16.1080,[45]nan,[46]nan,[47]nan,[48]nan,[49]nan,[50]nan,[51]nan,[52]nan,[53]nan,[54]nan,[55]nan,[56]nan,[57]nan
Could be coincidental, but I don't think I noticed "nan" values with the March 6 llama.cpp I used before. Then again, maybe I did and maybe it's normal. I will report whether miqu-1-120b fails or not, and whether this has anything to do with it.
It should never produce nan - this is either a bug or the input data is invalid.
miqu-1-120b failed the same way.
I've been quantizing models for quite a while now, including many miqu variants made the same way. It's very unlikely that I just happened to pick invalid models in the last few days while the hundred or so models before happened to be fine. This is something that must have been introduced between March 6 and b2405, somehow - or an incompatibility between CUDA 12.3 and 12.4, as that was also upgraded.
Since considerable work and computing time has been spent on this and potentially needs to be redone, it would be nice if somebody could shed some light on what that error means, and whether it's the imatrix that is broken (and needs to be redone) or the quantize process using it (and whether the non-I-quants are affected and need to be redone - this probably affects about 60 repositories by now). I will try to downgrade to the March 6 version I used before (if I can identify the exact release) and will try to redo an imatrix, to see if the nan problem goes away.
save_imatrix: stored collected data after 40 chunks in miqu-1-120b.imatrix~ [40]16.3695,[41]16.2985,[42]16.2147,[43]16.1276,[44]16.1080,[45]15.9874,[46]16.0110,[47]15.9119,
nans do not appear with b2355.
Can you share specific instructions to reproduce the nan issue? Ideally with the smallest model that you are aware of that has the issue. Or run a git bisect to find the commit that introduced the issue.
With the Q8 quant from here: https://huggingface.co/mradermacher/BagelMix-8x7B-GGUF
And this invocation: imatrix -ofreq 20 -t 4 -ngl 0 -mg 0 -m ... -o ... -f ...
I reliably get nan after 5 chunks, on two different computers (to exclude a hardware problem). I cannot share the training data, unfortunately :( I'll try to replicate it with something else, but I can't replicate it with e.g. group_10_merged.txt, so it might be somehow training-data dependent (although it does not happen at the same chunk for other models).
Actually, I wrote the above while the test with group_10_merged.txt was nearing its end, and it did give a nan result near the end after all (that was with https://github.com/ggerganov/llama.cpp/files/14143306/group_10_merged.txt, btw).
The BagelMix nan happens with b2355 as well. And the nan issue correlates with the crash in at least one case (miqu-1-120b). Also, the same imatrix fails the same way with CUDA 12.3 and 12.4.
To summarize my findings (because this ticket is a bit confusing):
The nan issue happens deterministically, i.e. with the same imatrix binary, model and training data I get nan in the same chunk each time. It happens on my main workstation with either my rtx 4090, rtx 4070, or on my other machine with an rtx 4060, so it is not a hardware problem.
It does NOT happen with CUDA_VISIBLE_DEVICES=, so it seems to be a CUDA issue.
It happens with the Mar 6 (b2355) version as well, but in a different chunk (or maybe it is not deterministic with miqu-1-120b, because that worked with Mar 6 but not with the Mar 11 binary, one test only).
I have done many imatrix quants in the last two months with the same imatrix input, and it only started to happen recently - either because I was unlucky with models, or because something changed recently. (It is possible, but not likely, that quantize only started to crash recently and the nan issue is older - I have not checked the output of earlier imatrix runs.) I was wrong about it being the Mar 6 to Mar 11 upgrade: miqu-1-120b failed with the Mar 11 version but worked with the Mar 6 one, yet BagelMix fails with the Mar 6 version as well.
I was able to reproduce the issue with wikitext-2-raw/wiki.test.raw at chunk 9 with:
./imatrix -ofreq 20 -t 4 -ngl 20 -mg 0 -m models/BagelMix-8x7B.Q8_0.gguf -o imatrix-bagelmix.dat -f wikitext-2-raw/wiki.test.raw
The issue is that there is a matrix multiplication in ffn_moe_down-30 that produces values that cannot be represented in a 16-bit floating point number. This results in inf values, which then turn into nan. Forcing the matrix multiplications to run in 32-bit fixes it. You might be able to get a valid imatrix with a LLAMA_CUDA_FORCE_MMQ build, but ultimately I think this is an issue with the model.
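For illustration, here is a minimal standalone sketch of that failure mode (a toy example, not llama.cpp code): a value beyond the fp16 range (about 65504) overflows to inf when narrowed to half precision, and the inf then turns into nan in later arithmetic, e.g. inf - inf in a reduction. It assumes the _Float16 extension type available in recent GCC/Clang.

// Toy demo of fp16 overflow -> inf -> nan (assumes _Float16 support, e.g. recent GCC/Clang).
#include <cmath>
#include <cstdio>

int main() {
    float acc = 70000.0f;            // fine in fp32, but above the fp16 maximum of 65504
    _Float16 h = (_Float16) acc;     // overflows to +inf when narrowed to fp16
    float back = (float) h;          // the inf propagates back into fp32 math
    float bad  = back - back;        // inf - inf yields nan
    printf("half=%f diff=%f isnan=%d\n", back, bad, (int) std::isnan(bad));
    return 0;
}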
With BagelMix, it happens all the way back to b2060 (Feb 4). Not sure what to test next - it seems to affect only CUDA, in practically all versions.
@slaren ah, wow, thanks for tracking it down! That's probably why it works on the CPU then. What's strange is that it seems to affect a lot of models that otherwise seem to work fine - but maybe it's just not detected during inference.
LLAMA_CUDA_FORCE_MMQ does indeed seem to work around this, with no discernible speed loss in my config either.
In the meantime, I have added this to catch this specific problem earlier; maybe it would be a good and cheap idea to add something like it:
diff --git a/examples/imatrix/imatrix.cpp b/examples/imatrix/imatrix.cpp
index f21bc48f..322e3235 100644
--- a/examples/imatrix/imatrix.cpp
+++ b/examples/imatrix/imatrix.cpp
@@ -440,6 +440,9 @@ static bool compute_imatrix(llama_context * ctx, const gpt_params & params, bool
             printf("[%d]%.4lf,", i + 1, std::exp(nll / count));
             fflush(stdout);
 
+            if (std::isnan (std::exp(nll / count)))
+                abort ();
+
             logits.clear();
         }
     }
For the next reader suffering from these problems: LLAMA_CUDA_FORCE_MMQ does not, unfortunately, work in all cases (example model: Meidebenne-120b-v1.0).
And an anecdotal statistic: at the moment, roughly a third of the models I quantize on huggingface either have trouble during imatrix generation or later during IQ1_S or other quants.
Just wanted to add that this still affects a large number of models - almost half of the llama-3 models I quantize can't generate IQ3_XXS or other i-quants without nans during imatrix generation.
Recent example of an imatrix: https://huggingface.co/mradermacher/llama-3-70B-instruct-uncensored-i1-GGUF
This issue was closed because it has been inactive for 14 days since being marked as stale.