
GGML_ASSERT: ../llama.cpp/ggml-quants.c:10340: grid_index >= 0

Open schmorp opened this issue 11 months ago • 17 comments

I am converting many models to gguf, and (imatrix-) quantize them afterwards. Today, all my jobs failed with the error from the title (many output lines, probably one per thread).

This might be incidental, but I also upgraded from a checkout from March 6th to b2405.

Example model:

https://huggingface.co/Sao10K/Solstice-Mixtral-v1

imatrix file: https://huggingface.co/mradermacher/Solstice-Mixtral-v1-i1-GGUF/blob/main/imatrix.dat

Failing invocation:

quantize --allow-requantize --leave-output-tensor --imatrix imatrix.dat Mythical-Destroyer-V2-L2-13B.gguf IQ2_M


The normal Q-quants quantized fine (see https://huggingface.co/mradermacher/Solstice-Mixtral-v1-i1-GGUF), it's only when it moved to IQ2_M that it fails (that's the first I-quant my script generates).

schmorp avatar Mar 12 '24 14:03 schmorp

Actually, by now, only 5 of the 8 jobs failed, so it clearly depends on the model and is not universal.

schmorp avatar Mar 12 '24 16:03 schmorp

Using your imatrix and the vanilla Mixtral Instruct 0.1 also asserts on M2 Ultra. If I use the imatrix from https://huggingface.co/datasets/ikawrakow/imatrix-from-wiki-train/blob/main/mixtral-8x7b-instruct-v0.1.imatrix it does not assert. Not sure what this means, but providing this data point to help investigate

ggerganov avatar Mar 12 '24 16:03 ggerganov

Just noticed this during another imatrix calculation:

[20]17.0770,[21]17.3345,[22]17.2957,[23]17.5153,[24]17.4855,[25]17.4871,[26]17.4550,[27]17.4515,[28]17.3787,[29]17.2030,[30]17.1163,[31]16.9776,[32]16.9705,[33]16.8447,[34]16.7977,[35]16.6126,[36]16.4577,[37]16.3338,[38]16.1736,[39]16.2237, save_imatrix: stored collected data after 40 chunks in miqu-1-120b.imatrix~ [40]16.3695,[41]16.2985,[42]16.2147,[43]16.1276,[44]16.1080,[45]nan,[46]nan,[47]nan,[48]nan,[49]nan,[50]nan,[51]nan,[52]nan,[53]nan,[54]nan,[55]nan,[56]nan,[57]nan

Could be incidental, but I don't think I noticed "nan" values with the March 6 llama.cpp I used before. But maybe I did and maybe it's normal. I will report whether miqu-1-120b fails or not, and whether this has anything to do with it.

schmorp avatar Mar 13 '24 12:03 schmorp

It should never produce nan - this is either a bug or the input data is invalid

ggerganov avatar Mar 13 '24 17:03 ggerganov

miqu-1-120b failed the same way.

I've been quantizing models for quite a while now, including many miqu variants made the same way. It's very unlikely that I just happened to pick invalid models in the last few days while the hundred or so models before happened to be fine. This must be something introduced between March 6 and b2405, somehow, or an incompatibility between CUDA 12.3 and 12.4, as that was also upgraded.

Since considerable work and computing time has been spent on this and potentially needs to be redone, it would be nice if somebody could shed some light on what that error means, and whether it's the imatrix that is broken (and needs to be redone) or the quantize process using it (and whether the non-I-quants are affected and need to be redone - this probably affects about 60 repositories by now). I will try to downgrade to the March 6 version I used before (if I can identify the exact release) and will try to redo an imatrix, to see if the nan problem goes away.

schmorp avatar Mar 13 '24 21:03 schmorp

save_imatrix: stored collected data after 40 chunks in miqu-1-120b.imatrix~ [40]16.3695,[41]16.2985,[42]16.2147,[43]16.1276,[44]16.1080,[45]15.9874,[46]16.0110,[47]15.9119,

nan's do not appear with b2355.

schmorp avatar Mar 13 '24 22:03 schmorp

Can you share specific instructions to reproduce the nan issue? Ideally with the smallest model that you are aware that has the issue. Or run a git bisect to find the commit that introduced the issue.

slaren avatar Mar 13 '24 22:03 slaren

With the Q8 quant from here: https://huggingface.co/mradermacher/BagelMix-8x7B-GGUF

And this invocation: imatrix -ofreq 20 -t 4 -ngl 0 -mg 0 -m ... -o ... -f ...

I reliably get nan after 5 chunks, on two different computers (to exclude a hardware problem). I cannot share the training data, unfortunately :( I'll try to replicate it with something else, but I can't replicate it with e.g. group_10_merged.txt, so it might be somehow training-data dependent (although it does not happen at the same chunk for other models).

schmorp avatar Mar 14 '24 02:03 schmorp

Actually, I wrote the above while the test with group_10_merged.txt was still running; it did in fact give a nan result near the end (with https://github.com/ggerganov/llama.cpp/files/14143306/group_10_merged.txt, btw.)

schmorp avatar Mar 14 '24 02:03 schmorp

The BagelMix nan happens with b2355 as well. And the nan issue correlates with the crash in at least one case (miqu-1-120b). Also, the same imatrix fails the same way with CUDA 12.3 and 12.4.

To summarize my findings (because this ticket is a bit confusing).

The nan issue happens deterministically, i.e. with the same imatrix binary, model and training data I get nan in the same chunk each time. It happens on my main workstation with either my rtx 4090, rtx 4070, or on my other machine with an rtx 4060, so it is not a hardware problem.

It does NOT happen with "CUDA_VISIBLE_DEVICES=", so it seems to be a CUDA issue.

It happens with the Mar 6 (b2355) version as well, but in a different chunk (or maybe it is not deterministic with miqu-1-120b, which worked with the Mar 6 binary but not with the Mar 11 one - one test only).

I have done many imatrix quants in the last two months with the same imatrix input, and it only started to happen recently - either because I was unlucky with models, or because something changed recently (it is possible, but not likely, that quantize started to crash only recently and the nan issue is older - I have not checked the output of earlier imatrix runs). I was wrong about it being the Mar 6 to Mar 11 upgrade: miqu-1-120b failed with the Mar 11 version but worked with the Mar 6 one, yet BagelMix fails with the Mar 6 version as well.

schmorp avatar Mar 14 '24 10:03 schmorp

I was able to reproduce the issue with wikitext-2-raw/wiki.test.raw at chunk 9 such as:

./imatrix -ofreq 20 -t 4 -ngl 20 -mg 0 -m models/BagelMix-8x7B.Q8_0.gguf -o imatrix-bagelmix.dat -f wikitext-2-raw/wiki.test.raw 

The issue is that there is a matrix multiplication in ffn_moe_down-30 that produces values that cannot be represented in a 16-bit floating point number. This results in inf values, which then turn into nan. Forcing the matrix multiplications to run on 32-bit fixes it. You might be able to get a valid imatrix with a LLAMA_CUDA_FORCE_MMQ build, but ultimately I think this is an issue with the model.
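For readers less familiar with the mechanics, here is a minimal sketch in plain C++ of how this plays out, simulating the fp16 range instead of using the actual CUDA kernels (the largest finite fp16 value is 65504; anything beyond it saturates to inf, and subsequent arithmetic on inf easily produces nan, which then propagates into the imatrix statistics):

#include <cmath>
#include <cstdio>
#include <limits>

int main() {
    // fp16 cannot represent finite values above 65504; a matmul partial sum
    // that exceeds this (as described for ffn_moe_down-30) overflows to +inf.
    const float FP16_MAX = 65504.0f;
    float partial_sum = 70000.0f;                       // fits in fp32, not in fp16
    float as_fp16 = partial_sum > FP16_MAX
                  ? std::numeric_limits<float>::infinity()
                  : partial_sum;                        // simulated fp16 overflow to inf

    printf("fp16 view of the sum: %f\n", as_fp16);      // inf
    printf("inf - inf = %f\n", as_fp16 - as_fp16);      // nan, e.g. in a later normalization
    printf("isnan: %d\n", (int)std::isnan(as_fp16 - as_fp16)); // 1 - and it propagates from here
    return 0;
}

Doing the multiplication with 32-bit accumulation keeps the intermediate sums representable, which is presumably why forcing fp32 (or building with LLAMA_CUDA_FORCE_MMQ, as mentioned above) can avoid the nan.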

slaren avatar Mar 14 '24 12:03 slaren

With BagelMix, it happens all the way back to b2060 (Feb 4). Not sure what to test next - it seems to affect only CUDA, in practically all versions.

schmorp avatar Mar 14 '24 12:03 schmorp

@slaren ah, wow, thanks for tracking it down! That's probably why it works on the CPU then. What's strange is that it seems to affect a lot of models that otherwise seem to work fine - but maybe it's just not detected during inference.

schmorp avatar Mar 14 '24 12:03 schmorp

LLAMA_CUDA_FORCE_MMQ does indeed seem to work around this, at no discernible speed loss in my config, too.

In the meantime, I have added this to catch this specific problem earlier; maybe it would be a good and cheap idea to add something like it:

diff --git a/examples/imatrix/imatrix.cpp b/examples/imatrix/imatrix.cpp
index f21bc48f..322e3235 100644
--- a/examples/imatrix/imatrix.cpp
+++ b/examples/imatrix/imatrix.cpp
@@ -440,6 +440,9 @@ static bool compute_imatrix(llama_context * ctx, const gpt_params & params, bool
             printf("[%d]%.4lf,", i + 1, std::exp(nll / count));
             fflush(stdout);
 
+            if (std::isnan (std::exp(nll / count)))
+               abort ();
+
             logits.clear();
         }
     }

schmorp avatar Mar 14 '24 13:03 schmorp

For the next reader suffering from these problems: LLAMA_CUDA_FORCE_MMQ does not, unfortunately, work in all cases (example model: Meidebenne-120b-v1.0).

schmorp avatar Mar 15 '24 21:03 schmorp

And an anecdotal statistic: at the moment, roughly a third of the models I quantize on huggingface either have trouble during imatrix generation or later during IQ1_S or other quants.

schmorp avatar Mar 15 '24 22:03 schmorp

Just wanted to add that this still affects a large number of models - almost half of the llama-3 models I quantize can't generate IQ3_XXS or other I-quants because of nans during imatrix generation.

Recent example of an affected imatrix: https://huggingface.co/mradermacher/llama-3-70B-instruct-uncensored-i1-GGUF

schmorp avatar Apr 23 '24 04:04 schmorp

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Jun 08 '24 01:06 github-actions[bot]