
Eval bug: trivial grammar crashes (DeepSeek R1 Distill Llama 8B)

Open ochafik opened this issue 10 months ago • 8 comments

Name and Version

latest

Operating systems

No response

Which llama.cpp modules do you know to be affected?

libllama (core library)

Command line

llama-cli -hf bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M --grammar 'root ::= "{"' -p hey -no-cnv

Problem description & steps to reproduce

With the following extremely simple grammar, somehow by the time we reach the grammar sampler there's only 1 candidate left (@), and it hard crashes.

First Bad Commit

cc/ @ggerganov could this be related to any recent refactoring? (~~https://github.com/ggerganov/llama.cpp/pull/10803 maybe?~~ I'll try and bisect)

Relevant log output

hey/tmp/llama.cpp-20250131-5280-k2rjfn/src/llama-grammar.cpp:1216: GGML_ASSERT(!grammar.stacks.empty()) failed

ochafik avatar Feb 02 '25 11:02 ochafik

Tried w/ --samplers "" and this time it's crashing on <|reserved_special_token_247|>, which I'm not sure should have made it this far (maybe wrong token type in the GGUF?).

Regardless, will probably replace the GGML_ASSERT(!grammar.stacks.empty()) with:

    if (grammar.stacks.empty()) {
        throw std::runtime_error("Unexpected empty grammar stack after accepting piece: " + piece);
    }
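For context, a minimal sketch of the idea with a toy grammar structure and hypothetical helper names (illustrative only, not the actual llama-grammar.cpp code path): once an accepted piece has been fed into the grammar, an empty stack set means the text cannot be parsed, and throwing lets the caller handle that instead of aborting the whole process via GGML_ASSERT.

    // Toy illustration only -- not the actual llama.cpp internals.
    #include <stdexcept>
    #include <string>
    #include <utility>
    #include <vector>

    struct toy_grammar {
        // each stack is one live parse position; no stacks left == no valid parse
        std::vector<std::string> stacks = {"{"};  // toy rule: expect a literal "{"
    };

    static void accept_char(toy_grammar & g, char c) {
        std::vector<std::string> next;
        for (const auto & s : g.stacks) {
            if (!s.empty() && s.front() == c) {
                next.push_back(s.substr(1));  // consume one expected character
            }
        }
        g.stacks = std::move(next);  // becomes empty if nothing matched
    }

    void accept_piece(toy_grammar & g, const std::string & piece) {
        for (char c : piece) {
            accept_char(g, c);
        }
        if (g.stacks.empty()) {
            // a hard GGML_ASSERT here would abort the process instead
            throw std::runtime_error("Unexpected empty grammar stack after accepting piece: " + piece);
        }
    }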

ochafik avatar Feb 02 '25 12:02 ochafik

Seems specific to DeepSeek-R1-Distill-Llama-8B-GGUF (the Qwen 7B & 32B distills don't crash with that grammar)

ochafik avatar Feb 02 '25 12:02 ochafik

It does not crash on my end:

llama-cli --version
version: 4570 (6e84b0ab)
built with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.2.0

llama-cli -hf bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M --grammar 'root ::= "{"' -p hey -no-cnv

0.01.168.912 I system_info: n_threads = 16 (n_threads_batch = 16) / 24 | Metal : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | AARCH64_REPACK = 1 | 
0.01.168.912 I 
0.01.169.188 I sampler seed: 2240775377
0.01.169.197 I sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
0.01.169.200 I sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
0.01.169.200 I generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1
0.01.169.202 I 
hey{ [end of text]


0.01.218.866 I llama_perf_sampler_print:    sampling time =       0.33 ms /     4 runs   (    0.08 ms per token, 12232.42 tokens per second)
0.01.218.877 I llama_perf_context_print:        load time =     453.76 ms
0.01.218.880 I llama_perf_context_print: prompt eval time =      19.98 ms /     2 tokens (    9.99 ms per token,   100.12 tokens per second)
0.01.218.882 I llama_perf_context_print:        eval time =      12.88 ms /     1 runs   (   12.88 ms per token,    77.67 tokens per second)
0.01.218.883 I llama_perf_context_print:       total time =      78.57 ms /     3 tokens
0.01.220.078 I ggml_metal_free: deallocating

ggerganov avatar Feb 02 '25 12:02 ggerganov

Oh, mine crashes w/ the following versions:

./build/bin/llama-cli --version
version: 4617 (90517ec4)
built with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.2.0

llama-cli --version  # homebrew
version: 4606 (a83f5286)
built with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.2.0

ochafik avatar Feb 02 '25 12:02 ochafik

fwiw, I'm getting this same exception when calling llama_sampler_init_grammar_lazy from llamasharp, consistently with any of the DeepSeek distills. It produces the think tags as expected, hits the trigger words, and then immediately fails. However, if I add a grammar via llama_sampler_init_grammar instead, there's no problem; I just need to adjust the gbnf to account for the thinking tags. I believe llamasharp is built against 4620.
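A minimal sketch of what that gbnf adjustment could look like (rule names and the think-tag handling are illustrative, not the grammar actually used here):

    # hypothetical grammar: allow a <think> block before the JSON payload
    root   ::= think answer
    think  ::= "<think>" [^<]* "</think>" "\n"?
    answer ::= "{" [^}]* "}"

With the non-lazy llama_sampler_init_grammar the grammar constrains the whole output, so the think block has to be representable in the grammar itself; the lazy variant is meant to avoid that by only activating at a trigger.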

phil-scott-78 avatar Feb 02 '25 23:02 phil-scott-78

@phil-scott-78 thanks for reporting!

Note that the Qwen distills should get better generally with https://github.com/ggerganov/llama.cpp/pull/11607 (although no changes related to grammar), and another possible thing might be the double bos situation (addressing in https://github.com/ggerganov/llama.cpp/pull/11616 ). Hope to circle back to this in a couple of days.

ochafik avatar Feb 03 '25 13:02 ochafik

right on. For what it's worth, I tried the lazy grammar again with Mistral-Small-24B-Instruct-2501. Gave it a prompt to include its thinking to force the issue. Same thing: it output its thinking, got to the </think>, and blew up on the assert when it came time to apply the grammar. All this is on 4620 though. I'll try to reproduce with llama.cpp directly when I get a chance; don't want to be chasing ghosts that are already resolved, since that project lags a bit behind.

phil-scott-78 avatar Feb 03 '25 15:02 phil-scott-78

Found at least one issue: if a token contains or completes a trigger and also adds text that can't be parsed by the grammar, then kaboom (this came up while testing upcoming changes that add even more triggers (ref); testing possible fixes).
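A hypothetical illustration of that failure mode (the trigger word, grammar, and token text below are invented for the example, not taken from any of the models discussed here):

    # lazy grammar, only enforced once the trigger word "<tool_call>" appears
    root ::= "<tool_call>" "{" [^}]* "}"

    # suppose the model emits a single token whose text is "<tool_call>\n":
    # the token completes the trigger, so the grammar switches on and tries to
    # accept the whole piece, but the trailing "\n" is not allowed after
    # "<tool_call>" by the rule above -- the stacks empty out and
    # GGML_ASSERT(!grammar.stacks.empty()) fires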

In any case, the original repro reported in this bug no longer crashes for me, probably because of ~~https://github.com/ggerganov/llama.cpp/pull/11616~~ (edit) https://github.com/ggerganov/llama.cpp/pull/11607

ochafik avatar Feb 13 '25 10:02 ochafik

Will close this as I can't repro the original issue; please feel free to open a new one if you still experience problems!

ochafik avatar Feb 25 '25 16:02 ochafik