Eval bug: trivial grammar crashes (DeepSeek R1 Distill Llama 8B)
Name and Version
latest
Operating systems
No response
Which llama.cpp modules do you know to be affected?
libllama (core library)
Command line
llama-cli -hf bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M --grammar 'root ::= "{"' -p hey -no-cnv
Problem description & steps to reproduce
With this extremely simple grammar (root ::= "{"), somehow by the time we reach the grammar sampler there's only 1 candidate (@), and it hard-crashes.
First Bad Commit
cc/ @ggerganov could this be related to any recent refactoring? (~~https://github.com/ggerganov/llama.cpp/pull/10803 maybe?~~ I'll try and bisect)
Relevant log output
hey/tmp/llama.cpp-20250131-5280-k2rjfn/src/llama-grammar.cpp:1216: GGML_ASSERT(!grammar.stacks.empty()) failed
Tried w/ --samplers "" and this time it's crashing on <|reserved_special_token_247|>, which I'm not sure should have made it this far (maybe wrong token type in the GGUF?).
Regardless, will probably replace the GGML_ASSERT(!grammar.stacks.empty()) with:
if (grammar.stacks.empty()) {
    throw std::runtime_error("Unexpected empty grammar stack after accepting piece: " + piece);
}
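For illustration, here's a tiny self-contained toy sketch of that proposal (not the real llama.cpp source — the grammar machinery and all names below are made up): advance the surviving parse stacks for each accepted character, and throw instead of asserting when none survive.

#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for llama_grammar: each "stack" is just the text still expected.
struct toy_grammar {
    std::vector<std::string> stacks;
};

static void accept_piece(toy_grammar & grammar, const std::string & piece) {
    for (char c : piece) {
        std::vector<std::string> next;
        for (const auto & st : grammar.stacks) {
            if (!st.empty() && st.front() == c) {
                next.push_back(st.substr(1)); // this stack can consume the character
            }
        }
        grammar.stacks = std::move(next);
    }
    if (grammar.stacks.empty()) {
        // the check proposed above, in place of GGML_ASSERT(!grammar.stacks.empty())
        throw std::runtime_error("Unexpected empty grammar stack after accepting piece: " + piece);
    }
}

int main() {
    toy_grammar g{{"{"}};   // roughly `root ::= "{"`
    accept_piece(g, "{");   // fine
    accept_piece(g, "hey"); // throws instead of hard-crashing the process
}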
Seems specific to DeepSeek-R1-Distill-Llama-8B-GGUF (the Qwen 7B & 32B distills don't crash with that grammar)
It does not crash on my end:
llama-cli --version
version: 4570 (6e84b0ab)
built with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.2.0
llama-cli -hf bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M --grammar 'root ::= "{"' -p hey -no-cnv
0.01.168.912 I system_info: n_threads = 16 (n_threads_batch = 16) / 24 | Metal : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | AARCH64_REPACK = 1 |
0.01.168.912 I
0.01.169.188 I sampler seed: 2240775377
0.01.169.197 I sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
0.01.169.200 I sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
0.01.169.200 I generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1
0.01.169.202 I
hey{ [end of text]
0.01.218.866 I llama_perf_sampler_print: sampling time = 0.33 ms / 4 runs ( 0.08 ms per token, 12232.42 tokens per second)
0.01.218.877 I llama_perf_context_print: load time = 453.76 ms
0.01.218.880 I llama_perf_context_print: prompt eval time = 19.98 ms / 2 tokens ( 9.99 ms per token, 100.12 tokens per second)
0.01.218.882 I llama_perf_context_print: eval time = 12.88 ms / 1 runs ( 12.88 ms per token, 77.67 tokens per second)
0.01.218.883 I llama_perf_context_print: total time = 78.57 ms / 3 tokens
0.01.220.078 I ggml_metal_free: deallocating
Oh, mine crashes w/ the following versions:
./build/bin/llama-cli --version
version: 4617 (90517ec4)
built with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.2.0
llama-cli --version # homebrew
version: 4606 (a83f5286)
built with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.2.0
fwiw, I'm getting this same exception when calling llama_sampler_init_grammar_lazy from LLamaSharp, consistently, with any of the DeepSeek distills. It produces the think tags as expected, hits the trigger words, and then immediately fails. However, if I add a grammar via llama_sampler_init_grammar instead, there's no problem; I just need to adjust the GBNF to account for the thinking tags. I believe LLamaSharp is built against 4620.
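For anyone comparing the two code paths against raw llama.cpp rather than LLamaSharp, here's a rough sketch. The exact llama_sampler_init_grammar_lazy parameter names/order are recalled from that era of the API and should be treated as an assumption (the API was later reworked), and the </think> trigger is just the example this thread discusses.

#include "llama.h"

// Sketch: building either an eager or a lazy grammar sampler for a GBNF grammar.
static llama_sampler * make_grammar_sampler(const llama_vocab * vocab, bool lazy) {
    const char * gbnf = "root ::= \"{\"";   // the grammar from this issue

    if (!lazy) {
        // Eager: the grammar constrains every token from the first one,
        // so the GBNF itself has to allow the <think>...</think> preamble.
        return llama_sampler_init_grammar(vocab, gbnf, "root");
    }

    // Lazy: sampling is unconstrained until a trigger word appears in the output,
    // then the grammar takes over — this is the path that hit the assert here.
    const char * trigger_words[] = { "</think>" };
    return llama_sampler_init_grammar_lazy(vocab, gbnf, "root",
                                           trigger_words, 1,
                                           /*trigger_tokens =*/ nullptr, 0);
}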
@phil-scott-78 thanks for reporting!
Note that the Qwen distills should generally get better with https://github.com/ggerganov/llama.cpp/pull/11607 (although there are no grammar-related changes there), and another possible factor is the double-BOS situation (being addressed in https://github.com/ggerganov/llama.cpp/pull/11616). I hope to circle back to this in a couple of days.
right on. For what it's worth, I tried the lazy grammar again with Mistral-Small-24B-Instruct-2501. I gave it a prompt asking it to include its thinking, to force the issue. Same thing: it output its thinking, got to the </think>, and blew up on the assert when it came time to apply the grammar. All this is on 4620 though; I'll try to reproduce with llama.cpp directly when I get a chance. I don't want to be chasing ghosts that were already resolved because that project lags a bit.
Found at least one issue: if a token contains or completes a trigger and adds text that can't be parsed by the grammar, then kaboom (came up while testing upcoming changes that add even more triggers (ref); testing possible fixes).
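To make that failure mode concrete, here is a standalone toy sketch (no llama.cpp dependency; the trigger and token values are made up for illustration) of what happens when a single token both completes the trigger and carries extra text the grammar can't parse:

#include <iostream>
#include <string>

int main() {
    const std::string trigger = "</think>";
    const std::string piece   = "</think>\n\n";  // one token: trigger plus trailing text

    // When the trigger fires mid-token, the text after it is fed to the grammar.
    const auto pos = piece.find(trigger);
    const std::string fed_to_grammar = (pos == std::string::npos)
        ? std::string{}
        : piece.substr(pos + trigger.size());

    // With `root ::= "{"`, only "{" is derivable: "\n\n" kills every parse stack,
    // which is where GGML_ASSERT(!grammar.stacks.empty()) blows up.
    const bool derivable = fed_to_grammar.empty() || fed_to_grammar[0] == '{';
    std::cout << (derivable ? "ok" : "kaboom: no surviving grammar stacks") << "\n";
}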
In any case, the issue reported here no longer reproduces for me, probably because of ~~https://github.com/ggerganov/llama.cpp/pull/11616~~ (edit) https://github.com/ggerganov/llama.cpp/pull/11607
Will close this as I can't repro the original issue; please feel free to open a new one if you still experience problems!