llama.cpp
Eval bug: llama-cli, spurious token added to assistant response
Name and Version
version: 5327 (27ebfcac) built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
NVIDIA
Models
all
Problem description & steps to reproduce
After the user prompt is provided, the code enters this branch:
https://github.com/ggml-org/llama.cpp/blob/0cf6725e9f9a164c39f7a87214d60342f7f946d8/tools/main/main.cpp#L716
No new tokens are generated at this point.
However, the following code assumes that a new token was generated and inserts it into the assistant response:
https://github.com/ggml-org/llama.cpp/blob/0cf6725e9f9a164c39f7a87214d60342f7f946d8/tools/main/main.cpp#L824
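To illustrate the pattern, here is a minimal, self-contained sketch (not the actual main.cpp code; `embd`, `assistant_ss`, and the guard flag are simplified stand-ins) of the failure mode and of a guarded alternative:

```cpp
// Simplified model of the bug: after the prompt-consumption branch runs,
// nothing has been sampled, yet the last token in `embd` (the tail of the
// chat template) is appended to the assistant buffer as if it were generated.
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main() {
    // Pretend these are the tokens of the formatted user turn; the last
    // piece is the template's assistant-header suffix.
    std::vector<std::string> embd = {"<|user|>", "Hello", "<|assistant|>"};

    std::ostringstream assistant_ss;   // accumulates the assistant response
    bool token_was_sampled = false;    // hypothetical guard flag

    // Prompt-consumption branch: tokens come from the template, none are sampled.
    // (In main.cpp this corresponds to the branch around line 716.)

    // Buggy pattern: unconditionally treat embd.back() as a generated token.
    if (!embd.empty()) {
        assistant_ss << embd.back();   // "<|assistant|>" leaks into the response
    }

    // Guarded pattern: only record tokens that were actually sampled.
    std::ostringstream fixed_ss;
    if (token_was_sampled && !embd.empty()) {
        fixed_ss << embd.back();
    }

    std::cout << "buggy assistant buffer: '" << assistant_ss.str() << "'\n";
    std::cout << "fixed assistant buffer: '" << fixed_ss.str() << "'\n";
    return 0;
}
```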
First Bad Commit
No response
Relevant log output
The easiest way is to set a breakpoint here and wait for the assistant message:
https://github.com/ggml-org/llama.cpp/blob/0cf6725e9f9a164c39f7a87214d60342f7f946d8/tools/main/main.cpp#L270
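Something along these lines should reproduce it (assuming a CMake build; the model path is a placeholder, and `-cnv` enables conversation mode):

```
gdb --args ./build/bin/llama-cli -m models/model.gguf -cnv
(gdb) break main.cpp:270
(gdb) run
# type a user message, then inspect the token at the breakpoint when the
# assistant response is assembled
```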
I noticed this char before; I always just assumed it was a spurious prompt print (since most templates end with >), but I see now that it's repeating the last processed token of the template.
Hi, is the bug fixed? If not, can I pick it up?
AFAIK no, please do.