
Feature Request: Prefix assistant answer

Open 99991 opened this issue 1 year ago • 8 comments

Prerequisites

  • [x] I am running the latest code. Mention the version if possible as well.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Mistral's API allows prefixing the assistant's answer with a specified string. Excerpt from the documentation:

    messages=[
        {"role": "system", "content": system},
        {"role": "user", "content": question},
        {"role": "assistant", "content": prefix, "prefix": True}, # <------- this line here is new
    ],

This makes it so that the next answer by the assistant starts with the given prefix.

Motivation

The option to prefix the assistant's response gives a great deal of control over the model's generation while being much simpler to use than the alternatives.

For example, to force the model to answer directly with Java code using a specific function signature, the prefix could be "```java\nint add(int x, int y){". This technique is used to generate code for benchmarks such as HumanEval, to prevent the models from going off the rails.

Possible Implementation

A full usage example could look something like this:

# Example to generate a function named "quacksort".
# Currently, llama-server ignores the prefix and generates "quicksort" instead.
import requests

def does_not_work_yet():
    url = "http://localhost:8080/v1/chat/completions"

    prefix = "```go\nfunc quacksort"

    data =  {
        "messages": [
            {"role": "system", "content": "Only provide code. Do not write explanations."},
            {"role": "user", "content": "Implement quicksort."},
            {"role": "assistant", "content": prefix, "prefix": True}, # <----- this line here is new
        ],
        "seed": 0,
    }

    with requests.post(url, json=data) as response:
        content = response.json()["choices"][0]["message"]["content"]

    print(content)

if __name__ == "__main__":
    does_not_work_yet()

(I used the qwen2.5-coder-7b-instruct-q3_k_m model: llama-server --model qwen2.5-coder-7b-instruct-q3_k_m.gguf --host 127.0.0.1 --port 8080)

The expected result can be obtained with the raw completion API, but this is not portable from model to model since it requires knowledge of the prompt format. It is also more complicated and generally error-prone, since a single misplaced whitespace character or line break can have a significant impact on generation quality.


import requests

def works_but_ugly():
    url = "http://localhost:8080/completion"

    prefix = "```go\nfunc quacksort"

    prompt = f"""<|im_start|>system
Only provide code. Do not write explanations.<|im_end|>
<|im_start|>user
Implement quicksort.<|im_end|>
<|im_start|>assistant
{prefix}"""

    data = {
        "prompt": prompt,
        "seed": 0,
    }

    with requests.post(url, json=data) as response:
        content = prefix + response.json()["content"]

    print(content)

if __name__ == "__main__":
    works_but_ugly()

99991 avatar Jan 31 '25 06:01 99991

Right now the workaround is to use the new /apply-template endpoint in llama-server, added in a recent commit. It's explained here: https://github.com/ggerganov/llama.cpp/tree/master/examples/server#post-apply-template-apply-chat-template-to-a-conversation

matteoserva avatar Jan 31 '25 13:01 matteoserva

Right now the workaround is to use the new /apply-template endpoint in llama-server, added in a recent commit. It's explained here: https://github.com/ggerganov/llama.cpp/tree/master/examples/server#post-apply-template-apply-chat-template-to-a-conversation

Great! With this new /apply-template endpoint, we are already halfway there.

Is there an equivalent /parse-template endpoint to convert the raw chat template string back to JSON?

import requests

def apply_template():
    url = "http://localhost:8080/apply-template"

    prefix = "```go\nfunc quacksort"

    data =  {
        "messages": [
            {"role": "system", "content": "Only provide code. Do not write explanations."},
            {"role": "user", "content": "Implement quicksort."},
        ],
    }

    with requests.post(url, json=data) as response:
        prompt = response.json()["prompt"]

    data = {
        "prompt": prompt + prefix,
        "seed": 0,
    }

    url = "http://localhost:8080/completion"

    with requests.post(url, json=data) as response:
        content = prefix + response.json()["content"]

    print(content)

if __name__ == "__main__":
    apply_template()

99991 avatar Jan 31 '25 13:01 99991

The templating system used by the models doesn't support parsing; it's not llama.cpp's fault. Anyway, you can put the answer back into your messages array yourself:

import requests

def perform_inference(messages, prefix):
    url = "http://localhost:8080/apply-template"

    data =  {
        "messages": messages
    }

    with requests.post(url, json=data) as response:
        prompt = response.json()["prompt"]

    data = {
        "prompt": prompt + prefix,
        "seed": 0,
    }

    url = "http://localhost:8080/completion"

    with requests.post(url, json=data) as response:
        content = prefix + response.json()["content"]

    messages = messages + [{"role": "assistant", "content": content}]
    return messages

if __name__ == "__main__":
    messages = [
            {"role": "system", "content": "Only provide code. Do not write explanations."},
            {"role": "user", "content": "Implement quicksort."},
        ]
    prefix = "```go\nfunc quacksort"
    updated_messages = perform_inference(messages, prefix)
    print(updated_messages)
    
    

matteoserva avatar Jan 31 '25 14:01 matteoserva

+1 for this - for me, not supporting prefix in /v1/chat/completions is the largest gap between llama.cpp and common API providers & LM Studio...

Dango233 avatar Feb 06 '25 15:02 Dango233

The feature already exists in the form of custom GBNF grammars! You can pass a custom GBNF grammar as the grammar parameter in server completion requests, or via the --grammar or --grammar-file command-line option. An example grammar file is: root ::= "```go\nfunc quacksort" .*

hdu-hh avatar Feb 07 '25 16:02 hdu-hh

The feature already exists in the form of custom GBNF grammars!

Great! It works!

import requests

url = "http://localhost:8080/v1/chat/completions"

def prefix_using_grammar():
    prefix = "```go\nfunc quacksort"

    data =  {
        "messages": [
            {"role": "system", "content": "Only provide code. Do not write explanations."},
            {"role": "user", "content": "Implement quicksort."},
        ],
        "grammar": f'root ::= "{prefix}" .*', # <---------- this line here is new
        "seed": 0,
    }

    with requests.post(url, json=data) as response:
        content = response.json()["choices"][0]["message"]["content"]
    print(content)

if __name__ == "__main__":
    prefix_using_grammar()

All that is required is to add the grammar to the data object:

data = {
    ...
    "grammar": f'root ::= "{prefix}" .*',
}

For me, this is good enough, but I wonder whether "prefix": True should be implemented anyway to have API compatibility with Mistral.

EDIT: I tested this a bit and I think there is an optimization missing: sequences of consecutive tokens that are uniquely determined by the grammar should be batch-computed. The performance makes me think that they are currently evaluated sequentially.

99991 avatar Feb 07 '25 16:02 99991

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Mar 24 '25 01:03 github-actions[bot]

@ggerganov Could you please reopen this issue? The grammar-workaround works, but a more efficient solution is possible.

99991 avatar Mar 24 '25 08:03 99991

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar May 09 '25 01:05 github-actions[bot]

this is solved by https://github.com/ggml-org/llama.cpp/pull/13174

matteoserva avatar May 09 '25 05:05 matteoserva

this is solved by #13174

~~Do you have an example how to use this? I can only see an example for /apply-template.~~

EDIT: It seems like assistant answers are automatically completed now. I think the Mistral API with an additional "prefix": True key in the message is better because it is more explicit about what should happen. The current API does not allow generating a user response after a final assistant message without also completing that assistant message.

Does https://github.com/ggml-org/llama.cpp/pull/13174 use token healing? I get different results compared to the grammar approach with Qwen2.5-Coder-7B-Instruct (an additional space after the prefix). With Gemma-27B, I even get incorrect results (no colon generated after the function header).
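For context on what token healing means here, a conceptual toy sketch of the general technique (NOT how llama.cpp or the linked PR is actually implemented): when the prompt ends in the middle of what would normally be a single token, the last token is dropped and sampling is constrained to vocabulary entries that start with the dropped text, so the model can pick a longer merged token such as "sort(" instead of being forced to continue after a bare "sort". The vocabulary below is invented for illustration:

```python
# Toy greedy longest-match tokenizer over a tiny hand-made vocabulary.
# Purely a conceptual sketch of token healing in general.
VOCAB = ["def", " quick", "sort(", "sort", " ", "(", "values", ")"]

def tokenize(text):
    """Greedily match the longest vocabulary entry at each position."""
    tokens = []
    while text:
        match = max((t for t in VOCAB if text.startswith(t)), key=len)
        tokens.append(match)
        text = text[len(match):]
    return tokens

def heal(tokens):
    """Drop the final token and list vocab entries that could extend it.

    Generation would then be constrained to these candidates, letting the
    model choose e.g. "sort(" as one token instead of "sort" plus "(".
    """
    *kept, last = tokens
    candidates = [t for t in VOCAB if t.startswith(last)]
    return kept, candidates

tokens = tokenize("def quicksort")
print(tokens)            # ['def', ' quick', 'sort']
kept, candidates = heal(tokens)
print(kept, candidates)  # ['def', ' quick'] ['sort(', 'sort']
```

Without healing, the prompt ends on the token "sort", a context the model rarely saw followed directly by "(" during training, which could explain artifacts like the extra space or missing colon above.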

import requests

url = "http://localhost:8080/v1/chat/completions"

prefix = "```def quicksort(values)"

def correct_prefix_using_grammar():
    data =  {
        "messages": [
            {"role": "system", "content": "Only provide code. Do not write explanations."},
            {"role": "user", "content": "Implement quicksort."},
        ],
        "grammar": f'root ::= "{prefix}" .*',
        "seed": 0,
        "max_tokens": 256,
    }

    with requests.post(url, json=data) as response:
        content = response.json()["choices"][0]["message"]["content"]
    print(content)

def incorrect_prefix_with_automatic_assistant_completion():
    data =  {
        "messages": [
            {"role": "system", "content": "Only provide code. Do not write explanations."},
            {"role": "user", "content": "Implement quicksort."},
            {"role": "assistant", "content": prefix}, # <--- this gets completed automatically if the role is "assistant"
        ],
        "seed": 0,
        "max_tokens": 256,
    }

    with requests.post(url, json=data) as response:
        content = response.json()["choices"][0]["message"]["content"]
    print(prefix + content)

print("Correct:")
correct_prefix_using_grammar()
print("#" * 80)
print("Incorrect:")
incorrect_prefix_with_automatic_assistant_completion()

Output:

Correct:
```def quicksort(values):
    if len(values) <= 1:
        return values
    else:
        pivot = values.pop()
        less_than_pivot = [x for x in values if x <= pivot]
        greater_than_pivot = [x for x in values if x > pivot]
        return quicksort(less_than_pivot) + [pivot] + quicksort(greater_than_pivot)
```
################################################################################
Incorrect:
```def quicksort(values) -> list:
    if len(values) <= 1:
        return values
    else:
        pivot = values.pop()
        less_than_pivot = [x for x in values if x <= pivot]
        greater_than_pivot = [x for x in values if x > pivot]
        return quicksort(less_than_pivot) + [pivot] + quicksort(greater_than_pivot)
```

99991 avatar May 09 '25 08:05 99991