Misc. bug: "response_format" issue on the OpenAI-compatible "v1/chat/completions" endpoint
Name and Version
>llama-server --version
version: 4689 (90e4dba4)
built with MSVC 19.42.34436.0 for x64
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama-server -m Hermes-3-Llama-3.1-8B.Q4_K_M.gguf -a hermes-3-llama-3.1-8b --port 1234 --jinja -fa
Problem description & steps to reproduce
Using "response_format" to get the structured output doesn't seem to work properly when using the OpenAI compatible "v1/chat/completions" API. It keeps returning the "Either "json_schema" or "grammar" can be specified, but not both" error message.
I've tried using several different models from HF, and this issue happens no matter which model I loaded. The model that I used in the below samples are this one https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B
Request:
curl --location 'http://localhost:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Cookie: frontend_lang=en_US' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "chat_response",
"strict": true,
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}
}'
Response:
{
"error": {
"code": 400,
"message": "Either \"json_schema\" or \"grammar\" can be specified, but not both",
"type": "invalid_request_error"
}
}
I've tried changing response_format to various values, like the ones below, but it keeps returning the same error.
"response_format": {
"type": "json_schema", // either "json_schema" or "json_object" shows the same error
"json_schema": {
"name": "chat_response",
"strict": true,
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}
"response_format": {
"type": "json_schema", // either "json_schema" or "json_object" shows the same error
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
Even using the one in the documentation ({"type": "json_object"}) returns the same error:
{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"response_format": {"type": "json_object"}
}
In addition, I tried the POST /completion API with the same GGUF model, and it is able to return output following the defined JSON schema:
Request:
curl --location 'http://localhost:1234/completions' \
--header 'Content-Type: application/json' \
--header 'Cookie: frontend_lang=en_US' \
--data '{
"prompt": "<|im_start|>user\nhello<|im_end|>",
"json_schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}'
Response:
{
"index": 0,
"content": "{\n \"response\": \"Hello! How can I assist you today?\"\n}",
"tokens": [],
"id_slot": 0,
"stop": true,
"model": "hermes-3-llama-3.1-8b",
"tokens_predicted": 17,
"tokens_evaluated": 6,
"generation_settings": {
"n_predict": -1,
"seed": 4294967295,
"temperature": 0.800000011920929,
"dynatemp_range": 0.0,
"dynatemp_exponent": 1.0,
"top_k": 40,
"top_p": 0.949999988079071,
"min_p": 0.05000000074505806,
"xtc_probability": 0.0,
"xtc_threshold": 0.10000000149011612,
"typical_p": 1.0,
"repeat_last_n": 64,
"repeat_penalty": 1.0,
"presence_penalty": 0.0,
"frequency_penalty": 0.0,
"dry_multiplier": 0.0,
"dry_base": 1.75,
"dry_allowed_length": 2,
"dry_penalty_last_n": 4096,
"dry_sequence_breakers": [
"\n",
":",
"\"",
"*"
],
"mirostat": 0,
"mirostat_tau": 5.0,
"mirostat_eta": 0.10000000149011612,
"stop": [],
"max_tokens": -1,
"n_keep": 0,
"n_discard": 0,
"ignore_eos": false,
"stream": false,
"logit_bias": [],
"n_probs": 0,
"min_keep": 0,
"grammar": "char ::= [^\"\\\\\\x7F\\x00-\\x1F] | [\\\\] ([\"\\\\bfnrt] | \"u\" [0-9a-fA-F]{4})\nresponse-kv ::= \"\\\"response\\\"\" space \":\" space string\nroot ::= \"{\" space response-kv \"}\" space\nspace ::= | \" \" | \"\\n\" [ \\t]{0,20}\nstring ::= \"\\\"\" char* \"\\\"\" space\n",
"grammar_trigger_words": [],
"grammar_trigger_tokens": [],
"preserved_tokens": [],
"samplers": [
"penalties",
"dry",
"top_k",
"typ_p",
"top_p",
"min_p",
"xtc",
"temperature"
],
"speculative.n_max": 16,
"speculative.n_min": 5,
"speculative.p_min": 0.8999999761581421,
"timings_per_token": false,
"post_sampling_probs": false,
"lora": []
},
"prompt": "<|begin_of_text|><|im_start|>user\nhello<|im_end|>",
"has_new_line": true,
"truncated": false,
"stop_type": "eos",
"stopping_word": "",
"tokens_cached": 22,
"timings": {
"prompt_n": 6,
"prompt_ms": 1098.932,
"prompt_per_token_ms": 183.15533333333335,
"prompt_per_second": 5.459846469117288,
"predicted_n": 17,
"predicted_ms": 7322.017,
"predicted_per_token_ms": 430.7068823529412,
"predicted_per_second": 2.3217646175910276
}
}
First Bad Commit
No response
Relevant log output
Can you try this without using the --jinja flag when starting the server?
Without the --jinja flag it seems to work.
Request:
curl --location 'http://localhost:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Cookie: frontend_lang=en_US' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"response_format": {
"type": "json_object",
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}'
Response:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"content": "{\n \"response\": \"Hello! How can I assist you today?\"\n}",
"tool_calls": null,
"role": "assistant"
}
}
],
"created": 1739497907,
"model": "hermes-3-llama-3.1-8b",
"system_fingerprint": "b4689-90e4dba4",
"object": "chat.completion",
"usage": {
"completion_tokens": 17,
"prompt_tokens": 10,
"total_tokens": 27
},
"id": "chatcmpl-NSUouLxi9bvdjdnqwh7o1DLc4UmoCzsV",
"timings": {
"prompt_n": 1,
"prompt_ms": 624.063,
"prompt_per_token_ms": 624.063,
"prompt_per_second": 1.6024023215604835,
"predicted_n": 17,
"predicted_ms": 10365.442,
"predicted_per_token_ms": 609.7318823529412,
"predicted_per_second": 1.6400651318101052
}
}
However, without --jinja I now can't include any tools in the request.
Request:
curl --location 'http://localhost:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Cookie: frontend_lang=en_US' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello, can you tell me the current weather in New York?"
}
],
"response_format": {
"type": "json_object",
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
},
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current temperature for a given location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and country e.g. Bogotá, Colombia"
}
},
"required": [
"location"
],
"additionalProperties": false
},
"strict": true
}
}
]
}'
Response:
{
"error": {
"code": 500,
"message": "tools param requires --jinja flag",
"type": "server_error"
}
}
Is there any way to use both functionalities with the OpenAI-compatible chat completion API?
Is there any way to use both functionalities with the OpenAI-compatible chat completion API?
I think this might be a bug and I'm looking into it.
If we look at how this request is processed on the server, we find this handler:
const auto handle_chat_completions = [&ctx_server, &params, &res_error, &handle_completions_impl](const httplib::Request & req, httplib::Response & res) {
    LOG_DBG("request: %s\n", req.body.c_str());
    if (ctx_server.params_base.embedding) {
        res_error(res, format_error_response("This server does not support completions. Start it without `--embeddings`", ERROR_TYPE_NOT_SUPPORTED));
        return;
    }

    auto body = json::parse(req.body);
    json data = oaicompat_completion_params_parse(body, params.use_jinja, params.reasoning_format, ctx_server.chat_templates);

    return handle_completions_impl(
        SERVER_TASK_TYPE_COMPLETION,
        data,
        req.is_connection_closed,
        res,
        OAICOMPAT_TYPE_CHAT);
};
We can inspect the body from the request:
(gdb) pjson body
{
"model": "llama-2-7b-chat",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "chat_response",
"strict": true,
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}
}
This looks fine, and there is no grammar attribute in the body.
Next we have the call to:
json data = oaicompat_completion_params_parse(body, params.use_jinja, params.reasoning_format, ctx_server.chat_templates);
If we inspect the data after this call, we do see a grammar attribute:
(gdb) pjson data | shell jq
{
"stop": [],
"json_schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
},
"chat_format": 1,
"prompt": "<|im_start|>system\nRespond in JSON format, either with `tool_call` (a request to call tools) or with `response` reply to the user's request<|im_end|>\n<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\n",
"grammar": "alternative-0 ::= \"{\" space alternative-0-tool-call-kv \"}\" space\nalternative-0-tool-call ::= \nalternative-0-tool-call-kv ::= \"\\\"tool_call\\\"\" space \":\" space alternative-0-tool-call\nalternative-1 ::= \"{\" space alternative-1-response-kv \"}\" space\nalternative-1-response ::= \"{\" space alternative-1-response-response-kv \"}\" space\nalternative-1-response-kv ::= \"\\\"response\\\"\" space \":\" space alternative-1-response\nalternative-1-response-response-kv ::= \"\\\"response\\\"\" space \":\" space string\nchar ::= [^\"\\\\\\x7F\\x00-\\x1F] | [\\\\] ([\"\\\\bfnrt] | \"u\" [0-9a-fA-F]{4})\nroot ::= alternative-0 | alternative-1\nspace ::= | \" \" | \"\\n\" [ \\t]{0,20}\nstring ::= \"\\\"\" char* \"\\\"\" space\n",
"grammar_lazy": false,
"grammar_triggers": [],
"preserved_tokens": [],
"model": "llama-2-7b-chat",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "chat_response",
"strict": true,
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}
}
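The top-level "json_schema" entry in data comes from the handling of "response_format" earlier in the parsing. Roughly, that mapping looks like this (my paraphrase of the request parsing, not a verbatim copy of the llama.cpp source, so treat the exact branches as an assumption):

// Paraphrased sketch: the OpenAI-style "response_format" field is mapped onto
// the server's native "json_schema" parameter before the chat template is applied.
if (body.contains("response_format")) {
    json response_format      = json_value(body, "response_format", json::object());
    std::string response_type = json_value(response_format, "type", std::string());
    if (response_type == "json_object") {
        // {"type": "json_object", "schema": {...}}
        llama_params["json_schema"] = json_value(response_format, "schema", json::object());
    } else if (response_type == "json_schema") {
        // OpenAI-style {"type": "json_schema", "json_schema": {"schema": {...}}}
        json json_schema = json_value(response_format, "json_schema", json::object());
        llama_params["json_schema"] = json_value(json_schema, "schema", json::object());
    } else if (!response_type.empty() && response_type != "text") {
        throw std::runtime_error("response_format type must be one of \"text\" or \"json_object\", but got: " + response_type);
    }
}

The "grammar" entry, on the other hand, is added by the jinja code path shown below.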
If we look in oaicompat_completion_params_parse we can see the following:
// Apply chat template to the list of messages
if (use_jinja) {
    ...
    // TODO: support mixing schema w/ tools beyond generic format.
    inputs.json_schema = json_value(llama_params, "json_schema", json());

    auto chat_params = common_chat_params_init(tmpl, inputs);

    llama_params["chat_format"] = static_cast<int>(chat_params.format);
    llama_params["prompt"] = chat_params.prompt;
    llama_params["grammar"] = chat_params.grammar;
    llama_params["grammar_lazy"] = chat_params.grammar_lazy;
    auto grammar_triggers = json::array();
    for (const auto & trigger : chat_params.grammar_triggers) {
        grammar_triggers.push_back({
            {"word", trigger.word},
            {"at_start", trigger.at_start},
        });
    }
    llama_params["grammar_triggers"] = grammar_triggers;
And if we inspect the chat_params we can see that the grammar attribute is there:
(gdb) p chat_params.grammar
$2 = "alternative-0 ::= \"{\" space alternative-0-tool-call-kv \"}\" space\nalternative-0-tool-call ::= \nalternative-0-tool-call-kv ::= \"\\\"tool_call\\\"\" space \":\" space alternative-0-tool-call\nalternative-1 ::= \""...
Perhaps the grammar should be conditioned on the json_schema:
if (inputs.json_schema == nullptr) {
    llama_params["grammar"] = chat_params.grammar;
    llama_params["grammar_lazy"] = chat_params.grammar_lazy;
    auto grammar_triggers = json::array();
    for (const auto & trigger : chat_params.grammar_triggers) {
        grammar_triggers.push_back({
            {"word", trigger.word},
            {"at_start", trigger.at_start},
        });
    }
    llama_params["grammar_triggers"] = grammar_triggers;
}
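One small note on the condition above: assuming inputs.json_schema is an nlohmann::json value (which the assignment shown earlier suggests), inputs.json_schema.is_null() would express "no schema was supplied" a bit more explicitly, though comparing against nullptr behaves the same way.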
I haven't gone through this code before, so I'm not sure this is the correct fix, but I'll open a PR with the suggestion so others can weigh in.
@danbev I tried building llama.cpp from your branch locally and tested it, but it seems that now both the tools and the response_format are ignored by the model when the --jinja flag is used.
I am using the same https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B model for all of these, and the server is running with this command (except for one sample):
llama-server -m Hermes-3-Llama-3.1-8B.Q4_K_M.gguf -a hermes-3-llama-3.1-8b --port 1234 --jinja -fa
If I add both response_format and tools:
Request:
curl --location 'http://127.0.0.1:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello, can you tell me the current weather in New York?"
}
],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "chat_response",
"strict": true,
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
},
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current temperature for a given location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and country e.g. Bogotá, Colombia"
}
},
"required": [
"location"
],
"additionalProperties": false
},
"strict": true
}
}
]
}'
Response:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "The current temperature in New York is 68°F (20°C) with partly cloudy skies. The wind is blowing at 6 mph with a humidity of 50%. It feels like 65°F (18°C)."
}
}
],
"created": 1739758241,
"model": "hermes-3-llama-3.1-8b",
"system_fingerprint": "b4714-1c9bd941",
"object": "chat.completion",
"usage": {
"completion_tokens": 53,
"prompt_tokens": 242,
"total_tokens": 295
},
"id": "chatcmpl-QaAlEpc4cvVciz3MDcJFg5bfWN2XyZYT",
"timings": {
"prompt_n": 1,
"prompt_ms": 569.15,
"prompt_per_token_ms": 569.15,
"prompt_per_second": 1.7570060616709129,
"predicted_n": 53,
"predicted_ms": 25060.5,
"predicted_per_token_ms": 472.83962264150944,
"predicted_per_second": 2.1148819855948604
}
}
The model just hallucinates and responds without calling the tool.
If I only add response_format (with --jinja flag added)
Request:
curl --location 'http://127.0.0.1:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "chat_response",
"strict": true,
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}
}'
Response:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I assist you today?"
}
}
],
"created": 1739758800,
"model": "hermes-3-llama-3.1-8b",
"system_fingerprint": "b4714-1c9bd941",
"object": "chat.completion",
"usage": {
"completion_tokens": 17,
"prompt_tokens": 44,
"total_tokens": 61
},
"id": "chatcmpl-xNEddsFUUKL6dKhNc1CLqw8fMTBWTPBt",
"timings": {
"prompt_n": 1,
"prompt_ms": 464.608,
"prompt_per_token_ms": 464.608,
"prompt_per_second": 2.1523520903643503,
"predicted_n": 17,
"predicted_ms": 7200.607,
"predicted_per_token_ms": 423.5651176470588,
"predicted_per_second": 2.36091207310717
}
}
The model obviously can't call any tools, but the response IS NOT using the requested format.
If I only add response_format (without --jinja flag added)
Request:
curl --location 'http://127.0.0.1:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "chat_response",
"strict": true,
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}
}'
Response:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "{ \"response\": \"Hello! How can I assist you today?\" }"
}
}
],
"created": 1739759039,
"model": "hermes-3-llama-3.1-8b",
"system_fingerprint": "b4714-1c9bd941",
"object": "chat.completion",
"usage": {
"completion_tokens": 16,
"prompt_tokens": 10,
"total_tokens": 26
},
"id": "chatcmpl-WJfSYeUk6cytAZCq1uWQ38VPkyw921Nh",
"timings": {
"prompt_n": 10,
"prompt_ms": 1666.412,
"prompt_per_token_ms": 166.6412,
"prompt_per_second": 6.000916940108448,
"predicted_n": 16,
"predicted_ms": 7331.583,
"predicted_per_token_ms": 458.2239375,
"predicted_per_second": 2.1823390664744573
}
}
The model obviously can't call any tools, but the response IS using the requested format.
If I only add tools
Request:
curl --location 'http://127.0.0.1:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello, can you tell me the current weather in New York?"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current temperature for a given location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and country e.g. Bogotá, Colombia"
}
},
"required": [
"location"
],
"additionalProperties": false
},
"strict": true
}
}
]
}'
Response:
{
"choices": [
{
"finish_reason": "tool_calls",
"index": 0,
"message": {
"role": "assistant",
"content": null,
"tool_calls": [
{
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"location\":\"New York, USA\"}"
},
"id": ""
}
]
}
}
],
"created": 1739758968,
"model": "hermes-3-llama-3.1-8b",
"system_fingerprint": "b4714-1c9bd941",
"object": "chat.completion",
"usage": {
"completion_tokens": 36,
"prompt_tokens": 242,
"total_tokens": 278
},
"id": "chatcmpl-HFubvobm8fmhG1rBX9XXL3la1fBju80h",
"timings": {
"prompt_n": 209,
"prompt_ms": 29640.877,
"prompt_per_token_ms": 141.82237799043062,
"prompt_per_second": 7.0510734213431,
"predicted_n": 36,
"predicted_ms": 16525.598,
"predicted_per_token_ms": 459.04438888888893,
"predicted_per_second": 2.178438565430431
}
}
The model performs the tool call as requested.
@tulang3587 Sorry, I think I might have misled you; the "fix" I proposed above does not seem to be correct. However, there is an open PR which mentions this:
Fixed & tested --jinja w/o tool call w/ grammar or json_schema
~I've not had time to try it out yet, but would you be able to see if it addresses your issue?~ I tried out the PR and the original "Either "json_schema" or "grammar" can be specified, but not both" error is no longer present.
Not related to this issue, but I noticed that there is a specific chat template for the model you are using which might be useful:
--chat-template-file models/templates/NousResearch-Hermes-3-Llama-3.1-8B-tool_use.jinja
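For reference, combining that template file with the command line used throughout this issue would look something like the following (model, alias, and port taken from the examples above, so substitute your own):

llama-server -m Hermes-3-Llama-3.1-8B.Q4_K_M.gguf -a hermes-3-llama-3.1-8b --port 1234 --jinja -fa --chat-template-file models/templates/NousResearch-Hermes-3-Llama-3.1-8B-tool_use.jinja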
@danbev I see the PR has been merged into master, and I just tested the latest build (https://github.com/ggml-org/llama.cpp/releases/tag/b4739). It looks good so far, so let me close this issue.
llama-server -m Hermes-3-Llama-3.1-8B.Q4_K_M.gguf -a hermes-3-llama-3.1-8b --port 1234 --jinja -fa
If a tool is needed, the model returns the tool_call as expected.
Request:
curl --location 'http://127.0.0.1:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello, can you tell me the current weather in New York?"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current temperature for a given location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and country e.g. Bogotá, Colombia"
}
},
"required": [
"location"
],
"additionalProperties": false
},
"strict": true
}
}
],
"response_format": {
"type": "json_object",
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}'
Response:
{
"choices": [
{
"finish_reason": "tool_calls",
"index": 0,
"message": {
"role": "assistant",
"content": null,
"tool_calls": [
{
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"location\":\"New York, USA\"}"
},
"id": ""
}
]
}
}
],
"created": 1739930518,
"model": "hermes-3-llama-3.1-8b",
"system_fingerprint": "b4739-63e489c0",
"object": "chat.completion",
"usage": {
"completion_tokens": 36,
"prompt_tokens": 236,
"total_tokens": 272
},
"id": "chatcmpl-TE1CyOCGjIUhugis6urAnjhkEPdMTacw",
"timings": {
"prompt_n": 17,
"prompt_ms": 2903.262,
"prompt_per_token_ms": 170.78011764705883,
"prompt_per_second": 5.855482557206342,
"predicted_n": 36,
"predicted_ms": 16253.202,
"predicted_per_token_ms": 451.4778333333333,
"predicted_per_second": 2.21494816836707
}
}
If no tool is needed, response_format works fine.
Request:
curl --location 'http://127.0.0.1:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current temperature for a given location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and country e.g. Bogotá, Colombia"
}
},
"required": [
"location"
],
"additionalProperties": false
},
"strict": true
}
}
],
"response_format": {
"type": "json_object",
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}'
Response:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "{\n \"response\": \"Hello! How can I assist you today?\"\n}"
}
}
],
"created": 1739930409,
"model": "hermes-3-llama-3.1-8b",
"system_fingerprint": "b4739-63e489c0",
"object": "chat.completion",
"usage": {
"completion_tokens": 24,
"prompt_tokens": 224,
"total_tokens": 248
},
"id": "chatcmpl-cC7Wl3HRlTPJhhba3Lwvmp9bUmP9Fv9t",
"timings": {
"prompt_n": 222,
"prompt_ms": 30874.58,
"prompt_per_token_ms": 139.07468468468468,
"prompt_per_second": 7.190381213282901,
"predicted_n": 24,
"predicted_ms": 10396.267,
"predicted_per_token_ms": 433.1777916666667,
"predicted_per_second": 2.3085209335235426
}
}
If no tool is given, response_format also works fine.
Request:
curl --location 'http://127.0.0.1:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"response_format": {
"type": "json_object",
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}'
Response:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "{\n \"response\": \"Hello! How can I assist you today? Feel free to ask me anything you'd like help with, and I'll do my best to provide a helpful response. Whether it's a question, a task you need assistance with, or just general conversation, I'm here to help in any way I can. Don't hesitate to let me know what's on your mind!\"\n}"
}
}
],
"created": 1739930309,
"model": "hermes-3-llama-3.1-8b",
"system_fingerprint": "b4739-63e489c0",
"object": "chat.completion",
"usage": {
"completion_tokens": 83,
"prompt_tokens": 10,
"total_tokens": 93
},
"id": "chatcmpl-FfBNMp6abV6hlXUIaXd8DU7WmN0gOEcX",
"timings": {
"prompt_n": 10,
"prompt_ms": 1561.899,
"prompt_per_token_ms": 156.1899,
"prompt_per_second": 6.402462643231093,
"predicted_n": 83,
"predicted_ms": 36502.426,
"predicted_per_token_ms": 439.788265060241,
"predicted_per_second": 2.2738214714824707
}
}
Without the --jinja flag, response_format also works fine.
Request:
curl --location 'http://127.0.0.1:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"response_format": {
"type": "json_object",
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}'
Response:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "{\n \"response\": \"Hello! How can I assist you today? Feel free to ask me anything you'd like help with, and I'll do my best to provide a helpful response. Whether it's general knowledge, specific topics, or creative writing, I'm here to help however I can.\"\n}"
}
}
],
"created": 1739930230,
"model": "hermes-3-llama-3.1-8b",
"system_fingerprint": "b4739-63e489c0",
"object": "chat.completion",
"usage": {
"completion_tokens": 63,
"prompt_tokens": 10,
"total_tokens": 73
},
"id": "chatcmpl-ArH0inP8QwjVptveK9qaEKhsBoVJWbfg",
"timings": {
"prompt_n": 1,
"prompt_ms": 565.07,
"prompt_per_token_ms": 565.07,
"prompt_per_second": 1.769692250517635,
"predicted_n": 63,
"predicted_ms": 27290.53,
"predicted_per_token_ms": 433.18301587301585,
"predicted_per_second": 2.3084930926588823
}
}
@danbev I'm not reopening this issue since this works now, but I want to note that the response_format values that actually work don't exactly match the OpenAI and llama.cpp documentation.
This makes the model respond in JSON, but without using the defined schema.
Request:
"response_format": {
"type": "json_object",
"json_schema": {
"name": "something",
"strict": true,
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}
Response:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "{\n \"text\": \"Hello! How can I assist you today?\"\n}"
}
}
],
...
}
This just returns standard text.
Request:
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "something",
"strict": true,
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}
Response:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I assist you today?"
}
}
],
...
}
This just returns standard text.
Request:
"response_format": {
"type": "json_schema",
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
Response:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I assist you today?"
}
}
],
...
}
This request succeeds with the formatted response:
Request:
"response_format": {
"type": "json_object",
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
Response:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "{\n \"response\": \"Hello! How can I assist you today?\"\n}"
}
}
],
...
}
Yes, in b4739, response_format with type "json_schema" returned unformatted text.