Misc. bug: "response_format" issue on the OpenAI-compatible "v1/chat/completions" endpoint
Name and Version
>llama-server --version
version: 4689 (90e4dba4)
built with MSVC 19.42.34436.0 for x64
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama-server -m Hermes-3-Llama-3.1-8B.Q4_K_M.gguf -a hermes-3-llama-3.1-8b --port 1234 --jinja -fa
Problem description & steps to reproduce
Using "response_format" to get the structured output doesn't seem to work properly when using the OpenAI compatible "v1/chat/completions" API. It keeps returning the "Either "json_schema" or "grammar" can be specified, but not both" error message.
I've tried using several different models from HF, and this issue happens no matter which model I loaded. The model that I used in the below samples are this one https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B
Request:
curl --location 'http://localhost:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Cookie: frontend_lang=en_US' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "chat_response",
"strict": true,
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}
}'
Response:
{
"error": {
"code": 400,
"message": "Either \"json_schema\" or \"grammar\" can be specified, but not both",
"type": "invalid_request_error"
}
}
I've tried changing response_format to various values, like the ones below, but it keeps returning the same error.
"response_format": {
"type": "json_schema", // either "json_schema" or "json_object" shows the same error
"json_schema": {
"name": "chat_response",
"strict": true,
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}
"response_format": {
"type": "json_schema", // either "json_schema" or "json_object" shows the same error
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
Even using the one in the documentation ({"type": "json_object"}) returns the same error:
{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"response_format": {"type": "json_object"}
}
In addition, I tried the POST /completion API with the same GGUF model, and it is able to return output following the defined JSON schema:
Request:
curl --location 'http://localhost:1234/completions' \
--header 'Content-Type: application/json' \
--header 'Cookie: frontend_lang=en_US' \
--data '{
"prompt": "<|im_start|>user\nhello<|im_end|>",
"json_schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}'
Response:
{
"index": 0,
"content": "{\n \"response\": \"Hello! How can I assist you today?\"\n}",
"tokens": [],
"id_slot": 0,
"stop": true,
"model": "hermes-3-llama-3.1-8b",
"tokens_predicted": 17,
"tokens_evaluated": 6,
"generation_settings": {
"n_predict": -1,
"seed": 4294967295,
"temperature": 0.800000011920929,
"dynatemp_range": 0.0,
"dynatemp_exponent": 1.0,
"top_k": 40,
"top_p": 0.949999988079071,
"min_p": 0.05000000074505806,
"xtc_probability": 0.0,
"xtc_threshold": 0.10000000149011612,
"typical_p": 1.0,
"repeat_last_n": 64,
"repeat_penalty": 1.0,
"presence_penalty": 0.0,
"frequency_penalty": 0.0,
"dry_multiplier": 0.0,
"dry_base": 1.75,
"dry_allowed_length": 2,
"dry_penalty_last_n": 4096,
"dry_sequence_breakers": [
"\n",
":",
"\"",
"*"
],
"mirostat": 0,
"mirostat_tau": 5.0,
"mirostat_eta": 0.10000000149011612,
"stop": [],
"max_tokens": -1,
"n_keep": 0,
"n_discard": 0,
"ignore_eos": false,
"stream": false,
"logit_bias": [],
"n_probs": 0,
"min_keep": 0,
"grammar": "char ::= [^\"\\\\\\x7F\\x00-\\x1F] | [\\\\] ([\"\\\\bfnrt] | \"u\" [0-9a-fA-F]{4})\nresponse-kv ::= \"\\\"response\\\"\" space \":\" space string\nroot ::= \"{\" space response-kv \"}\" space\nspace ::= | \" \" | \"\\n\" [ \\t]{0,20}\nstring ::= \"\\\"\" char* \"\\\"\" space\n",
"grammar_trigger_words": [],
"grammar_trigger_tokens": [],
"preserved_tokens": [],
"samplers": [
"penalties",
"dry",
"top_k",
"typ_p",
"top_p",
"min_p",
"xtc",
"temperature"
],
"speculative.n_max": 16,
"speculative.n_min": 5,
"speculative.p_min": 0.8999999761581421,
"timings_per_token": false,
"post_sampling_probs": false,
"lora": []
},
"prompt": "<|begin_of_text|><|im_start|>user\nhello<|im_end|>",
"has_new_line": true,
"truncated": false,
"stop_type": "eos",
"stopping_word": "",
"tokens_cached": 22,
"timings": {
"prompt_n": 6,
"prompt_ms": 1098.932,
"prompt_per_token_ms": 183.15533333333335,
"prompt_per_second": 5.459846469117288,
"predicted_n": 17,
"predicted_ms": 7322.017,
"predicted_per_token_ms": 430.7068823529412,
"predicted_per_second": 2.3217646175910276
}
}
First Bad Commit
No response
Relevant log output
Can you try this without using the --jinja flag when starting the server?
Without the --jinja flag it seems to work.
Request:
curl --location 'http://localhost:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Cookie: frontend_lang=en_US' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"response_format": {
"type": "json_object",
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}'
Response:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"content": "{\n \"response\": \"Hello! How can I assist you today?\"\n}",
"tool_calls": null,
"role": "assistant"
}
}
],
"created": 1739497907,
"model": "hermes-3-llama-3.1-8b",
"system_fingerprint": "b4689-90e4dba4",
"object": "chat.completion",
"usage": {
"completion_tokens": 17,
"prompt_tokens": 10,
"total_tokens": 27
},
"id": "chatcmpl-NSUouLxi9bvdjdnqwh7o1DLc4UmoCzsV",
"timings": {
"prompt_n": 1,
"prompt_ms": 624.063,
"prompt_per_token_ms": 624.063,
"prompt_per_second": 1.6024023215604835,
"predicted_n": 17,
"predicted_ms": 10365.442,
"predicted_per_token_ms": 609.7318823529412,
"predicted_per_second": 1.6400651318101052
}
}
However, without --jinja I now can't include any tools in the request.
Request:
curl --location 'http://localhost:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Cookie: frontend_lang=en_US' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello, can you tell me the current weather in New York?"
}
],
"response_format": {
"type": "json_object",
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
},
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current temperature for a given location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and country e.g. Bogotá, Colombia"
}
},
"required": [
"location"
],
"additionalProperties": false
},
"strict": true
}
}
]
}'
Response:
{
"error": {
"code": 500,
"message": "tools param requires --jinja flag",
"type": "server_error"
}
}
Is there any way to use both functionalities with the OpenAI-compatible chat completion API?
Is there any way to use both functionalities with the OpenAI-compatible chat completion API?
I think this might be a bug and I'm looking into it.
If we look at how this request is processed on the server, we find this handler:
const auto handle_chat_completions = [&ctx_server, &params, &res_error, &handle_completions_impl](const httplib::Request & req, httplib::Response & res) {
    LOG_DBG("request: %s\n", req.body.c_str());
    if (ctx_server.params_base.embedding) {
        res_error(res, format_error_response("This server does not support completions. Start it without `--embeddings`", ERROR_TYPE_NOT_SUPPORTED));
        return;
    }

    auto body = json::parse(req.body);
    json data = oaicompat_completion_params_parse(body, params.use_jinja, params.reasoning_format, ctx_server.chat_templates);

    return handle_completions_impl(
        SERVER_TASK_TYPE_COMPLETION,
        data,
        req.is_connection_closed,
        res,
        OAICOMPAT_TYPE_CHAT);
};
We can inspect the body from the request:
(gdb) pjson body
{
"model": "llama-2-7b-chat",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "chat_response",
"strict": true,
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}
}
This looks fine, and there is no grammar attribute in the body.
Next we have the call to:
json data = oaicompat_completion_params_parse(body, params.use_jinja, params.reasoning_format, ctx_server.chat_templates);
If we inspect the data after this call, we do see a grammar attribute:
(gdb) pjson data | shell jq
{
"stop": [],
"json_schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
},
"chat_format": 1,
"prompt": "<|im_start|>system\nRespond in JSON format, either with `tool_call` (a request to call tools) or with `response` reply to the user's request<|im_end|>\n<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\n",
"grammar": "alternative-0 ::= \"{\" space alternative-0-tool-call-kv \"}\" space\nalternative-0-tool-call ::= \nalternative-0-tool-call-kv ::= \"\\\"tool_call\\\"\" space \":\" space alternative-0-tool-call\nalternative-1 ::= \"{\" space alternative-1-response-kv \"}\" space\nalternative-1-response ::= \"{\" space alternative-1-response-response-kv \"}\" space\nalternative-1-response-kv ::= \"\\\"response\\\"\" space \":\" space alternative-1-response\nalternative-1-response-response-kv ::= \"\\\"response\\\"\" space \":\" space string\nchar ::= [^\"\\\\\\x7F\\x00-\\x1F] | [\\\\] ([\"\\\\bfnrt] | \"u\" [0-9a-fA-F]{4})\nroot ::= alternative-0 | alternative-1\nspace ::= | \" \" | \"\\n\" [ \\t]{0,20}\nstring ::= \"\\\"\" char* \"\\\"\" space\n",
"grammar_lazy": false,
"grammar_triggers": [],
"preserved_tokens": [],
"model": "llama-2-7b-chat",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "chat_response",
"strict": true,
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}
}
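The top-level "json_schema" entry in data comes from the handling of "response_format" earlier in the parsing. Roughly, that mapping looks like this (my paraphrase of the request parsing, not a verbatim copy of the llama.cpp source, so treat the exact branches as an assumption):

// Paraphrased sketch: the OpenAI-style "response_format" field is mapped onto
// the server's native "json_schema" parameter before the chat template is applied.
if (body.contains("response_format")) {
    json response_format      = json_value(body, "response_format", json::object());
    std::string response_type = json_value(response_format, "type", std::string());
    if (response_type == "json_object") {
        // {"type": "json_object", "schema": {...}}
        llama_params["json_schema"] = json_value(response_format, "schema", json::object());
    } else if (response_type == "json_schema") {
        // OpenAI-style {"type": "json_schema", "json_schema": {"schema": {...}}}
        json json_schema = json_value(response_format, "json_schema", json::object());
        llama_params["json_schema"] = json_value(json_schema, "schema", json::object());
    } else if (!response_type.empty() && response_type != "text") {
        throw std::runtime_error("response_format type must be one of \"text\" or \"json_object\", but got: " + response_type);
    }
}

The "grammar" entry, on the other hand, is added by the jinja code path shown below.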
If we look in oaicompat_completion_params_parse we can see the following:
// Apply chat template to the list of messages
if (use_jinja) {
    ...
    // TODO: support mixing schema w/ tools beyond generic format.
    inputs.json_schema = json_value(llama_params, "json_schema", json());

    auto chat_params = common_chat_params_init(tmpl, inputs);

    llama_params["chat_format"] = static_cast<int>(chat_params.format);
    llama_params["prompt"] = chat_params.prompt;
    llama_params["grammar"] = chat_params.grammar;
    llama_params["grammar_lazy"] = chat_params.grammar_lazy;
    auto grammar_triggers = json::array();
    for (const auto & trigger : chat_params.grammar_triggers) {
        grammar_triggers.push_back({
            {"word", trigger.word},
            {"at_start", trigger.at_start},
        });
    }
    llama_params["grammar_triggers"] = grammar_triggers;
And if we inspect the chat_params we can see that the grammar attribute is there:
(gdb) p chat_params.grammar
$2 = "alternative-0 ::= \"{\" space alternative-0-tool-call-kv \"}\" space\nalternative-0-tool-call ::= \nalternative-0-tool-call-kv ::= \"\\\"tool_call\\\"\" space \":\" space alternative-0-tool-call\nalternative-1 ::= \""...
Perhaps the grammar should be conditioned on the json_schema:
if (inputs.json_schema == nullptr) {
    llama_params["grammar"] = chat_params.grammar;
    llama_params["grammar_lazy"] = chat_params.grammar_lazy;
    auto grammar_triggers = json::array();
    for (const auto & trigger : chat_params.grammar_triggers) {
        grammar_triggers.push_back({
            {"word", trigger.word},
            {"at_start", trigger.at_start},
        });
    }
    llama_params["grammar_triggers"] = grammar_triggers;
}
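One small note on the condition above: assuming inputs.json_schema is an nlohmann::json value (which the assignment shown earlier suggests), inputs.json_schema.is_null() would express "no schema was supplied" a bit more explicitly, though comparing against nullptr behaves the same way.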
I haven't gone through this code before, so I'm not sure this is the correct fix, but I'll open a PR with the suggestion so others can weigh in.
@danbev I tried building llama.cpp from your branch locally and tested it, but it seems that now both the tools and the response_format are ignored by the model when the --jinja flag is used.
I am using the same https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B model for all of these, and the server is running with this command (except for one sample):
llama-server -m Hermes-3-Llama-3.1-8B.Q4_K_M.gguf -a hermes-3-llama-3.1-8b --port 1234 --jinja -fa
If I add both response_format and tools:
Request:
curl --location 'http://127.0.0.1:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello, can you tell me the current weather in New York?"
}
],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "chat_response",
"strict": true,
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
},
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current temperature for a given location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and country e.g. Bogotá, Colombia"
}
},
"required": [
"location"
],
"additionalProperties": false
},
"strict": true
}
}
]
}'
Response:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "The current temperature in New York is 68°F (20°C) with partly cloudy skies. The wind is blowing at 6 mph with a humidity of 50%. It feels like 65°F (18°C)."
}
}
],
"created": 1739758241,
"model": "hermes-3-llama-3.1-8b",
"system_fingerprint": "b4714-1c9bd941",
"object": "chat.completion",
"usage": {
"completion_tokens": 53,
"prompt_tokens": 242,
"total_tokens": 295
},
"id": "chatcmpl-QaAlEpc4cvVciz3MDcJFg5bfWN2XyZYT",
"timings": {
"prompt_n": 1,
"prompt_ms": 569.15,
"prompt_per_token_ms": 569.15,
"prompt_per_second": 1.7570060616709129,
"predicted_n": 53,
"predicted_ms": 25060.5,
"predicted_per_token_ms": 472.83962264150944,
"predicted_per_second": 2.1148819855948604
}
}
The model just hallucinates and responds without calling the tool.
If I only add response_format (with --jinja flag added)
Request:
curl --location 'http://127.0.0.1:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "chat_response",
"strict": true,
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}
}'
Response:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I assist you today?"
}
}
],
"created": 1739758800,
"model": "hermes-3-llama-3.1-8b",
"system_fingerprint": "b4714-1c9bd941",
"object": "chat.completion",
"usage": {
"completion_tokens": 17,
"prompt_tokens": 44,
"total_tokens": 61
},
"id": "chatcmpl-xNEddsFUUKL6dKhNc1CLqw8fMTBWTPBt",
"timings": {
"prompt_n": 1,
"prompt_ms": 464.608,
"prompt_per_token_ms": 464.608,
"prompt_per_second": 2.1523520903643503,
"predicted_n": 17,
"predicted_ms": 7200.607,
"predicted_per_token_ms": 423.5651176470588,
"predicted_per_second": 2.36091207310717
}
}
The model obviously can't call any tools, but the response IS NOT using the requested format.
If I only add response_format (without --jinja flag added)
Request:
curl --location 'http://127.0.0.1:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "chat_response",
"strict": true,
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}
}'
Response:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "{ \"response\": \"Hello! How can I assist you today?\" }"
}
}
],
"created": 1739759039,
"model": "hermes-3-llama-3.1-8b",
"system_fingerprint": "b4714-1c9bd941",
"object": "chat.completion",
"usage": {
"completion_tokens": 16,
"prompt_tokens": 10,
"total_tokens": 26
},
"id": "chatcmpl-WJfSYeUk6cytAZCq1uWQ38VPkyw921Nh",
"timings": {
"prompt_n": 10,
"prompt_ms": 1666.412,
"prompt_per_token_ms": 166.6412,
"prompt_per_second": 6.000916940108448,
"predicted_n": 16,
"predicted_ms": 7331.583,
"predicted_per_token_ms": 458.2239375,
"predicted_per_second": 2.1823390664744573
}
}
The model obviously can't call any tools, but the response IS using the requested format.
If I only add tools
Request:
curl --location 'http://127.0.0.1:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello, can you tell me the current weather in New York?"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current temperature for a given location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and country e.g. Bogotá, Colombia"
}
},
"required": [
"location"
],
"additionalProperties": false
},
"strict": true
}
}
]
}'
Response:
{
"choices": [
{
"finish_reason": "tool_calls",
"index": 0,
"message": {
"role": "assistant",
"content": null,
"tool_calls": [
{
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"location\":\"New York, USA\"}"
},
"id": ""
}
]
}
}
],
"created": 1739758968,
"model": "hermes-3-llama-3.1-8b",
"system_fingerprint": "b4714-1c9bd941",
"object": "chat.completion",
"usage": {
"completion_tokens": 36,
"prompt_tokens": 242,
"total_tokens": 278
},
"id": "chatcmpl-HFubvobm8fmhG1rBX9XXL3la1fBju80h",
"timings": {
"prompt_n": 209,
"prompt_ms": 29640.877,
"prompt_per_token_ms": 141.82237799043062,
"prompt_per_second": 7.0510734213431,
"predicted_n": 36,
"predicted_ms": 16525.598,
"predicted_per_token_ms": 459.04438888888893,
"predicted_per_second": 2.178438565430431
}
}
The model performs the tool call as requested.
@tulang3587 Sorry, I think I might have misled you; the "fix" I proposed above does not seem to be correct. However, there is an open PR which mentions this:
Fixed & tested --jinja w/o tool call w/ grammar or json_schema
~I've not had time to try it out yet, but would you be able to see if it addresses your issue?~ I tried out the PR and the original "Either "json_schema" or "grammar" can be specified, but not both" error is no longer present.
Not related to this issue, but I noticed that there is a specific chat template for the model you are using which might be useful:
--chat-template-file models/templates/NousResearch-Hermes-3-Llama-3.1-8B-tool_use.jinja
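For reference, combining that template file with the command line used throughout this issue would look something like the following (model, alias, and port taken from the examples above, so substitute your own):

llama-server -m Hermes-3-Llama-3.1-8B.Q4_K_M.gguf -a hermes-3-llama-3.1-8b --port 1234 --jinja -fa --chat-template-file models/templates/NousResearch-Hermes-3-Llama-3.1-8B-tool_use.jinja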
@danbev I see the PR has been merged into master, and I just tested the latest build (https://github.com/ggml-org/llama.cpp/releases/tag/b4739). It looks good so far, so let me close this issue.
llama-server -m Hermes-3-Llama-3.1-8B.Q4_K_M.gguf -a hermes-3-llama-3.1-8b --port 1234 --jinja -fa
If a tool is needed, the model returns the tool_call as expected.
Request:
curl --location 'http://127.0.0.1:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello, can you tell me the current weather in New York?"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current temperature for a given location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and country e.g. Bogotá, Colombia"
}
},
"required": [
"location"
],
"additionalProperties": false
},
"strict": true
}
}
],
"response_format": {
"type": "json_object",
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}'
Response:
{
"choices": [
{
"finish_reason": "tool_calls",
"index": 0,
"message": {
"role": "assistant",
"content": null,
"tool_calls": [
{
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"location\":\"New York, USA\"}"
},
"id": ""
}
]
}
}
],
"created": 1739930518,
"model": "hermes-3-llama-3.1-8b",
"system_fingerprint": "b4739-63e489c0",
"object": "chat.completion",
"usage": {
"completion_tokens": 36,
"prompt_tokens": 236,
"total_tokens": 272
},
"id": "chatcmpl-TE1CyOCGjIUhugis6urAnjhkEPdMTacw",
"timings": {
"prompt_n": 17,
"prompt_ms": 2903.262,
"prompt_per_token_ms": 170.78011764705883,
"prompt_per_second": 5.855482557206342,
"predicted_n": 36,
"predicted_ms": 16253.202,
"predicted_per_token_ms": 451.4778333333333,
"predicted_per_second": 2.21494816836707
}
}
If no tool is needed, response_format works fine.
Request:
curl --location 'http://127.0.0.1:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current temperature for a given location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and country e.g. Bogotá, Colombia"
}
},
"required": [
"location"
],
"additionalProperties": false
},
"strict": true
}
}
],
"response_format": {
"type": "json_object",
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}'
Response:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "{\n \"response\": \"Hello! How can I assist you today?\"\n}"
}
}
],
"created": 1739930409,
"model": "hermes-3-llama-3.1-8b",
"system_fingerprint": "b4739-63e489c0",
"object": "chat.completion",
"usage": {
"completion_tokens": 24,
"prompt_tokens": 224,
"total_tokens": 248
},
"id": "chatcmpl-cC7Wl3HRlTPJhhba3Lwvmp9bUmP9Fv9t",
"timings": {
"prompt_n": 222,
"prompt_ms": 30874.58,
"prompt_per_token_ms": 139.07468468468468,
"prompt_per_second": 7.190381213282901,
"predicted_n": 24,
"predicted_ms": 10396.267,
"predicted_per_token_ms": 433.1777916666667,
"predicted_per_second": 2.3085209335235426
}
}
If no tool is given, response_format also works fine.
Request:
curl --location 'http://127.0.0.1:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"response_format": {
"type": "json_object",
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}'
Response:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "{\n \"response\": \"Hello! How can I assist you today? Feel free to ask me anything you'd like help with, and I'll do my best to provide a helpful response. Whether it's a question, a task you need assistance with, or just general conversation, I'm here to help in any way I can. Don't hesitate to let me know what's on your mind!\"\n}"
}
}
],
"created": 1739930309,
"model": "hermes-3-llama-3.1-8b",
"system_fingerprint": "b4739-63e489c0",
"object": "chat.completion",
"usage": {
"completion_tokens": 83,
"prompt_tokens": 10,
"total_tokens": 93
},
"id": "chatcmpl-FfBNMp6abV6hlXUIaXd8DU7WmN0gOEcX",
"timings": {
"prompt_n": 10,
"prompt_ms": 1561.899,
"prompt_per_token_ms": 156.1899,
"prompt_per_second": 6.402462643231093,
"predicted_n": 83,
"predicted_ms": 36502.426,
"predicted_per_token_ms": 439.788265060241,
"predicted_per_second": 2.2738214714824707
}
}
Without the --jinja flag, response_format also works fine.
Request:
curl --location 'http://127.0.0.1:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"response_format": {
"type": "json_object",
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}'
Response:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "{\n \"response\": \"Hello! How can I assist you today? Feel free to ask me anything you'd like help with, and I'll do my best to provide a helpful response. Whether it's general knowledge, specific topics, or creative writing, I'm here to help however I can.\"\n}"
}
}
],
"created": 1739930230,
"model": "hermes-3-llama-3.1-8b",
"system_fingerprint": "b4739-63e489c0",
"object": "chat.completion",
"usage": {
"completion_tokens": 63,
"prompt_tokens": 10,
"total_tokens": 73
},
"id": "chatcmpl-ArH0inP8QwjVptveK9qaEKhsBoVJWbfg",
"timings": {
"prompt_n": 1,
"prompt_ms": 565.07,
"prompt_per_token_ms": 565.07,
"prompt_per_second": 1.769692250517635,
"predicted_n": 63,
"predicted_ms": 27290.53,
"predicted_per_token_ms": 433.18301587301585,
"predicted_per_second": 2.3084930926588823
}
}
@danbev I'm not reopening this issue since this works now, but I want to note that the response_format values that actually work don't exactly match the OpenAI and llama.cpp documentation.
This makes the model respond in JSON, but without using the defined schema.
Request:
"response_format": {
"type": "json_object",
"json_schema": {
"name": "something",
"strict": true,
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}
Response:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "{\n \"text\": \"Hello! How can I assist you today?\"\n}"
}
}
],
...
}
This just returns standard text.
Request:
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "something",
"strict": true,
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}
Response:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I assist you today?"
}
}
],
...
}
This just returns standard text.
Request:
"response_format": {
"type": "json_schema",
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
Response:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I assist you today?"
}
}
],
...
}
This request succeeds with the formatted response:
Request:
"response_format": {
"type": "json_object",
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
Response:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "{\n \"response\": \"Hello! How can I assist you today?\"\n}"
}
}
],
...
}
Yes, in b4739, response_format with type "json_schema" returned unformatted text.