
`server`: streaming of tool calls and thoughts when `--jinja` is on

ochafik opened this pull request 9 months ago

This PR is still WIP (see TODOs at the bottom), but early feedback / testing is welcome

  • Support streaming of tool calls in OpenAI format
  • Improve handling of thinking models (DeepSeek R1 Distills, QwQ, Command R7B):
    • Stream <think> reasoning content inside the content (same output for all thinking models when using the default --reasoning-format deepseek, even for those not using the <think> syntax like Command R7B), and even if the <think> tag was added at the end of the prompt by the template (as for DeepSeek R1 & QwQ).
    • Avoid spurious lazy (tool call) grammar triggers from "thoughts about tool calls" (only trigger after closing any unclosed thoughts)
  • Improve Functionary v3.2 support (allow raw python code, preferred by models over {"code": "json-encoded code"} for multiline programs)
  • Support truncated outputs incl. reasoning_content & tool_calls (returns salvageable fields when finish_reason = length)

This fixes #12107, #10920, #11861

Follow up to https://github.com/ggml-org/llama.cpp/pull/9639

How to test / use

  • Get and build this PR's branch
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp
    git remote add ochafik https://github.com/ochafik/llama.cpp
    git fetch ochafik
    git checkout ochafik/tool-diffs
    cmake -B build -DLLAMA_CURL=1 # -DGGML_CUDA=1 ...
    cmake --build build -t llama-server --parallel --config Release
    alias llama-server=./build/bin/llama-server
    
  • Run llama-server w/ any model (see more details in the tool calling docs; note that some GGUFs require a chat template override!):

    # Thoughts of Command R7B / DeepSeek R1 / QwQ will be streamed in the content inside <think> tags
    llama-server --jinja -fa -hf bartowski/Qwen_QwQ-32B-GGUF
    
    # Models w/ generic tool call support now return clean interrupted output when hitting token limit
    llama-server --jinja -fa -hf bartowski/microsoft_Phi-4-mini-instruct-GGUF
    
    
  • Call the chat completions endpoint in streamed mode with any OpenAI-compatible library, or plain curl:

    curl http://localhost:8080/v1/chat/completions -d '{
      "model": "gpt-3.5-turbo",
      "tools": [
        {
          "type": "function",
          "function": {
            "name": "python",
            "description": "Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
            "parameters": {
              "type": "object",
              "properties": {
                "code": {
                  "type": "string",
                  "description": "The code to run in the ipython interpreter."
                }
              },
              "required": ["code"]
            }
          }
        }
      ],
      "messages": [
        {
          "role": "user",
          "content": "Print a hello world message with python."
        }
      ],
      "stream": true
    }'
    
  • You can also open http://localhost:8080/ to see thoughts being streamed back properly, even for models whose template adds an opening <think> tag at the end of the prompt (QwQ, and now DeepSeek R1 too, although most GGUFs still ship their initial version) and for models like Cohere Command R7B that natively use a different thinking-tag syntax (now normalized, since --reasoning-format deepseek is the default)

Context

Supporting OpenAI's streaming delta format was a bit tricky, as it returns chunks of JSON-encoded arguments for each function call, but that's not necessarily what models give us.

While tool calls are returned in a standard format (each with a function name, tool call id and JSON-encoded arguments), model outputs vary greatly in their syntax, which mostly, but not always, uses JSON for the arguments.

Function calls and their arguments can be at various levels:

  • JSON array of tool calls (e.g. Mistral Nemo: [TOOL_CALLS][{"name": "special_function", "arguments": {"arg1": 1}, "id": "123456789"}])
  • Standalone JSON tool call (e.g. Hermes syntax: <tool_call>{"name": "special_function", "arguments": {"arg1": 1}}</tool_call>; note that some models use other keys here, e.g. tool_name, parameters, and may have the tool call id too)
  • JSON arguments object w/ name in some prefix (e.g. Deepseek: <|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>special_function\n```json\n{"arg1": 1}\n```<|tool▁call▁end|><|tool▁calls▁end|>, or functionary v3.2: special_function\n{"arg1": 1})
  • Nested JSON for the generic mode {"tool_call": {"name": "special_function", "arguments": {"arg1": 1}}} (or inside tool_calls array if parallel_tool_calls is on)
  • No JSON / raw code string for python tool call, with two variants:
    • Unconstrained verbatim code: <|python_tag|>multiline python code here (functionary v3.1), python\nmultiline python code here (functionary v3.2; w/ prefix >>> if after textual response)
    • Constrained pythonish syntax for "builtin tools" (Llama 3.x, quite widespread): <|python_tag|>python.call(code="multiline\npython\ncode\nhere")

Side note about raw python code: <|python_tag|>foo.call(bar="baz") in Llama 3.x style will return "tool_calls": [{"name": "foo", "arguments": "{\"bar\": \"baz\"}"}], while the same output from Functionary would be parsed as "tool_calls": [{"name": "python", "arguments": "{\"code\": \"foo.call(bar=\\\"baz\\\")\"}"}].

Now when streaming, we may have sampled only a prefix of the aforementioned output, and we ideally want to parse what can be parsed out of it, and send a JSON-encoded arguments object that is cut at a safe place, so that the sum of all the deltas adds up to the full arguments JSON string.

(A primary use case for partial JSON arguments streaming is streaming large multiline diff tool arguments in tools such as RooCode / Cline / Cursor)

The cleanest option would have been to create a unified parser / state machine that can be drip-fed tokens, and preserve its state in the server slot. But I figured the complexity was too high for now (see notes on speeding up below), and instead I've implemented something definitely inefficient but relatively simple (chat.cpp is still roughly the same size): for every token coming in, I try and parse the entire output so far, with partial regex & json parsing support, which allows recovering cleanly cut-off JSON-encoded function arguments (regardless of the original format of said arguments). I then compare the full common_chat_msg against the last one we sent back, and compute OpenAI-compatible deltas out of this.
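To make the diffing step concrete, here is a minimal Python sketch of the idea (illustrative structures only, not the PR's actual C++ types): every step re-parses the full output into a message, and we emit only what changed since the message we last sent.

from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str = ""
    arguments: str = ""   # JSON-encoded, cut at a "safe" place when partial

def tool_call_deltas(prev: list[ToolCall], curr: list[ToolCall]) -> list[dict]:
    """Diff the message parsed from the full output so far against the one
    we last sent, producing OpenAI-style tool call deltas."""
    deltas = []
    for i, call in enumerate(curr):
        if i >= len(prev):
            # First time we see this call: send the (complete) name once.
            deltas.append({"index": i, "function": {"name": call.name,
                                                    "arguments": call.arguments}})
        else:
            if not call.arguments.startswith(prev[i].arguments):
                raise RuntimeError("invalid diff")  # what we already sent must stay a prefix
            suffix = call.arguments[len(prev[i].arguments):]
            if suffix:
                deltas.append({"index": i, "function": {"arguments": suffix}})
    return deltas

# The argument deltas concatenate back into the full JSON-encoded arguments:
prev = [ToolCall("python", '{"code": "print(')]
curr = [ToolCall("python", '{"code": "print(1)"}')]
print(tool_call_deltas(prev, curr))   # [{'index': 0, 'function': {'arguments': '1)"}'}]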

Location, location, location 🏡

Note that the output of the model may be truncated (max token output length reached or streaming in progress), and that may fall inside an expected literal (e.g. <think> isn't a single token on QwQ-32B), inside a regex (used for some matchers), or inside some JSON.

But more interesting is where it happens, esp. for partial JSON:

  • If it happens inside an arguments object or a content string (for the generic mode), we should return it partial / truncated (and json-dumped in the case of the arguments), and diffed from the last parsed value in the streamed case
  • If it happens inside the wrapper of the arguments, then it depends. We don't want to send a half-function name, but as soon as we have a complete function name we can send a diff. So we try and heal the JSON (we identify which json paths can be partially healed - because they're inside the arguments - and which ones must be dropped), and only populate a tool call once we have at least a name (see the sketch below). Likewise, if there is an array of function calls with the first complete and the next partial, we want to make sure the client can start calling the first function.

tests/test-chat-parser.cpp should make this a bit clearer, and I'm in the process of adding partial examples w/ the actual formats in tests/test-chat.cpp (look out for /* is_partial= */ true)
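To illustrate the healing idea, here is a much-reduced Python sketch (assumed behaviour for illustration only; the actual implementation is C++, SAX-based and path-aware): close a truncated JSON document so that it parses, planting a marker where the text was cut so the caller can later decide what to keep (e.g. inside tool call arguments) and what to drop (e.g. a half-parsed wrapper).

import json

def heal_partial_json(s: str, marker: str = "$HEAL$"):
    # Track open containers and whether we stopped inside a string.
    closers, in_string, escaped = [], False, False
    for ch in s:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            closers.append("}" if ch == "{" else "]")
        elif ch in "}]":
            closers.pop()
    tail = "".join(reversed(closers))
    # Try a few plausible completions of the cut-off point.
    if in_string:
        candidates = [s + marker + '"', s + marker + '": 1']
    else:
        candidates = [s, s + ' "' + marker + '"', s + ' "' + marker + '": 1']
    for cand in candidates:
        try:
            return json.loads(cand + tail)
        except json.JSONDecodeError:
            continue
    return None

print(heal_partial_json('{"name": "special_function", "arguments": {"arg1": 1, "arg2": "hel'))
# {'name': 'special_function', 'arguments': {'arg1': 1, 'arg2': 'hel$HEAL$'}}

The marker left in the healed value is what lets the caller find the cut afterwards: values under an arguments path keep the truncated text (marker stripped), anything else containing the marker is dropped.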

See examples of streamed tool call deltas
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-3.5-turbo",
    "tools": [
        {
        "type":"function",
        "function":{
            "name":"python",
            "description":"Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
            "parameters":{
            "type":"object",
            "properties":{
                "code":{
                "type":"string",
                "description":"The code to run in the ipython interpreter."
                }
            },
            "required":["code"]
            }
        }
        }
    ],
    "messages": [
        {
        "role": "user",
        "content": "Print a hello world message with python."
        }
    ], "stream": true
}'
data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","content":null,"tool_calls":[{"index":0,"id":"call_aqwOReHDKPnqiF7NbRxzDTY1","type":"function","function":{"name":"python","arguments":""}}],"refusal":null},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\""}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"code"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\":\""}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"print"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"('"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"Hello"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":","}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":" World"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"!"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"')"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\"}"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"tool_calls"}]}

data: [DONE]
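For reference, a chunk stream like the one above can be reassembled on the client side roughly as follows (a hypothetical helper, not part of this PR): take the function name from the first delta that carries it, concatenate the JSON-encoded argument fragments, and decode them once the stream ends.

import json

def accumulate_tool_calls(chunks):
    calls = {}
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        for tc in delta.get("tool_calls") or []:
            call = calls.setdefault(tc["index"], {"name": "", "arguments": ""})
            fn = tc.get("function") or {}
            if fn.get("name"):
                call["name"] = fn["name"]          # sent once, not appended
            call["arguments"] += fn.get("arguments") or ""
    # Decode the aggregated JSON arguments at the end of the stream.
    return [{"name": c["name"], "arguments": json.loads(c["arguments"])}
            for c in calls.values()]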

Implementation notes

Partial parsing utils

I added a common_chat_msg_parser utility with syntax reminiscent of @ngxson's suggestions in https://github.com/ggml-org/llama.cpp/pull/11607#issuecomment-2656147148, but relying on control flow to allow more flexibility:

  • Supports partial regex parsing
    • Since the STL still doesn't have partial matching support (unlike Boost), I had to implement my own in common_regex (see common/regex-partial.cpp).
    • The trick = transform the original regex to a regex that matches in reverse from the end of the string (e.g. /abc/ gives /((?:(?:c)?b)?a)[\s\S]*/, with a single capturing group whose end indicates - in reverse - where the partial match started); see the sketch after this list
  • Supports partial JSON parsing:
    • Used nlohmann/json's SAX interface to build location awareness / stack to know how to heal a JSON that fails to parse
    • Healing the JSON w/ a healing marker that can then be found when visiting the resulting JSON (to remove things we don't want to heal - e.g. function name - and cut any JSON encoded result at the "right" place, which must be somewhere inside function arguments: consume_json accepts a list of json paths under which to expect arguments objects; could be from the root = empty path if the entire json object is an arguments object)
  • Supports control flow w/ try_* parsing methods. This makes the code relatively easy to read and debug. No exotic syntax (apart from optionals, they really help here imho), which should make it easier to convert to coroutines when we wanna make it all incremental.
  • Supports full or partial parsing w/ same code (throws partial exceptions to interrupt the control flow without making parsing code more complex)
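As a rough Python illustration of the reverse-regex trick (literal strings only here; common_regex handles arbitrary regexes and is the actual implementation), the transformed pattern is applied to the reversed input, and the end of its capturing group tells us where a possibly-truncated trailing match begins:

import re

def reverse_partial_pattern(literal: str) -> str:
    # "abc" -> ((?:(?:c)?b)?a)[\s\S]*
    inner = re.escape(literal[-1])
    for ch in reversed(literal[:-1]):
        inner = f"(?:{inner})?{re.escape(ch)}"
    return f"({inner})[\\s\\S]*"

def partial_match_start(text: str, literal: str):
    # Index in `text` where a possibly-truncated trailing occurrence of
    # `literal` begins, or None if the text doesn't end with one.
    m = re.match(reverse_partial_pattern(literal), text[::-1])
    return len(text) - m.end(1) if m else None

print(reverse_partial_pattern("abc"))                 # ((?:(?:c)?b)?a)[\s\S]*
print(partial_match_start("hello <thi", "<think>"))   # 6
print(partial_match_start("hello", "<think>"))        # None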

This allows parsing of partial model outputs, whether in streaming mode or when reaching the token limit (currently, tool calls give ugly unparsed outputs when finish_reason != tool_call).

To think or not to think... what is the prompt?

I've also introduced common_chat_syntax, which wraps common_reasoning_format and common_chat_format together with:

  • thinking_forced_open: whether the prompt was detected to end w/ a (model-specific) <think> tag to force thinking mode
  • reasoning_in_content: whether the thinking tags should be left in the content, which is currently the case in streaming mode, matching what the DeepSeek API does.

This allows streaming back a standard <think>... syntax even for models that use a different set of tags (e.g. Command R7B). And of course, --reasoning-format none is still allowed to get the raw output.

Note: Ideally, we'd stream the thoughts as a reasoning_content delta (now trivial to implement), but for now we are just aiming for compatibility w/ DeepSeek's API (if --reasoning-format deepseek, which is the default).

Triggering thoughts 😓

I noticed DeepSeek R1 Qwen 7B sometimes obsesses over the tool call syntax and "thinks" about how it's gonna call it... which triggers the lazy grammars for said calls before the thoughts are closed.

To address this, I made it possible for common_chat_templates_apply to create trigger regexes that match on the entire output (this was already the case in the sampler). COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN_FULL (renamed from _START) is now expected to have a single capturing group from the start of which the grammar sampler will be activated.
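As an illustration only (the pattern below is made up, not the PR's actual trigger), a full-output trigger whose single capturing group marks where the lazy grammar should kick in, and which refuses to fire while a forced-open <think> block is still unclosed, could look like this:

import re

trigger = re.compile(r"</think>[\s\S]*?(<tool_call>)")

def grammar_activation_point(output: str):
    m = trigger.search(output)
    return m.start(1) if m else None

print(grammar_activation_point("<think>maybe I'll emit <tool_call>..."))  # None: thoughts unclosed
print(grammar_activation_point("<think>done</think>\n<tool_call>"))       # 20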

Functionary v3.2 w/ raw python

Ask bartowski/functionary-small-v3.2-GGUF:Q4_K_M to write a hello world in Python and it outputs python\n{"code": "print('hey')"}.

But ask it to print a hello world in python w/ matplotlib, and it uses its raw multiline python syntax python\nprint('hey')\n# many other lines. This is now supported.

TODOs

  • [x] Fix tool call id attribution logic (disabled for now) from https://github.com/ggml-org/llama.cpp/pull/12292
  • [x] Might need one last diff in the final response after a stream, say, to close any raw python code
  • [x] Decide what to do about logprobs for tools mode (right now, forbidden; we don't return diffs for every token, for instance if a function name is in multiple tokens we don't want to send its name in chunks)
    • Edit: OpenAI returns null logprobs in tool call mode. Just need to ensure normal mode doesn't regress (test failing atm)
  • [x] Fix Mistral Nemo crash (llama-server --jinja -fa -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q6_K_L)
  • [ ] Send partial regex (common_regex) as separate PR: https://github.com/ggml-org/llama.cpp/pull/12808
  • [ ] Send partial JSON (common_json) as separate PR(?) or fold into chat-parser.cpp
  • [ ] Command R7B's non-tool-calling template (they have 3 templates) forces <|START_RESPONSE|> at the end of the prompt. Output will contain an <|END_RESPONSE|> that needs handling (would fit nicely in new common_chat_syntax struct). Maybe combine w/ forced/disabled thinking modes as a follow up PR
  • [ ] Add some docs
  • [ ] Add more tests
  • [ ] Run scripts/tool_bench.sh to compare against master (+ compare timings)

Future follow ups:

  • To make this faster, I suggest two options:
    • Wait for the project to switch to C++20 & turn all the parser functions into resumable coroutines (feed them tokens and persist their state in the slot)
    • Only compute and send deltas after N milliseconds

cc/ @jpohhhh

ochafik avatar Mar 14 '25 04:03 ochafik

Supporting OpenAI's streaming delta format was a bit tricky, as it returns chunks of JSON-encoded arguments for each function call, but that's not necessarily what models give us.

IMO we don't necessarily have to emulate this in a very accurate manner. From the perspective of an app developer who uses the API, it's obvious that streaming the function arguments does not make sense, since the client cannot decode half-finished JSON anyway. More often than not, you'll find a code pattern in the client app which simply aggregates the chunks into a complete JSON string, then decodes it.

The stream: true option in the OAI-compat API is mostly used for receiving the answer in real time. For example, if I ask the LLM "what is the weather like?", I don't care whether the "get_weather" call is streamed or not, but what I do care about is that the final response "the weather is ..." can be streamed.

With this in mind, you could completely skip processing half-finished JSON to keep things simpler.

ngxson avatar Mar 15 '25 15:03 ngxson

The stream: true option in the OAI-compat API is mostly used for receiving the answer in real time. For example, if I ask the LLM "what is the weather like?", I don't care whether the "get_weather" call is streamed or not, but what I do care about is that the final response "the weather is ..." can be streamed.

@ngxson Agree it would have been simpler, but a prime use case I have in mind is to stream long arguments, e.g. file diffs in IDE plugins. In https://github.com/cline/cline/pull/1946 I disabled the streaming (and recomposed the output to what cline expected), but I have full working TS code to create streaming parsers of partial json within openai chunked responses, which I hope to contribute to llama.vscode and maybe RooCode (since Cline is a no).

ochafik avatar Mar 15 '25 15:03 ochafik

but a prime use case I have in mind is to stream long arguments, e.g. file diffs in IDE plugins

Ok interesting use case, I didn't know about this before.

In this case, maybe we can only support streaming for models that can output JSON natively? Personally, I'm still a bit doubtful about depending on the nlohmann::json SAX interface. This low-level parsing functionality can be powerful, but at the same time can be quite complicated to maintain.

Also, I still don't fully understand the case where we need to decode partial JSON. Could you give an example? (i.e. which model and which chat template?)

ngxson avatar Mar 15 '25 15:03 ngxson

Also I'm thinking, would it be simpler to first support streaming of non-function-call responses (ref my first comment), then add support for chunked arguments in a follow-up PR? This way it's easier to review and also gives you more time to experiment with that.

ngxson avatar Mar 15 '25 15:03 ngxson

In this case, maybe we can only support streaming for models that can output JSON natively?

@ngxson We'd still need some incremental json parsing support to know where the json ends, which is the reason I was using the SAX stuff until now.

Otherwise, I'd rather avoid coverage mismatch between streamed and non-streamed.

Personally, I'm still a bit doubtful about depending on the nlohmann::json SAX interface. This low-level parsing functionality can be powerful, but at the same time can be quite complicated to maintain.

Agree. For now it's just about okay, but I think having our own partial parser might just be less code overall / easier to maintain. This was me trying not to write a JSON parser by hand (which I did in the TS code I mentioned). Happy to revisit as a follow up.

Also, I still don't fully understand the case where we need to decode partial JSON. Could you give an example? (i.e. which model and which chat template?)

First off, the JSON we're trying to parse can be at various levels:

  • Inside an array of tool calls (e.g. Mistral Nemo: [TOOL_CALLS][{"name": "special_function", "arguments": {"arg1": 1}, "id": "123456789"}])
  • As a standalone tool call (e.g. Hermes syntax: <tool_call>{"name": "special_function", "arguments": {"arg1": 1}}</tool_call>; note that some models use other keys here, e.g. tool_name, parameters, and may have the tool call id too)
  • As just the arguments object (e.g. Deepseek: <|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>special_function\n```json\n{"arg1": 1}\n```<|tool▁call▁end|><|tool▁calls▁end|>, or functionary v3.2: special_function\n{"arg1": 1})
  • Deep inside more JSON for the generic mode {"tool_call": {"name": "special_function", "arguments": {"arg1": 1}}} (or inside tool_calls array if parallel_tool_calls is on)
  • And "implied" / no JSON for Python code, with two variants:
    • Unconstrained verbatim code: <|python_tag|>multiline python code here (functionary v3.1), python\nmultiline python code here (functionary v3.2; w/ prefix >>> if after textual response)
    • Constrained pythonish syntax for "builtin tools" (Llama 3.x, quite widespread): <|python_tag|>python.call(code="multiline\npython\ncode\nhere")

As for partial JSON, it can happen because we reached the token limit (newly supported!), or because we're in streaming mode.

But more interesting is where it happens:

  • If it happens inside an arguments object, we should return it partial
  • If it happens inside the wrapper of the arguments, then it depends. We don't want to send a half-function name, but as soon as we have a complete function name we can send a diff. So we try and heal the JSON (we identify which json paths can be partially healed - because they're inside the arguments - and which ones must be dropped), and only populate a tool call once we have at least a name. Likewise, if there is an array of function calls with the first complete and the next partial, we want to make sure the client can start calling the first function.

tests/test-chat-parser.cpp should make this a bit clearer, and I'm in the process of adding partial examples w/ the actual formats in tests/test-chat.cpp (look out for /* is_partial= */ true)

ochafik avatar Mar 15 '25 15:03 ochafik

Also I'm thinking, would it be simpler to first support streaming of non-function-call responses (ref my first comment), then add support for chunked arguments in a follow-up PR? This way it's easier to review and also gives you more time to experiment with that.

Re/ reviewability, I'm thinking of sending the generic parser + partial json + regex support and their tests in 1-3 separate PRs.

Server tests are now all fixed; the main remaining tasks (besides more tests) are the logprobs + tool call id regressions.

ochafik avatar Mar 15 '25 16:03 ochafik

Please forgive my ignorance on the OAI streaming protocol, but would it be worth considering waiting until the complete tool call is collected and then sending the whole thing back? The tool calls will essentially be re-assembled on the client side at the expense of a lot of JSON verbosity! And, outside of debug output, this will not be presentable to users until the whole thing arrives anyhow. 😉

I like @ngxson's idea about the state machine, as this is something I was considering for the llama-cli tool calls, where it would be necessary for cleaner output. In an ideal world, there is a simple stack which pushes tokens (perhaps a buffer >= 1) for nested begin/end delimiters, and "all that is needed" (in theory) is basic token comparison. If a tool call is opened, then wait until the rest arrives (or some timeout condition elapses).

bandoti avatar Mar 17 '25 17:03 bandoti

Please forgive my ignorance on the OAI streaming protocol, but would it be worth considering waiting until the complete tool call is collected and then sending the whole thing back?

It is still needed for some use cases, see https://github.com/ggml-org/llama.cpp/pull/12379#issuecomment-2726727434

JSON verbosity is not a big problem for now IMO; we can always enforce a minimum chunk length in the future to reduce the number of SSE events emitted.

Btw @ochafik I would like to help with this if needed. At least the case of streaming responses (without streaming tool calls) is much needed right now. And I think @bandoti asked for that because, at this point, it is the only blocker for bringing MCP into the llama.cpp server Web UI.

ngxson avatar Mar 18 '25 10:03 ngxson

Are there client compatibility issues with streaming partial tool call responses? If so, maybe streaming of the tool call response itself should be optional (e.g. the default state controllable by an argument, plus a json parameter)?

Tool calling is supported in a lot more places than open-source IDE plugins (personally I want to use it in Home Assistant and Skyrim mods; currently I'm using prompt-injection alternatives because of the streaming issue).

antcodd avatar Mar 18 '25 20:03 antcodd

I'm getting a hard crash when sending a request that has a tool response as the last message.

terminate called after throwing an instance of 'std::runtime_error'
  what():  Invalid diff: '{"name":"servethehome_homepage' not found at start of '{"name":'

The last message was:

{
  "role":"tool",
  "tool_call_id":"AJHxgbA2l5We2057NiDORsHRtf6Vcqzt",
  "content":"Navigated to https://servethehome.com/"
}

The model is google_gemma-3-12b-it-Q6_K.gguf

I'm trying to use the puppeteer mcp server. Is there a way to get more detail on this error?

llowrey avatar Mar 21 '25 14:03 llowrey

I'm getting a hard crash when sending a request that has a tool response as the last message.

terminate called after throwing an instance of 'std::runtime_error'
  what():  Invalid diff: '{"name":"servethehome_homepage' not found at start of '{"name":'

@llowrey Thanks a lot for trying this out! If you could share the llama-server output w/ --verbose flag, that would help (in particular, the last couple of Parsing input with format: lines, and if possible the request: {... log to provide a full repro case)

The model is google_gemma-3-12b-it-Q6_K.gguf

Haven't tried gemma3 yet, will definitely spend time on it over the weekend! Have you tried any other model by any chance?

Btw @ochafik I would like to help on this if needed

@ngxson Hoping to spend this rainy weekend on this, hope to send you a few parts to review if you have the time :-) (I'd also highly welcome some feedback on how I've plugged things into the slot + partial / final response logic so far)

Are there client compatibility issues with streaming partial tool call responses? If so maybe streaming of the tool call response itself should be optional (e.g. default state controllable by argument, and a json parameter)?

@antcodd Just as with OpenAI's chat completion API, streaming is enabled through the "stream": true parameter and aims to be 100% compatible with OAI's format.

And, outside of debug output this will not be presentable to users until the whole thing arrives anyhow. 😉

@bandoti Aside from the streamed thoughts (already presentable as they come), there's definitely ways to use the tool calls as they trickle back (either w/ parallel tool calls once you have some complete calls and are still receiving others, or w/ a single tool call when an argument is very long, e.g. file diffs that can be partially applied on the fly as in cline, see https://github.com/cline/cline/pull/1946 ). I hope to contribute generators-based partial TypeScript JSON decoders once this gets in :-)

I like @ngxson idea about the state machine

@bandoti I like state machines too, but in this case there may be too many states to enumerate manually, the regexps of some formats do some funky grouping that kinda simplify the code. AND we can most likely turn this whole thing into a giant state machine using C++ coroutines once the project adopts C++20. Starting with something inefficient for ease of iteration / maintenance but got my eyes on the prize ;-)

ochafik avatar Mar 21 '25 15:03 ochafik

Thanks for the quick response @ochafik

Here's the console output:

srv  update_chat_: Parsing chat message: {"tool_call": {"name": "puppeteer_screenshot", "arguments": {"name": "servethehome_homepage",
Parsing input with format Generic: {"tool_call": {"name": "puppeteer_screenshot", "arguments": {"name": "servethehome_homepage",
Failed to parse up to error: [json.exception.parse_error.101] parse error at line 1, column 94: syntax error while parsing object key - unexpected end of input; expected string literal: <<<{"tool_call": {"name": "puppeteer_screenshot", "arguments": {"name": "servethehome_homepage",>>>
Parsed partial JSON: {"tool_call":{"name":"puppeteer_screenshot","arguments":{"name":"278722862"}}} (json_healing_marker: "278722862)
Cleaned up JSON {"tool_call":{"name":"puppeteer_screenshot","arguments":{"name":"278722862"}}} to {"tool_call":{"name":"puppeteer_screenshot","arguments":"{\"name\":"}} (json_healing_marker : '"278722862')
Partial parse: incomplete tool call
Parsed message: {"role":"assistant","content":null,"tool_calls":[{"type":"function","function":{"name":"puppeteer_screenshot","arguments":"{\"name\":"}}]}
terminate called after throwing an instance of 'std::runtime_error'
  what():  Invalid diff: '{"name":"servethehome_homepage' not found at start of '{"name":'
Aborted (core dumped)

Here's the POST body that causes this: crash.json

I can send with postman and get a crash every time

After more experimenting, it works most of the time. I just got unlucky with my first attempt. I really appreciate the work you are doing and hope this info helps.

llowrey avatar Mar 21 '25 16:03 llowrey

@llowrey that specific crash should now be fixed, thanks again for the full details!

ochafik avatar Mar 23 '25 14:03 ochafik

Trying this out for myself, specifically the streamed tool calls using Qwen2.5 14B, I get the behavior shown in the attached screenshot.

There is no error in the llama-server log but here it is: https://gist.github.com/Column01/bdce2d58e53e2d440d8bb3f124e64131

Column01 avatar Apr 03 '25 19:04 Column01

Trying this out for myself, specifically the streamed tool calls using Qwen2.5 14B, I get the behavior shown in the attached screenshot.

There is no error in the llama-server log but here it is: https://gist.github.com/Column01/bdce2d58e53e2d440d8bb3f124e64131

@Column01 thanks for sharing this! I would really advise against extreme KV quantizations (esp. K) as it seems to severely degrade tool call performance in most models I tested (in your case just switching to a less aggressive -ctk q8_0 might do the trick).

(I've updated docs/function-calling.md accordingly in this branch; also, tied up a few more loose ends that should make the Qwen2.5 14B experience smoother, please give it another go if you have a chance!)

ochafik avatar Apr 04 '25 20:04 ochafik

@ochafik thanks for all the work here.

I've been using this successfully with Llama 3.1 8B-Instruct, but have encountered a compatibility issue between this implementation and PydanticAI's OpenAI API compatible client, and I don't know which side is correct.

This branch seems to have special logic in common_chat_parse_llama_3_1 to ensure that a partial tool name is never sent, and in practice I do see the full tool name sent several times in the streaming responses from llama.cpp as tool updates are sent.

On the pydantic-ai side, if it receives a delta message that includes the tool name, it appends the content to what it already has.

Net effect is that in pydantic AI, the tool that ends up getting invoked is something like my_tool_namemy_tool_namemy_tool_namemy_tool_name....

Is this a pydantic AI issue, or an issue with this implementation?

BiffaloBuff avatar Apr 23 '25 21:04 BiffaloBuff

I've been using this successfully with Llama 3.1 8B-Instruct, but have encountered a compatibility issue between this implementation and PydanticAI's OpenAI API compatible client, and I don't know which side is correct.

I'm experiencing this issue consistently across all models tested on open-webui, which leverages langchain as its backend. I've documented reproduction steps below using langchain directly in Python.

For context, this PR would be transformative for the local-LLM community. While open-webui recently implemented native tool calling, it currently functions only when streaming is enabled. If llama.cpp could support native tool calls during streaming, this would finally enable proper tool utilization with local models.

Steps to Reproduce:

  1. Using the CPU build of this branch, launch the server with the bartowski/microsoft_Phi-4-mini-instruct-GGUF model as specified in the PR instructions. My docker-compose.yaml is available in the foldout below if you wish to replicate my exact setup.

  2. Execute a langchain tool calling agent in streaming mode. To replicate my exact setup, run pip install langchain langchain-openai and run the Python code provided in the foldout below.

  3. The Python logs show repeated tool call names: "add_two_numbersadd_two_numbersadd_two_numbersadd_two_numbersadd_two_numbersadd_two_numbersadd_two_numbersadd_two_numbersadd_two_numbersadd_two_numbersadd_two_numbers". In llama.cpp server logs, JSON parsing errors appear (possibly expected in this build) and the complete tool name is sent each time instead of deltas (data stream, to_send: data: {"choices":[{ ... "delta":{"tool_calls":[{ ... "function":{"name":"add_two_numbers", ... }}]}}]}).

Let me know if I can help in any way.

langchain-toolcall.py: In case it matters, I used Python 3.11. Create a langchain-toolcall.py file as follows:
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain.globals import set_debug
set_debug(True)

@tool
def add_two_numbers(x: float, y: float) -> float:
    """Add 'x' and 'y'."""
    return x + y

prompt = ChatPromptTemplate.from_messages([
    ("system", "you're a helpful assistant"),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

tools = [add_two_numbers]


llm = ChatOpenAI(
    model="llama-cpp-model",
    api_key="sk-null",
    base_url="http://localhost:8080/v1",
    disable_streaming=False,
)

agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

print (agent_executor.invoke({"input": "what's 3 plus 5", }))
docker-compose.yaml: Create a docker-compose.yaml file as follows and run `docker compose up --build -d`.
services:
  llama-cpp-server:
    build:
      context: https://github.com/ochafik/llama.cpp.git#tool-diffs
      dockerfile: .devops/cpu.Dockerfile
      target: full
    ports:
      - "0.0.0.0:8080:8080"
    command: --jinja --alias llama-cpp-model --host 0.0.0.0 --verbose -fa -hf bartowski/microsoft_Phi-4-mini-instruct-GGUF
    entrypoint: ./llama-server

colout avatar Apr 27 '25 16:04 colout

@ochafik Fantastic work here. Do you have an ETA for getting this PR ready to merge?

I'm experimenting with the Continue.dev VSCode extension in OpenShift Dev Spaces (Eclipse Che), and using Llama.cpp to serve models from the OpenShift cluster.

This PR is a breakout feature for Llama.cpp IMO.

Cheers.

cgruver avatar Apr 28 '25 12:04 cgruver

@ochafik @ericcurtin PTAL - https://github.com/ochafik/llama.cpp/pull/3

cgruver avatar Apr 30 '25 20:04 cgruver

Hope this PR won't be forgotten or dropped; without it, some interesting recent tools don't work with LCPP (in particular, https://github.com/bytedance/deer-flow gives an error: openai.InternalServerError: Error code: 500 - {'error': {'code': 500, 'message': 'Cannot use tools with stream', 'type': 'server_error'}})

drrros avatar May 13 '25 07:05 drrros

I'm stoked and waiting for this as well, but sadly many MCP tools currently seem to have some compatibility problems with Llama.cpp from what I've seen (Dive, Aider's MCP PR, and others I've tried). Streaming support would make a big difference, but I'm honestly not sure it's the only issue.

Perhaps I just resolved the conflicts incorrectly when pulling and merging this PR, or perhaps it's not far enough along yet.

(Roo-Code's MCP calls work great with or without streaming, as it seems to work around tool calling, likely in the normal prompts.)

strawberrymelonpanda avatar May 13 '25 10:05 strawberrymelonpanda

Sorry everyone for the lack of activity. Perfect storm of job change and life events (all good!). Will try and push this (and related PRs) through in the next week, as I'm unsure how much I'll be able to do afterwards 😅.

ochafik avatar May 14 '25 15:05 ochafik

@ochafik Congrats and good luck. Obviously I think folks just want to encourage, not demand. Life always comes first.

strawberrymelonpanda avatar May 14 '25 18:05 strawberrymelonpanda

@ochafik @strawberrymelonpanda absolutely right — not in any way demanding! Best wishes!

drrros avatar May 14 '25 19:05 drrros

Sorry everyone for the lack of activity. Perfect storm of job change and life events (all good!). Will try and push this (and related PRs) through in the next week, as I'm unsure how much I'll be able to do afterwards 😅.

Best of luck in the new role!

ericcurtin avatar May 14 '25 21:05 ericcurtin

Streaming support would make a big difference, but I'm honestly not sure it's the only issue.

This PR, merged into Minja (which I believe llama.cpp uses) a couple of hours ago, might just solve the problem I was having above.

strawberrymelonpanda avatar May 15 '25 15:05 strawberrymelonpanda

For those who are following this PR, I am trying to maintain a merge from this branch and the master branch of llama.cpp here - https://github.com/cgruver/llama.cpp/tree/tools

cgruver avatar May 16 '25 14:05 cgruver

(image attachment)

unclemusclez avatar May 18 '25 22:05 unclemusclez

Seems there's a bug in the current version of the code when executing streamed tool calls with reasoning models.

I'm trying it with Qwen3 and the following sequence causes a server crash:

  • user query
  • reasons -> calls tool
  • tool response
  • reasons -> tries to respond

As soon as the second reasoning section is cleared, the server crashes with:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Invalid diff: '<think>*truncated for brevity*... The user might be looking for the slang definition, so I should highlight that. Also, note the different contexts like productivity issues and meme culture. Make sure to mention the sources and provide a concise explanation.</think>The term **"brainrot"** has multiple contextual meanings based on the search results' not found at start of '<think>*truncated for brevity*... The user might be looking for the slang definition, so I should highlight that. Also, note the different contexts like productivity issues and meme culture. Make sure to mention the sources and provide a concise explanation.</think>'

The chat that triggers it looks like this:

[
  {
    "role": "user",
    "content": "Let's search the web for \"brainrot\"."
  },
  {
    "role": "assistant",
    "tool_calls": [
      {
        "id": "0",
        "name": "search",
        "arguments": {
          "query": "brainrot"
        }
      }
    ]
  },
  {
    "role": "tool",
    "content": "*tool call dump truncated*",
    "tool_call_id": "0"
  }
]

Before that, I get warnings like this: Grammar still awaiting trigger after token 2711 ( search) (after every token that follows the thinking block)

pwilkin avatar May 19 '25 16:05 pwilkin

This is super awesome! I'm having a blast with this right now. I wanted to create a simple wrapper script to test it out with Qwen3 to see how it went. So far, it seems like it's working as intended.

I followed the instructions and tested it out with curl to see how it went and got a successful response, so I took it a few steps further and built a minimal wrapper script using OpenAI tooling to test its limits.

Server request seems clean. Model is able to respond and execute the command as well. Still working my way towards chaining messages together.

Llama Server Instance Command
llama-server --port 8080 --n-gpu-layers 32 --ctx-size 16384 --pooling mean --slots --jinja -fa -m /mnt/valerie/models/Qwen/Qwen3-1.7B/ggml-model-f16.gguf
Dot Env
OPENAI_API_KEY=sk-no-key-required
OPENAI_BASE_URL=http://localhost:8080/v1
Source File
import json
import os
import sys

import dotenv
from openai import OpenAI
from openai.types.chat.chat_completion_chunk import ChatCompletionChunk

from agent.tools.weather import get_weather

ESCAPE = "\x1b"
BOLD = ESCAPE + "[1m"
UNDERLINE = ESCAPE + "[4m"
RESET = ESCAPE + "[0m"


def create_client():
    # Load environment
    dotenv.load_dotenv(".env")

    api_key = os.getenv("OPENAI_API_KEY", "")
    base_url = os.getenv("OPENAI_BASE_URL", "")

    if not api_key:
        raise ValueError("EnvironmentError: OPENAI_API_KEY not set in .env")

    # Setup default base URL if using local mode
    if api_key == "sk-no-key-required" and not base_url:
        base_url = "http://localhost:8080/v1"

    # Initialize client
    return OpenAI(api_key=api_key, base_url=base_url)


def stream_response(response):
    tool_call_buffer = ""
    buffering_tool = False
    finish_reason = None

    for chunk in response:
        if isinstance(chunk, ChatCompletionChunk):
            delta = chunk.choices[0].delta
            finish_reason = chunk.choices[0].finish_reason

            # Handle streaming reasoning
            if delta.content:
                content = delta.content
                if content == "<think>":
                    print(f"{UNDERLINE}{BOLD}Thinking{RESET}", end="\n")
                elif content == "</think>":
                    print(f"\n{UNDERLINE}{BOLD}Completion{RESET}", end="")
                else:
                    print(content, end="")
                sys.stdout.flush()

            # Handle tool call streaming
            if delta.tool_calls:
                buffering_tool = True
                for tool_call in delta.tool_calls:
                    arguments = tool_call.function.arguments or ""
                    tool_call_buffer += arguments

    print()  # Newline after stream ends

    # Dispatch if tool call is complete
    if buffering_tool and finish_reason == "tool_calls":
        try:
            tool_args = json.loads(tool_call_buffer)
            print(f"\n{UNDERLINE}{BOLD}Calling Tool...{RESET}")
            result = get_weather(**tool_args)
            print(f"\n{UNDERLINE}{BOLD}Tool Result:{RESET} {result}")
        except json.JSONDecodeError:
            print(f"{BOLD}Warning:{RESET} Failed to decode tool call arguments.")


def main():
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Retrieves current weather for the given location.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA",
                        },
                        "units": {
                            "type": "string",
                            "enum": ["metric", "uscs"],
                            "description": "The unit system. Default is 'metric'.",
                        },
                    },
                    "required": ["location", "units"],
                    "additionalProperties": False,
                },
                "strict": True,
            },
        }
    ]

    # Sample chat sequence
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the weather like in Paris today?"},
    ]

    try:
        client = create_client()
        response = client.chat.completions.create(
            model="qwen3",  # Use "gpt-4" for OpenAI, "qwen3" for local
            messages=messages,
            stream=True,
            temperature=0.8,
            tools=tools,
        )
        stream_response(response)
    except Exception as e:
        print(f"Error: {e}")


if __name__ == "__main__":
    main()
Llama Server Request
srv  params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id  0 | task 746 | processing task
slot update_slots: id  0 | task 746 | new prompt, n_ctx_slot = 16384, n_keep = 0, n_prompt_tokens = 223
slot update_slots: id  0 | task 746 | need to evaluate at least 1 token to generate logits, n_past = 223, n_prompt_tokens = 223
slot update_slots: id  0 | task 746 | kv cache rm [222, end)
slot update_slots: id  0 | task 746 | prompt processing progress, n_past = 223, n_tokens = 1, progress = 0.004484
slot update_slots: id  0 | task 746 | prompt done, n_past = 223, n_tokens = 1
slot      release: id  0 | task 746 | stop processing: n_past = 345, truncated = 0
slot print_timing: id  0 | task 746 | 
prompt eval time =      29.95 ms /     1 tokens (   29.95 ms per token,    33.38 tokens per second)
       eval time =    2268.35 ms /   123 tokens (   18.44 ms per token,    54.22 tokens per second)
      total time =    2298.30 ms /   124 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200

Screencast From 2025-05-20 00-44-34.webm

I love this. This is awesome. I've been jonesing for local memory management and this absolutely opens the door for doing exactly that. I cannot express enough gratitude for all the work that's gone into all of this. Awesome work!

I'll keep an eye out for edge cases if I think it's relevant to this PR.

One minor bug I think I already spotted is that the initial tokens are coupled.

print(f"::{content}", end="")

I just prepended a pair of colons to try to reveal why I couldn't select <think>, which I can do on the master branch.

::<think>Okay::,:: the:: user:: is:: asking:: 

<think> and Okay should be separate tokens?

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"<think>Okay"}}],"created":1747719891,"id":"chatcmpl-WDzWFkqkzx7pLHN3wBk9odyvdB1szfBn","mo
del":"gpt-3.5-turbo","system_fingerprint":"b5510-810c4c32","object":"chat.completion.chunk"}

teleprint-me avatar May 20 '25 04:05 teleprint-me