Streaming Doesn't Work When Using Tools
Description
When tools are passed to the chat function with stream=True, streaming output does not work as expected: responses are not yielded incrementally but appear to be processed only after all tool calls have completed.
Steps to Reproduce
- Use the following code to set up a chat with a tool-enabled model:
import ollama

OLLAMA_CLIENT = ollama.Client(host="172.16.2.96:11434")


def get_location() -> str:
    """
    Get the current geographic location.

    Returns:
        str: Current geographic location.
    """
    return "Shanghai"


def get_weather(location: str) -> str:
    """
    Get the weather conditions for a specific location.

    Args:
        location (str): Geographic location.

    Returns:
        str: Weather conditions.
    """
    return "Sunny, temperature 25°C."


available_functions = {
    'get_location': get_location,
    'get_weather': get_weather,
}


def chat_generate(options):
    tool_calls = []
    for part in OLLAMA_CLIENT.chat(**options):
        yield part.message.content  # Expecting incremental streaming here
        options["messages"].append(part.message)
        if part.message.tool_calls:
            for tool in part.message.tool_calls:
                if function_to_call := available_functions.get(tool.function.name):
                    output = function_to_call(**tool.function.arguments)
                    tool_calls.append({
                        'role': 'tool', 'content': str(output), 'name': tool.function.name
                    })
    for result in tool_calls:
        print(result)
        options["messages"].append(result)
    if len(tool_calls) > 0:
        for result in chat_generate(options):
            yield result


for result in chat_generate({
    "model": "qwq",
    "messages": [
        {"role": "user", "content": "How is the weather now?"},
    ],
    "stream": True,
    "tools": [get_location, get_weather],
}):
    print(result)
- Run the script and observe the behavior; a timing sketch that makes the buffering visible follows below.
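To make the buffering easier to observe, each yielded chunk can be timestamped as it arrives; a minimal sketch reusing the chat_generate function above (the timing wrapper is an illustration added here, not part of the original report):

import time

start = time.monotonic()
for chunk in chat_generate({
    "model": "qwq",
    "messages": [{"role": "user", "content": "How is the weather now?"}],
    "stream": True,
    "tools": [get_location, get_weather],
}):
    # With working streaming the offsets differ from chunk to chunk;
    # in the buggy case every chunk prints at roughly the same offset.
    print(f"[{time.monotonic() - start:6.2f}s] {chunk!r}")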
Expected Behavior
The response should be streamed incrementally, even when tool calls are involved.
Actual Behavior
- The response is blocked until all tool calls are completed.
- No incremental output is yielded while waiting for the tool responses.
Environment
- Ollama Server Version: 0.5.12
- Ollama Python Client Version: 0.4.7
- Python: 3.8.5
- OS: Windows 11
Additional Information
It seems like the OLLAMA_CLIENT.chat function doesn't return partial results when tool calls are required. Is there a way to make tool calls asynchronous while maintaining streaming behavior?
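For reference, a minimal sketch of what such an asynchronous variant could look like with AsyncClient, reusing the available_functions mapping from the report (the flow is illustrative only, asyncio.to_thread needs Python 3.9+, and this does not by itself change the buffering described above):

import asyncio

from ollama import AsyncClient

ASYNC_CLIENT = AsyncClient(host="172.16.2.96:11434")


async def chat_generate_async(options):
    tool_results = []
    async for part in await ASYNC_CLIENT.chat(**options):
        yield part.message.content
        options["messages"].append(part.message)
        if part.message.tool_calls:
            for tool in part.message.tool_calls:
                if fn := available_functions.get(tool.function.name):
                    # Run the synchronous tool in a worker thread so the event
                    # loop can keep consuming the stream.
                    output = await asyncio.to_thread(fn, **tool.function.arguments)
                    tool_results.append(
                        {'role': 'tool', 'content': str(output), 'name': tool.function.name}
                    )
    options["messages"].extend(tool_results)
    if tool_results:
        async for chunk in chat_generate_async(options):
            yield chunk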
Hi, when you provide tools to chat and set stream=True, you'll either get a normal stream if the model doesn't call any tool, or, as in your case, two dictionaries: the first contains the tool_calls key and the second contains metadata such as eval_count and similar.
Basically in your code the parts should be something like the following:
{'model': 'qwq', 'created_at': '...', 'message': {'role': 'assistant', 'content': '', 'tool_calls': [{'function': {'name': 'name_of_function', 'arguments': {...}}}]}, 'done': False}
{'model': 'qwq', 'created_at': '...', 'message': {'role': 'assistant', 'content': ''}, 'done_reason': 'stop', 'done': True, 'total_duration': ..., 'load_duration': ..., 'prompt_eval_count': ..., 'prompt_eval_duration': ..., 'eval_count': ..., 'eval_duration': ...}
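In other words, a streaming consumer has to branch on whether a part carries incremental content or a tool call; a minimal sketch (client, messages, tools and handle_tool_calls are placeholders, not code from this thread):

for part in client.chat(model="qwq", messages=messages, tools=tools, stream=True):
    if part.message.tool_calls:
        # Tool calls arrive as a single part whose content is empty.
        handle_tool_calls(part.message.tool_calls)
    elif part.message.content:
        # Plain answers (no tool call) are expected to stream token by token.
        print(part.message.content, end="", flush=True)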
I finally understand what you mean:
I adapted your script, and I'm sending 2 inputs to the LLM (with the messages from input 1 passed to input 2, to simulate history).
The first input asks the same question as yours, which involves 2 tool calls. For me the result is acceptable:
But for the second input, which involves 2 other functions, the behavior is not the same (notice the tool call results logged at the exact same time):
However, I'm not sure whether the LLM got the tool output at the same time as I did, since its thinking process doesn't reflect any problems:
<think>
Okay, let me tackle the user's questions step by step. First, they asked about the tools I have access to. Looking back at the provided functions in the initial setup, there are several: browse_url, get_current_date_time, get_location, get_weather, list_files, read_file, and run_command. So I need to list those out clearly.
Next, the user wants to know how many files are in the current directory. The previous tool response from list_files showed a list of files. Let me count them again to be sure. The array has entries like 'scratchpad.py', 'tests_ollama_with_tools_advanced.py', etc., and when I check the length, there are 27 files listed. Wait, let me recount: starting from scratchpad.py up to openai_settings.py... yes, that's 27 items.
Then they asked about the file 'tests_ollama.sh'. The read_file tool was used, and the content shows it's a bash script for interacting with Ollama. It includes functions like selecting models, handling tool calls, and performing web searches via curl commands. I should summarize that this script is for testing Ollama by letting users choose a model and send queries, which then trigger function calls like web_search.
I need to present all three answers clearly: list the tools first, state the file count, explain the script's purpose. Make sure each part is concise but informative so the user gets exactly what they asked for without extra fluff.
</think>
I was also positively surprised that this setup lets the LLM retry on tool call errors.
Can reproduce this issue
Script used:
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "ollama",
# ]
# ///
import asyncio

from ollama import AsyncClient


async def chat():
    message = {'role': 'user', 'content': 'hello!'}
    async for part in await AsyncClient().chat(
        model='qwen2.5:7b',
        messages=[message],
        stream=True,
        tools=[{'type': 'function', 'function': {'name': 'get_random', 'description': 'Get a random number\n\n Returns:\n float: a random number\n ', 'parameters': {'properties': {}, 'title': 'get_randomArguments', 'type': 'object'}}}],
    ):
        print(part['message']['content'], end='', flush=True)


asyncio.run(chat())
With tools
https://github.com/user-attachments/assets/b7871045-1cfb-4384-9b16-19d4868dafad
With tools commented:
https://github.com/user-attachments/assets/a7d0a8ee-6f16-4412-b53d-a5431837b128
After further testing against the API directly, I can conclude that the issue comes from the model API itself, not the Python binding.
Request sent to /api/chat:
{
  "model": "qwen2.5:7b",
  "messages": [
    {
      "role": "user",
      "content": "hello!"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_random",
        "description": "Get a random number\n\n Returns:\n float: a random number\n ",
        "parameters": {
          "properties": {},
          "title": "get_randomArguments",
          "type": "object"
        }
      }
    }
  ]
}
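For anyone who wants to replay this request at the API level, a minimal sketch using the requests package against a local server at http://localhost:11434 (both are assumptions, not details from the original report; the tool definition is abridged from the payload above):

import json

import requests

payload = {
    "model": "qwen2.5:7b",
    "messages": [{"role": "user", "content": "hello!"}],
    "stream": True,
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_random",
                "description": "Get a random number",
                "parameters": {"properties": {}, "type": "object"},
            },
        }
    ],
}

# /api/chat streams one JSON object per line; if nothing arrives until
# generation has finished, the buffering happens on the server side.
with requests.post("http://localhost:11434/api/chat", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            print(json.loads(line))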
https://github.com/ollama/ollama/issues/6127
https://github.com/ollama/ollama/pull/9938
Being worked on! Will try to get back to it soon https://github.com/ollama/ollama/pull/10028
@ParthSareen any updates?
almost here!! https://github.com/ollama/ollama/pull/10415 @MohamedYasserOaf
@ParthSareen caught you in the middle of the heat xD, good luck and looking forward to the fix.
yes please
@ParthSareen hope this lands soon! Fingers crossed :)
Experiencing the same issue. When tools are available but not called by the model, the response is returned all at once instead of streaming incrementally.
Will be in next release!