Streaming Doesn't Work When Using Tools
Description
When tools are passed to the chat function with stream=True, streaming output does not work as expected: responses are not yielded incrementally but appear to be processed only after all tool calls have completed.
Steps to Reproduce
- Use the following code to set up a chat with a tool-enabled model:
import ollama

OLLAMA_CLIENT = ollama.Client(host="172.16.2.96:11434")


def get_location() -> str:
    """
    Get the current geographic location.

    Returns:
        str: Current geographic location.
    """
    return "Shanghai"


def get_weather(location: str) -> str:
    """
    Get the weather conditions for a specific location.

    Args:
        location (str): Geographic location.

    Returns:
        str: Weather conditions.
    """
    return "Sunny, temperature 25°C."


available_functions = {
    'get_location': get_location,
    'get_weather': get_weather,
}


def chat_generate(options):
    tool_calls = []
    for part in OLLAMA_CLIENT.chat(**options):
        yield part.message.content  # Expecting incremental streaming here
        options["messages"].append(part.message)
        if part.message.tool_calls:
            for tool in part.message.tool_calls:
                if function_to_call := available_functions.get(tool.function.name):
                    output = function_to_call(**tool.function.arguments)
                    tool_calls.append({
                        'role': 'tool', 'content': str(output), 'name': tool.function.name
                    })
    for result in tool_calls:
        print(result)
        options["messages"].append(result)
    if len(tool_calls) > 0:
        for result in chat_generate(options):
            yield result


for result in chat_generate({
    "model": "qwq",
    "messages": [
        {"role": "user", "content": "How is the weather now?"},
    ],
    "stream": True,
    "tools": [get_location, get_weather],
}):
    print(result)
- Run the script and observe the behavior; a timing sketch that makes the buffering visible follows below.
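To make the buffering easier to observe, each yielded chunk can be timestamped as it arrives; a minimal sketch reusing the chat_generate function above (the timing wrapper is an illustration added here, not part of the original report):

import time

start = time.monotonic()
for chunk in chat_generate({
    "model": "qwq",
    "messages": [{"role": "user", "content": "How is the weather now?"}],
    "stream": True,
    "tools": [get_location, get_weather],
}):
    # With working streaming the offsets differ from chunk to chunk;
    # in the buggy case every chunk prints at roughly the same offset.
    print(f"[{time.monotonic() - start:6.2f}s] {chunk!r}")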
Expected Behavior
The response should be streamed incrementally, even when tool calls are involved.
Actual Behavior
- The response is blocked until all tool calls are completed.
- No incremental output is yielded while waiting for the tool responses.
Environment
- Ollama Server Version: 0.5.12
- Ollama Python Client Version: 0.4.7
- Python: 3.8.5
- OS: Windows 11
Additional Information
It seems like the OLLAMA_CLIENT.chat function doesn't return partial results when tool calls are required. Is there a way to make tool calls asynchronous while maintaining streaming behavior?
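For reference, a minimal sketch of what such an asynchronous variant could look like with AsyncClient, reusing the available_functions mapping from the report (the flow is illustrative only, asyncio.to_thread needs Python 3.9+, and this does not by itself change the buffering described above):

import asyncio

from ollama import AsyncClient

ASYNC_CLIENT = AsyncClient(host="172.16.2.96:11434")


async def chat_generate_async(options):
    tool_results = []
    async for part in await ASYNC_CLIENT.chat(**options):
        yield part.message.content
        options["messages"].append(part.message)
        if part.message.tool_calls:
            for tool in part.message.tool_calls:
                if fn := available_functions.get(tool.function.name):
                    # Run the synchronous tool in a worker thread so the event
                    # loop can keep consuming the stream.
                    output = await asyncio.to_thread(fn, **tool.function.arguments)
                    tool_results.append(
                        {'role': 'tool', 'content': str(output), 'name': tool.function.name}
                    )
    options["messages"].extend(tool_results)
    if tool_results:
        async for chunk in chat_generate_async(options):
            yield chunk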
Hi, when you provide tools to chat and set stream=True, you'll either get a normal stream if the model doesn't call any tool, or, as in your case, two dictionaries: the first contains the tool_calls key and the second contains metadata such as eval_count and similar.
Basically in your code the parts should be something like the following:
{'model': 'qwq', 'created_at': '...', 'message': {'role': 'assistant', 'content': '', 'tool_calls': [{'function': {'name': 'name_of_function', 'arguments': {...}}}]}, 'done': False}
{'model': 'qwq', 'created_at': '...', 'message': {'role': 'assistant', 'content': ''}, 'done_reason': 'stop', 'done': True, 'total_duration': ..., 'load_duration': ..., 'prompt_eval_count': ..., 'prompt_eval_duration': ..., 'eval_count': ..., 'eval_duration': ...}
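In other words, a streaming consumer has to branch on whether a part carries incremental content or a tool call; a minimal sketch (client, messages, tools and handle_tool_calls are placeholders, not code from this thread):

for part in client.chat(model="qwq", messages=messages, tools=tools, stream=True):
    if part.message.tool_calls:
        # Tool calls arrive as a single part whose content is empty.
        handle_tool_calls(part.message.tool_calls)
    elif part.message.content:
        # Plain answers (no tool call) are expected to stream token by token.
        print(part.message.content, end="", flush=True)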
I finally understand what you mean:
I adapted your script, and I'm sending 2 inputs to the LLM (with the messages from input 1 passed to input 2, to simulate history).
The first input asks the same question as yours, which involves 2 tool calls. For me the result is acceptable:
But for the second input, which involves 2 other functions, the behavior is not the same (notice the tool call results logged at the exact same time):
However, I'm not sure whether the LLM got the tool output at the same time as I did, since its thinking process doesn't reflect any problems:
<think>
Okay, let me tackle the user's questions step by step. First, they asked about the tools I have access to. Looking back at the provided functions in the initial setup, there are several: browse_url, get_current_date_time, get_location, get_weather, list_files, read_file, and run_command. So I need to list those out clearly.
Next, the user wants to know how many files are in the current directory. The previous tool response from list_files showed a list of files. Let me count them again to be sure. The array has entries like 'scratchpad.py', 'tests_ollama_with_tools_advanced.py', etc., and when I check the length, there are 27 files listed. Wait, let me recount: starting from scratchpad.py up to openai_settings.py... yes, that's 27 items.
Then they asked about the file 'tests_ollama.sh'. The read_file tool was used, and the content shows it's a bash script for interacting with Ollama. It includes functions like selecting models, handling tool calls, and performing web searches via curl commands. I should summarize that this script is for testing Ollama by letting users choose a model and send queries, which then trigger function calls like web_search.
I need to present all three answers clearly: list the tools first, state the file count, explain the script's purpose. Make sure each part is concise but informative so the user gets exactly what they asked for without extra fluff.
</think>
I was also positively surprised that this setup lets the LLM retry on tool call errors.
Can reproduce this issue
Script used:
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "ollama",
# ]
# ///
import asyncio

from ollama import AsyncClient


async def chat():
    message = {'role': 'user', 'content': 'hello!'}
    async for part in await AsyncClient().chat(
        model='qwen2.5:7b',
        messages=[message],
        stream=True,
        tools=[{'type': 'function', 'function': {'name': 'get_random', 'description': 'Get a random number\n\n Returns:\n float: a random number\n ', 'parameters': {'properties': {}, 'title': 'get_randomArguments', 'type': 'object'}}}],
    ):
        print(part['message']['content'], end='', flush=True)


asyncio.run(chat())
With tools
https://github.com/user-attachments/assets/b7871045-1cfb-4384-9b16-19d4868dafad
With tools commented:
https://github.com/user-attachments/assets/a7d0a8ee-6f16-4412-b53d-a5431837b128
After further testing against the API directly, I can conclude that the issue comes from the model API itself, not the Python binding.
Request sent to /api/chat:
{
  "model": "qwen2.5:7b",
  "messages": [
    {
      "role": "user",
      "content": "hello!"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_random",
        "description": "Get a random number\n\n Returns:\n float: a random number\n ",
        "parameters": {
          "properties": {},
          "title": "get_randomArguments",
          "type": "object"
        }
      }
    }
  ]
}
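For anyone who wants to replay this request at the API level, a minimal sketch using the requests package against a local server at http://localhost:11434 (both are assumptions, not details from the original report; the tool definition is abridged from the payload above):

import json

import requests

payload = {
    "model": "qwen2.5:7b",
    "messages": [{"role": "user", "content": "hello!"}],
    "stream": True,
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_random",
                "description": "Get a random number",
                "parameters": {"properties": {}, "type": "object"},
            },
        }
    ],
}

# /api/chat streams one JSON object per line; if nothing arrives until
# generation has finished, the buffering happens on the server side.
with requests.post("http://localhost:11434/api/chat", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            print(json.loads(line))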
https://github.com/ollama/ollama/issues/6127
https://github.com/ollama/ollama/pull/9938
Being worked on! Will try to get back to it soon https://github.com/ollama/ollama/pull/10028
@ParthSareen any updates?
almost here!! https://github.com/ollama/ollama/pull/10415 @MohamedYasserOaf
@ParthSareen caught you in the middle of the heat xD, good luck and looking forward to the fix.
yes please
@ParthSareen hope this lands soon! Fingers crossed :)
Experiencing the same issue. When tools are available but not called by the model, the response is returned all at once instead of streaming incrementally.
Will be in next release!