Mechanism for attaching tool execution requests to a Response
Part of tools: #898
The hard part here will be dealing with streaming. Here's what OpenAI does there, from https://platform.openai.com/docs/guides/function-calling?api-mode=responses#streaming
{"type":"response.output_item.added","response_id":"resp_1234xyz","output_index":0,"item":{"type":"function_call","id":"fc_1234xyz","call_id":"call_1234xyz","name":"get_weather","arguments":""}}
{"type":"response.function_call_arguments.delta","response_id":"resp_1234xyz","item_id":"fc_1234xyz","output_index":0,"delta":"{\""}
{"type":"response.function_call_arguments.delta","response_id":"resp_1234xyz","item_id":"fc_1234xyz","output_index":0,"delta":"location"}
{"type":"response.function_call_arguments.delta","response_id":"resp_1234xyz","item_id":"fc_1234xyz","output_index":0,"delta":"\":\""}
{"type":"response.function_call_arguments.delta","response_id":"resp_1234xyz","item_id":"fc_1234xyz","output_index":0,"delta":"Paris"}
{"type":"response.function_call_arguments.delta","response_id":"resp_1234xyz","item_id":"fc_1234xyz","output_index":0,"delta":","}
{"type":"response.function_call_arguments.delta","response_id":"resp_1234xyz","item_id":"fc_1234xyz","output_index":0,"delta":" France"}
{"type":"response.function_call_arguments.delta","response_id":"resp_1234xyz","item_id":"fc_1234xyz","output_index":0,"delta":"\"}"}
{"type":"response.function_call_arguments.done","response_id":"resp_1234xyz","item_id":"fc_1234xyz","output_index":0,"arguments":"{\"location\":\"Paris, France\"}"}
{"type":"response.output_item.done","response_id":"resp_1234xyz","output_index":0,"item":{"type":"function_call","id":"fc_1234xyz","call_id":"call_2345abc","name":"get_weather","arguments":"{\"location\":\"Paris, France\"}"}}
Ugh, I also need to decide if/how I'm going to support multiple parallel tool call requests, which are a thing for at least OpenAI and Gemini:
- https://platform.openai.com/docs/guides/function-calling/parallel-function-calling?api-mode=chat#parallel-function-calling
- https://ai.google.dev/gemini-api/docs/function-calling?example=meeting#parallel_function_calling
I think I'm going to model this so it can ALWAYS represent multiple tool execution requests; models that only support one tool call at a time can then populate the list with a single item and everything will work fine.
Two ways this could work:
- We hold off on executing any tool calls until the request has finished coming in
- We integrate with the streaming mechanism, so a tool call can be kicked off even while the response is still being generated (pretty cool, especially for async stuff)
I'm going to implement the easier option first, but I'll try to design it so that if I want to upgrade to the more advanced version later I haven't trapped myself.
Since the response object is available inside that .execute() method already, the easiest thing to do would be to have a response.request_tool(...) method or similar.
This can be called multiple times, which means models that implement streaming could already call it while the stream is coming in.
Open question: do we start to execute the tool when that method is called, or do we do that entirely outside of the .execute() method? I think the simpler option is to only execute tools once the full response has completed (leaving the option open for something fancier, see the previous comment).
This looks similar to response.set_usage(...).
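A rough sketch of that simpler flow, using the provisional request_tool(...) name; the class shape here is an assumption about how this could look, not the actual llm internals:

```python
class Response:
    def __init__(self, tools):
        self.tools = tools  # hypothetical mapping of tool name -> callable
        self._requested_tools = []

    def request_tool(self, name, arguments):
        # Models can call this any number of times, including while the
        # response is still streaming in
        self._requested_tools.append((name, arguments))

    def _on_complete(self):
        # Simpler option: only run the requested tools once the full response
        # has arrived, leaving the door open for kicking them off mid-stream later
        return [
            self.tools[name](**arguments)
            for name, arguments in self._requested_tools
        ]
```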
What to call this?
Claude returns messages of type tool_use:
```json
{
  "type": "tool_use",
  "id": "toolu_01A09q90qw90lq917835lq9",
  "name": "get_weather",
  "input": {
    "location": "San Francisco, CA",
    "unit": "celsius"
  }
}
```
OpenAI puts it in a "tool_calls" list with things like this:
```json
[{
  "id": "call_12345xyz",
  "type": "function",
  "function": {
    "name": "get_weather",
    "arguments": "{\"location\":\"Paris, France\"}"
  }
}]
```
Gemini has response.functionCalls.
Ollama has response.message.tool_calls.
Decision: I'm going with response.add_tool_call(...) - the feature is called "tools" and having it as add_tool_call() reminds us that it can be called more than once.
(I considered response.request_tool_call() but I don't like that request is also a noun that often pairs with response in web frameworks.)
Next question: do I need a ToolCall abstraction to paper over differences between different models? I'm going to assume so and start with that, I'll simplify later if it's not needed.
```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    arguments: dict
```
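To make that concrete, here's how a plugin might map the provider-specific shapes shown above onto ToolCall; the helper names are invented for this sketch:

```python
import json


def tool_call_from_openai(item: dict) -> ToolCall:
    # OpenAI sends arguments as a JSON-encoded string
    return ToolCall(
        name=item["function"]["name"],
        arguments=json.loads(item["function"]["arguments"]),
    )


def tool_call_from_claude(block: dict) -> ToolCall:
    # Claude's tool_use blocks already carry a dict in "input"
    return ToolCall(name=block["name"], arguments=block["input"])
```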
To exercise this I need a real implementation. Switching to:
- #988
I should stash these as self._tool_calls so that there can be response.tool_calls() and await response.tool_calls() methods that force the response to execute (._force()) before returning the tools.
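Roughly what I have in mind, assuming a _force() that consumes the underlying stream; none of this is the final implementation:

```python
class Response:
    def __init__(self):
        self._tool_calls = []

    def _force(self):
        ...  # consume the stream so all tool call requests have been collected

    def tool_calls(self):
        self._force()
        return self._tool_calls


class AsyncResponse:
    def __init__(self):
        self._tool_calls = []

    async def _force(self):
        ...  # await the underlying stream instead

    async def tool_calls(self):
        await self._force()
        return self._tool_calls
```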
Demonstrated this working by the end of:
- https://github.com/simonw/llm/issues/937#issuecomment-2870479021