Mechanism for attaching tool execution requests to a Response
Part of tools: #898
The hard part here will be dealing with streaming. Here's what OpenAI does there, from https://platform.openai.com/docs/guides/function-calling?api-mode=responses#streaming
{"type":"response.output_item.added","response_id":"resp_1234xyz","output_index":0,"item":{"type":"function_call","id":"fc_1234xyz","call_id":"call_1234xyz","name":"get_weather","arguments":""}}
{"type":"response.function_call_arguments.delta","response_id":"resp_1234xyz","item_id":"fc_1234xyz","output_index":0,"delta":"{\""}
{"type":"response.function_call_arguments.delta","response_id":"resp_1234xyz","item_id":"fc_1234xyz","output_index":0,"delta":"location"}
{"type":"response.function_call_arguments.delta","response_id":"resp_1234xyz","item_id":"fc_1234xyz","output_index":0,"delta":"\":\""}
{"type":"response.function_call_arguments.delta","response_id":"resp_1234xyz","item_id":"fc_1234xyz","output_index":0,"delta":"Paris"}
{"type":"response.function_call_arguments.delta","response_id":"resp_1234xyz","item_id":"fc_1234xyz","output_index":0,"delta":","}
{"type":"response.function_call_arguments.delta","response_id":"resp_1234xyz","item_id":"fc_1234xyz","output_index":0,"delta":" France"}
{"type":"response.function_call_arguments.delta","response_id":"resp_1234xyz","item_id":"fc_1234xyz","output_index":0,"delta":"\"}"}
{"type":"response.function_call_arguments.done","response_id":"resp_1234xyz","item_id":"fc_1234xyz","output_index":0,"arguments":"{\"location\":\"Paris, France\"}"}
{"type":"response.output_item.done","response_id":"resp_1234xyz","output_index":0,"item":{"type":"function_call","id":"fc_1234xyz","call_id":"call_2345abc","name":"get_weather","arguments":"{\"location\":\"Paris, France\"}"}}
Ugh, I also need to decide if/how I'm going to support multiple parallel tool call requests, which are a thing for at least OpenAI and Gemini:
- https://platform.openai.com/docs/guides/function-calling/parallel-function-calling?api-mode=chat#parallel-function-calling
- https://ai.google.dev/gemini-api/docs/function-calling?example=meeting#parallel_function_calling
I think I'm going to model this so it can ALWAYS represent multiple tool execution requests; models that only support one tool call at a time can then populate the list with a single item and everything will work fine.
Two ways this could work:
- We hold off on executing any tool calls until the request has finished coming in
- We integrate with the streaming mechanism, so a tool call can be kicked off even while the response is still being generated (pretty cool, especially for async stuff)
I'm going to implement the easier option first, but I'll try to design it so that if I want to upgrade to the more advanced version later I haven't trapped myself.
Since the response object is available inside that .execute() method already, the easiest thing to do would be to have a response.request_tool(...) method or similar.
This can be called multiple times, which means models that implement streaming could already call it while the stream is coming in.
Open question: do we start to execute the tool when that method is called, or do we do that entirely outside of the .execute() method? I think the simpler option is to only execute tools once the full response has completed (leaving the option open for something fancier, see the previous comment).
This looks similar to response.set_usage(...).
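A rough sketch of that simpler flow, using the provisional request_tool(...) name; the class shape here is an assumption about how this could look, not the actual llm internals:

```python
class Response:
    def __init__(self, tools):
        self.tools = tools  # hypothetical mapping of tool name -> callable
        self._requested_tools = []

    def request_tool(self, name, arguments):
        # Models can call this any number of times, including while the
        # response is still streaming in
        self._requested_tools.append((name, arguments))

    def _on_complete(self):
        # Simpler option: only run the requested tools once the full response
        # has arrived, leaving the door open for kicking them off mid-stream later
        return [
            self.tools[name](**arguments)
            for name, arguments in self._requested_tools
        ]
```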
What to call this?
Claude returns messages of type tool_use:
```json
{
  "type": "tool_use",
  "id": "toolu_01A09q90qw90lq917835lq9",
  "name": "get_weather",
  "input": {
    "location": "San Francisco, CA",
    "unit": "celsius"
  }
}
```
OpenAI puts it in a "tool_calls" list with things like this:
```json
[{
  "id": "call_12345xyz",
  "type": "function",
  "function": {
    "name": "get_weather",
    "arguments": "{\"location\":\"Paris, France\"}"
  }
}]
```
Gemini has response.functionCalls.
Ollama has response.message.tool_calls.
Decision: I'm going with response.add_tool_call(...) - the feature is called "tools" and having it as add_tool_call() reminds us that it can be called more than once.
(I considered response.request_tool_call() but I don't like that request is also a noun that often pairs with response in web frameworks.)
Next question: do I need a ToolCall abstraction to paper over differences between different models? I'm going to assume so and start with that, I'll simplify later if it's not needed.
```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    arguments: dict
```
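To make that concrete, here's how a plugin might map the provider-specific shapes shown above onto ToolCall; the helper names are invented for this sketch:

```python
import json


def tool_call_from_openai(item: dict) -> ToolCall:
    # OpenAI sends arguments as a JSON-encoded string
    return ToolCall(
        name=item["function"]["name"],
        arguments=json.loads(item["function"]["arguments"]),
    )


def tool_call_from_claude(block: dict) -> ToolCall:
    # Claude's tool_use blocks already carry a dict in "input"
    return ToolCall(name=block["name"], arguments=block["input"])
```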
To exercise this I need a real implementation. Switching to:
- #988
I should stash these as self._tool_calls so that there can be response.tool_calls() and await response.tool_calls() methods that force the response to execute (._force()) before returning the tools.
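Roughly what I have in mind, assuming a _force() that consumes the underlying stream; none of this is the final implementation:

```python
class Response:
    def __init__(self):
        self._tool_calls = []

    def _force(self):
        ...  # consume the stream so all tool call requests have been collected

    def tool_calls(self):
        self._force()
        return self._tool_calls


class AsyncResponse:
    def __init__(self):
        self._tool_calls = []

    async def _force(self):
        ...  # await the underlying stream instead

    async def tool_calls(self):
        await self._force()
        return self._tool_calls
```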
Demonstrated this working by the end of:
- https://github.com/simonw/llm/issues/937#issuecomment-2870479021