Simon Willison
Looks like there's new code for chat in this branch: https://github.com/Blaizzy/mlx-vlm/tree/pc/video - e.g. https://github.com/Blaizzy/mlx-vlm/commit/810fb532c873054bdcb35998719538732a99a5f1
Here's a concrete example of how I'd like to be able to use `mlx-vlm`, taken from my new `llm-mlx` plugin: https://github.com/simonw/llm-mlx/blob/01fa4ed83deab763af2d05ea2594ce857eeae532/llm_mlx.py#L76-L105

More information on that here: https://simonwillison.net/2025/Feb/15/llm-mlx/
The hard part here will be dealing with streaming. Here's what OpenAI does there, from https://platform.openai.com/docs/guides/function-calling?api-mode=responses#streaming

```
{"type":"response.output_item.added","response_id":"resp_1234xyz","output_index":0,"item":{"type":"function_call","id":"fc_1234xyz","call_id":"call_1234xyz","name":"get_weather","arguments":""}}
{"type":"response.function_call_arguments.delta","response_id":"resp_1234xyz","item_id":"fc_1234xyz","output_index":0,"delta":"{\""}
{"type":"response.function_call_arguments.delta","response_id":"resp_1234xyz","item_id":"fc_1234xyz","output_index":0,"delta":"location"}
{"type":"response.function_call_arguments.delta","response_id":"resp_1234xyz","item_id":"fc_1234xyz","output_index":0,"delta":"\":\""}
{"type":"response.function_call_arguments.delta","response_id":"resp_1234xyz","item_id":"fc_1234xyz","output_index":0,"delta":"Paris"}
{"type":"response.function_call_arguments.delta","response_id":"resp_1234xyz","item_id":"fc_1234xyz","output_index":0,"delta":","}
{"type":"response.function_call_arguments.delta","response_id":"resp_1234xyz","item_id":"fc_1234xyz","output_index":0,"delta":" France"}
{"type":"response.function_call_arguments.delta","response_id":"resp_1234xyz","item_id":"fc_1234xyz","output_index":0,"delta":"\"}"}
{"type":"response.function_call_arguments.done","response_id":"resp_1234xyz","item_id":"fc_1234xyz","output_index":0,"arguments":"{\"location\":\"Paris, France\"}"}
{"type":"response.output_item.done","response_id":"resp_1234xyz","output_index":0,"item":{"type":"function_call","id":"fc_1234xyz","call_id":"call_2345abc","name":"get_weather","arguments":"{\"location\":\"Paris, France\"}"}}
...
```
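To convince myself I understand that protocol, here's a rough sketch of the accumulation a client would have to do. Nothing here is real API - it just consumes dicts shaped like the example events above:

```python
import json


def accumulate_tool_calls(events):
    """Collect streamed function call events into completed tool calls.

    `events` is an iterable of dicts shaped like the example events above.
    The arguments string is built up one fragment at a time, so nothing is
    executable until the ...arguments.done event for that call arrives.
    """
    calls = {}  # item id -> {"name": ..., "arguments": ""}
    for event in events:
        if event["type"] == "response.output_item.added":
            item = event["item"]
            if item.get("type") == "function_call":
                calls[item["id"]] = {"name": item["name"], "arguments": ""}
        elif event["type"] == "response.function_call_arguments.delta":
            calls[event["item_id"]]["arguments"] += event["delta"]
        elif event["type"] == "response.function_call_arguments.done":
            # The done event carries the full string - treat it as the
            # authoritative value rather than the concatenated deltas
            calls[event["item_id"]]["arguments"] = event["arguments"]
    return [
        {"name": call["name"], "arguments": json.loads(call["arguments"])}
        for call in calls.values()
    ]
```

The key point being that tool execution can't start until the `done` event for each call has arrived.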
Ugh, I also need to decide if/how I'm going to support multiple parallel tool call requests, which are a thing for at least OpenAI and Gemini:

- https://platform.openai.com/docs/guides/function-calling/parallel-function-calling?api-mode=chat#parallel-function-calling
- https://ai.google.dev/gemini-api/docs/function-calling?example=meeting#parallel_function_calling
I think I'm going to model this so it can ALWAYS represent multiple tool execution requests, then models that only support one tool at a time can populate a list...
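Very roughly, the shape I have in mind (every name here is a placeholder, not a final design):

```python
from dataclasses import dataclass, field


@dataclass
class SketchResponse:
    """Sketch: a response always carries a list of requested tool calls.

    Models that only support one tool call per turn simply populate a
    single-item list, so consuming code never has to special-case the
    parallel vs. single situation.
    """

    text: str = ""
    tool_calls: list[dict] = field(default_factory=list)
```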
Two ways this could work:

- We hold off on executing any tool calls until the request has finished coming in
- We integrate with the streaming mechanism, so a...
Since the `response` object is available inside that `.execute()` method already, the easiest thing to do would be to have a `response.request_tool(...)` method or similar. This can be called multiple...
What to call this? Claude returns messages of type `tool_use`:

```json
{
  "type": "tool_use",
  "id": "toolu_01A09q90qw90lq917835lq9",
  "name": "get_weather",
  "input": {
    "location": "San Francisco, CA",
    "unit": "celsius"
  }
}
```

OpenAI...
Decision: I'm going with `response.add_tool_call(...)` - the feature is called "tools" and having it as `add_tool_call()` reminds us that it can be called more than once. (I considered `response.request_tool_call()` but...
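Roughly what I expect that to look like from inside a plugin's `execute()` method. This is purely a sketch: `_generate()`, `chunk.tool_calls` and the exact `add_tool_call()` signature are all placeholders, not settled API.

```python
import llm


class ToolCapableModel(llm.Model):
    model_id = "tool-capable-model"

    def execute(self, prompt, stream, response, conversation):
        # _generate() stands in for whatever talks to the underlying model
        # and parses tool invocations out of its output
        for chunk in self._generate(prompt):
            if chunk.text:
                yield chunk.text
            for call in chunk.tool_calls:
                # Parallel tool calls just mean calling this more than once
                response.add_tool_call(name=call.name, arguments=call.arguments)
```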
Next question: do I need a `ToolCall` abstraction to paper over differences between different models? I'm going to assume so and start with that; I'll simplify later if it's not...
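Something along these lines could work as a first pass - a sketch only, where the field names and `from_*` constructors are guesses based on the documented Claude and OpenAI formats. The plain dicts in the earlier response sketch would become instances of this:

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolCall:
    """Provider-neutral representation of a single requested tool call."""

    name: str
    arguments: dict
    call_id: Optional[str] = None

    @classmethod
    def from_anthropic(cls, block: dict) -> "ToolCall":
        # Claude tool_use blocks carry already-parsed arguments in "input"
        return cls(name=block["name"], arguments=block["input"], call_id=block["id"])

    @classmethod
    def from_openai(cls, tool_call: dict) -> "ToolCall":
        # OpenAI chat completions deliver arguments as a JSON string
        return cls(
            name=tool_call["function"]["name"],
            arguments=json.loads(tool_call["function"]["arguments"]),
            call_id=tool_call["id"],
        )
```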