Custom Tool Parser for Open Source Models
Feature Description
When using LLM serving frameworks such as vLLM or MLC-LLM, or services that host open-source models like DeepInfra, Fireworks, or OpenRouter, you sometimes run into a situation where the model being served supports tool use but doesn't have a dedicated tool parser yet. This usually means the tools parameter in their OpenAI-compatible API either doesn't work or causes an error, and you have to parse the chat completion for tool calls manually after the request.
While creating a custom provider can address this, it needs ongoing maintenance and may mean missing out on new provider features unless they're manually implemented.
To address this, I suggest adding a setting for a custom tool parser that can be passed to the OpenAI provider when in compatible mode. This feature would let you define a function that processes either the response message (when using generateText) or a stream (when using streamText) to determine whether the response includes a tool call. That way, you keep all the benefits of the SDK's tool features while serving your own models or using an open-source model hosting service.
Example usage of a basic parser
import { createOpenAI } from '@ai-sdk/openai';
import { isParsableJson } from '@ai-sdk/provider-utils';
import type { LanguageModelV1StreamPart } from 'ai';

const llama = createOpenAI({
  // other settings
  compatibility: 'compatible',
  textToolParser: (response: string) => {
    if (!response.startsWith('<|python_tag|>')) return [];
    response = response.replace('<|python_tag|>', '');
    if (!isParsableJson(response)) {
      return [];
    }
    const parsed: Array<{ name: string; arguments: Record<string, unknown> }> =
      JSON.parse(response);
    return parsed;
  },
  streamToolParser: (chunk: LanguageModelV1StreamPart) => {
    if (chunk.type !== 'text-delta') return;
    if (chunk.textDelta.startsWith('<|python_tag|>')) {
      // rest of the implementation
    }
  },
});
Use Case
- If you're hosting your own models and want to incorporate tools into your project but the serving framework doesn't support tool use for that model or tools are not included in the chat template.
- When using a model hosting service that either doesn’t support tools for specific models or lacks tool functionality altogether.
- In projects where you use different models with different tool response formats, making it difficult to parse and handle tool calls in a consistent way.
Additional context
For my team’s project, we host several open-source models and switch between them based on the situation or context—Llama 3.1 for general conversations, Mistral for RAG use cases, Qwen for coding, etc. This has led to a lot of iteration on custom providers to support tool use across models, so having this level of customization natively in the SDK would be great. It would let us use the AI SDK for our internal LLM tooling as well (benchmarks, RAG arenas).
I'm not married to the example I showed above; we can discuss a different implementation. I'd be more than happy to work on this and submit a PR if that's helpful.
I would prefer a middleware implementation so it can easily be used with different providers such as Ollama, llama.cpp (future), and OpenAI-compatible endpoints.
Just to clarify - are you suggesting this feature should be implemented as middleware instead and added to the SDK?
If not, would you like me to contribute some documentation around this using middleware? I figure it might help others who are trying to use the SDK with their own hosted models and want an example of handling tool calls; it could include an example for Llama or Qwen 2.5.
I'm quite intrigued by this feature. The idea is that you define a system prompt containing the tool definitions for models that don't support tools in your middleware, and if a special token like <tool_call> is returned, you invoke the tool_call parser in your middleware to call the tool.
We can build a POC of something that is currently not feasible in the AI SDK. If it is successful and the performance is good, it can be included in a library such as ai/middleware.
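Roughly, the two halves of that idea could look like this (a hypothetical sketch; buildToolSystemPrompt and extractToolCalls are made-up names, not existing SDK APIs):

// Hypothetical sketch: inject tool definitions via the system prompt, then
// look for <tool_call> markers in the completion. Names are placeholders.
function buildToolSystemPrompt(tools: Array<{ name: string; description: string }>) {
  return [
    'You can call the following tools by replying with',
    '<tool_call>{"name": ..., "arguments": ...}</tool_call>.',
    `Available tools: ${JSON.stringify(tools)}`,
  ].join('\n');
}

function extractToolCalls(completion: string) {
  const matches = completion.matchAll(/<tool_call>([\s\S]*?)<\/tool_call>/g);
  return [...matches]
    .map(m => {
      try {
        return JSON.parse(m[1]) as { name: string; arguments: unknown };
      } catch {
        return null; // the model emitted malformed JSON
      }
    })
    .filter((call): call is { name: string; arguments: unknown } => call !== null);
}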
@ShervK For a quick POC full example, please refer to the repo below https://github.com/minpeter/ai-sdk-preview/tree/tool-call-middleware/packages/ai-config
I implemented Hermes 3-style tool calling via prompting and got decent performance. I tested on Qwen-based 32B and 72B models and successfully ran the multi-step example. However, when tested on smaller models, it often caused hallucinations or output incorrect JSON.
Also, I've currently hardcoded the schema and haven't considered parallel tool calls. It's not a big problem and will be fixed soon.
Additionally, we (that is, FriendliAI) are preparing to exclusively offer tool calls for custom models on dedicated endpoints. It includes a cool feature that suppresses hallucinations in tool calls on smaller models. If you're interested, drop me an email at [email protected]. (You can also try out Llama 3.1 8B, which already has this feature, on our serverless endpoints.)
@minpeter Thanks for the example! We're actually using middleware in our project for other features already but not for tool parsing.
In our case, we ended up making a vLLM provider with built-in tool parsers (similar to my first example) since we're running multiple models and wanted to keep the parsing logic close to each model's implementation. Made it way easier to test different models and kept our code cleaner by not having to juggle multiple middleware functions, or one big middleware.
That said, your POC with Hermes 3 is super interesting. Let me know if you need help with the parallel tools support you mentioned, we did something similar with llama 3.1 and got parallel tool calling working.
@ShervK That's interesting! The tool call templates I currently know of are Llama, Hermes, and Mistral v3 Tekken. Do you know of any other tool call formats for which a parser would need to be implemented?
It would be great if you could define and use this kind of middleware somewhere in the ai sdk library.
@minpeter Your implementation is great! With some slight changes it can be used with Qwen or Mistral models, so it's pretty well set up. However, it might cause some problems for users who try to hook it up to Llama 3.1 or 3.2 due to Llama's prompt format and how "chatty" it can be.
One problem that kept coming up for us is Llama's format for its code interpreter, which doesn't really have an end tag. They do have the <|eom_id|> token as the "end tag", but then you have to ask for the bos tokens on each request and filter them out, which we didn't like.
This also means you have to juggle when Llama is using the code interpreter vs. when it's using a built-in tool.
You can try to force Llama to adhere to your own format for some of this, but I've found that it reduces the reliability of its tool responses, hence why there are so many finetunes for using tools with Llama. I only mention all this because of how popular Llama 3 is.
Another thing is that Llama is quite chatty at temperatures higher than 0.5, so sometimes it'll respond like
Sure! Let me do that for you?
{tool call}
and I'll check this too
{tool call}
So our parser also handles these interleaved tool calls. We didn't want to discourage it since our users liked seeing responses like this; they said it makes it feel more interactive.
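To give a rough idea, splitting that kind of interleaved output apart could look something like this (a simplified sketch that assumes one JSON tool call per line, not our exact parser):

// Simplified sketch: walk the response line by line and treat lines that parse
// as {"name": ..., "parameters": ...} JSON as tool calls, everything else as text.
type Segment =
  | { kind: 'text'; text: string }
  | { kind: 'tool'; name: string; parameters: Record<string, unknown> };

function splitInterleavedResponse(response: string): Segment[] {
  const segments: Segment[] = [];
  for (const line of response.split('\n')) {
    const trimmed = line.replace('<|python_tag|>', '').trim();
    if (trimmed.startsWith('{')) {
      try {
        const parsed = JSON.parse(trimmed);
        if (typeof parsed.name === 'string') {
          segments.push({ kind: 'tool', name: parsed.name, parameters: parsed.parameters ?? {} });
          continue;
        }
      } catch {
        // not valid JSON, fall through and treat it as text
      }
    }
    if (trimmed.length > 0) segments.push({ kind: 'text', text: trimmed });
  }
  return segments;
}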
Lastly, the smaller Llama models (1B/3B) can be inconsistent with their tool calling:
- Sometimes they'll still include the <|python_tag|> even if it's a JSON response.
- When doing parallel tool calling, they might separate the tool JSON with ; instead of ,.
- They might wrap the JSON tool call in a ```json ... ``` markdown code block.
That last point shows up more often with smaller models in general; I've seen it with Qwen2.5 0.5B-7B and Ministral 3B. So it might help to include a replace:
const cleanedToolCalls = toolCallStrings.map((toolCall) =>
  toolCall
    .replace("```json", "")
    .replace("```", "")
    // ...other cleanup...
    .trim()
);
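A fuller cleanup pass covering all three quirks above could look roughly like this (just a sketch; normalizeToolCallText and its splitting heuristic are illustrative, not what we actually ship):

// Sketch of a normalization pass for the quirks above: stray <|python_tag|>,
// markdown fences, and ';' used to separate parallel calls.
function normalizeToolCallText(rawToolCallText: string): string[] {
  const cleaned = rawToolCallText
    .replace(/<\|python_tag\|>/g, '')
    .replace(/```json/g, '')
    .replace(/```/g, '')
    .trim();
  // Split parallel calls separated by '};' and restore the brace we split on.
  return cleaned
    .split(/\}\s*;\s*/)
    .map((part, i, arr) => (i < arr.length - 1 ? part + '}' : part))
    .map(part => part.trim())
    .filter(Boolean);
}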
Anyway, it might be beneficial to include your example for regular JSON-based tool formats as well as one for Llama 3. I think it would help a lot of people to include either an example or a note on some of this stuff so they don't go through the same problems I did. I'm happy to help and contribute to this, whether it's writing docs or code examples.
@ShervK It would be nice to see some guidance added for small models. If there was an option to "correct" all possible mistakes, it would improve tool calling performance.
To make any random model good at tool use, two challenges need to be solved.
- Based on the tool information included in the system prompt, derive structured output in a parsable format (the structured output style that performs well may vary by model).
- Execute the tool_call successfully and feed the results back in a way the model can understand, without conflicting with the existing template.
Assuming that we cannot modify the base template of the model, there is only one thing we can control: the messages themselves (assuming that all base chat templates are constrained templates that only support rendering system, user, and assistant roles).
Here are the structured output formats that each model I know of handles well:
Llama [Built-in Tools (Brave, Wolfram)]
# for Search
<|python_tag|>
brave_search.call(query="...")
<|eom_id|>
# for Wolfram
<|python_tag|>
wolfram_alpha.call(query="...")
<|eom_id|>
{"name": "get_current_conditions", "parameters": {"location": "San Francisco, CA", "unit": "Fahrenheit"}}<|eot_id|>
Llama [User-defined Custom tool calling]
<function=spotify_trending_songs>{"n": "5"}</function><|eom_id|>
Qwen, Hermes
<tool_call>
{"arguments": <args-dict>, "name": <function-name>}
</tool_call>
Mistral v3 Tekken
[TOOL_CALLS] [{"name": "get_current_weather", "arguments": {"location": "Paris, France", "format": "celsius"}, "id": "VvvODy9mT"}]</s>
First, it should be investigated which of the three formats supported by the Llama model would work best with custom tools.
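As a starting point, something like this rough sketch could classify which shape a completion came back in before picking a parser (the format labels are informal):

// Rough sketch: classify a Llama completion into one of the output shapes
// above so the matching parser can be picked.
type LlamaToolFormat = 'builtin-python-tag' | 'custom-function-tag' | 'json' | 'none';

function detectLlamaToolFormat(completion: string): LlamaToolFormat {
  const text = completion.trim();
  if (text.startsWith('<|python_tag|>')) return 'builtin-python-tag';
  if (/^<function=[A-Za-z0-9_-]+>/.test(text)) return 'custom-function-tag';
  if (text.startsWith('{') && text.includes('"name"')) return 'json';
  return 'none';
}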
@minpeter Just to make sure I understand - are you suggesting we implement a middleware that adds tool calling support to models that don't natively support it?
My original feature request was actually focused on parsing tool calls from models that already support tools but are being served through APIs that don't expose the tool parameter.
While standardizing tool formats across models would be interesting, I think it might be risky since models not trained for tool use could generate unreliable responses.
First of all, it is true that this cannot be used with models that have not learned how to use tools. If a specific API blocks the tools field, we cannot insert tool-related content into that model's chat template. So we have to approach it assuming that only system, user, and assistant roles are available. This does not mean that a model that has not learned tool calls can suddenly do this; it is just a description of how to bypass the interface in front of the model.
For example, let's say you're running the Mistral v0.3 7B model on a vLLM endpoint without adding the --enable-auto-tool-choice --tool-call-parser mistral options.
We know that the Mistral chat template contains rendering logic for the 'tool' role and for the assistant to make a tool_call, but since vLLM doesn't expose it, we need to disguise the tool call as a conversation between the assistant and the user.
This relates to challenge number 2 mentioned above.
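To illustrate the decoration, re-encoding a tool round trip as plain turns could look roughly like this (a sketch with a simplified message shape; the [TOOL_CALLS]/[TOOL_RESULTS] markers follow Mistral's template convention, so the exact strings should be checked against the model's actual chat template):

// Sketch: re-encode a tool call and its result as plain assistant/user turns
// when the endpoint only accepts system/user/assistant messages.
type SimpleMessage = { role: 'system' | 'user' | 'assistant'; content: string };

function encodeToolRoundTrip(
  toolCall: { name: string; arguments: Record<string, unknown> },
  toolResult: unknown
): SimpleMessage[] {
  return [
    {
      role: 'assistant',
      content: `[TOOL_CALLS] ${JSON.stringify([toolCall])}`,
    },
    {
      role: 'user',
      content: `[TOOL_RESULTS] ${JSON.stringify({ content: toolResult })} [/TOOL_RESULTS]`,
    },
  ];
}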
@minpeter Ohhh my mistake, I was getting lost there for a second. I submitted a draft PR to your POC for Llama 3 tool parsing to show how we've been doing it; wanted to get opinions on it. I opted for JSON-based tool calling since even the reference llama-stack implementation has the JSON format as the default, but I still added support for Llama's built-in tools. More details in the PR description.
It seems like we can achieve the desired behavior without modifying core functionality. I'll take the time to continue exploring it.
Closing this issue since it's decided that middleware is the way to go, with a link to an example thanks to @minpeter.
Keeping it open in case we want to integrate the middleware into the sdk
I implemented a streaming XML-based tool-calling middleware that uses existing tool calls, maps them into XML, and then emits tool call deltas and full tool calls back to the caller as if they were regular tool calls.
An interesting aspect is that I must remove the tool calls from the API requests to the language models while retaining them for the AI SDK, so it recognizes the emitted tool-call-deltas and tool-calls as valid.
The code before doGenerate retains a reference to the original tools as they were passed to streamText, ensuring they remain valid tools for the language model to call.
I hope the AI SDK remains compatible with this type of tool-calling middleware, but the approach is non-obvious. We probably need more middleware hooks to specifically allow for specialized tool call handling.
Here's roughly how it looks:
import {
LanguageModelV1CallOptions,
Experimental_LanguageModelV1Middleware as LanguageModelV1Middleware,
LanguageModelV1StreamPart,
} from "ai";
import invariant from "tiny-invariant";
export function createXMLToolCallingMiddleware(options: {
strict?: boolean;
}): {
middleware: LanguageModelV1Middleware;
} {
const { strict = true } = options;
const middleware: LanguageModelV1Middleware = {
// Transform the call options before the request is sent
transformParams: async (opts) => {
const { mode } = opts.params;
invariant(mode.type === "regular", "Mode must be regular");
return {
...opts.params,
hasTextChunkTool: mode.tools?.some(t => t.name === "TextChunkTool") || false,
originalTools: mode.tools,
mode: { ...mode, tools: undefined }, // Hide tools from the model
};
},
// Wrap the streaming response to parse tool calls
wrapStream: async ({ doStream, params }) => {
const { hasTextChunkTool, originalTools } = params as any;
const { stream, ...rest } = await doStream();
const transformStream = new TransformStream<LanguageModelV1StreamPart, LanguageModelV1StreamPart>({
async transform(chunk, controller) {
if (chunk.type === "text-delta") {
// Parse text and detect tool calls (simplified)
// Replace with your parsing logic
const parsed = parseToolCalls(chunk.textDelta, originalTools);
parsed.forEach(part => controller.enqueue(part));
} else {
controller.enqueue(chunk);
}
if (!hasTextChunkTool) {
console.warn("TextChunkTool is missing.");
}
},
});
return {
stream: stream.pipeThrough(transformStream),
...rest,
};
},
};
return { middleware };
}
// Placeholder for tool call parsing logic
function parseToolCalls(
text: string,
tools: any[]
): LanguageModelV1StreamPart[] {
// Implement XML parsing and tool call extraction
return [];
}
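Usage would then look roughly like this (hypothetical; depending on the AI SDK version the wrapper may be exported as experimental_wrapLanguageModel or wrapLanguageModel, and the provider and model id here are placeholders):

// Hypothetical usage of the middleware above; provider and model id are placeholders.
import { createOpenAI } from '@ai-sdk/openai';
import { experimental_wrapLanguageModel as wrapLanguageModel } from 'ai';

const { middleware } = createXMLToolCallingMiddleware({ strict: true });

const model = wrapLanguageModel({
  model: createOpenAI({ baseURL: 'http://localhost:8000/v1' })('my-hosted-model'),
  middleware,
});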
https://github.com/minpeter/ai-sdk-tool-call-middleware
Here is a Hermes function calling parser that works perfectly in both scenarios (generateText, streamText). I will refine the behavior a bit more and create a PR.
import { createOpenAICompatible } from '@ai-sdk/openai-compatible';
import { wrapLanguageModel, streamText } from 'ai';
import { hermesToolMiddleware } from '@ai-sdk-tool/parser';
const openrouter = createOpenAICompatible({ /* ... */ });
async function main() {
const result = streamText({
model: wrapLanguageModel({
model: openrouter('google/gemma-3-27b-it'),
middleware: hermesToolMiddleware,
}),
system: 'You are a helpful assistant.',
prompt: 'What is the weather in my city?',
maxSteps: 4,
tools: {
get_location: { /* ... */ },
get_weather: { /* ... */ },
},
});
for await (const part of result.fullStream) {
// ...handling text-delta and tool-result...
}
}
main().catch(console.error);
Finally the implementation discussed in this thread is complete, please take a look at the PR and share your thoughts :) https://github.com/vercel/ai/pull/5858
.cc @ShervK
@lgrammel Can we end this issue?
Let's leave it open until AI SDK 5 ships.
Congratulations on the official release!! 🎉
Pretty cool! But what about generateObject? The model doesn't seem to get the schema to generate against.
Edit: This is a problem with the OpenRouter AI Provider (https://github.com/OpenRouterTeam/ai-sdk-provider/issues/120)
@techwithanirudh Thank you for your interest!! If you have any problems while using it, please leave an issue in the https://github.com/minpeter/ai-sdk-tool-call-middleware repo.
@minpeter stream-handler.ts
import type {
LanguageModelV2StreamPart,
LanguageModelV2,
LanguageModelV2Usage,
LanguageModelV2FinishReason,
} from "@ai-sdk/provider";
import { generateId } from "@ai-sdk/provider-utils";
import { getPotentialStartIndex } from "./utils";
export async function normalToolStream({
doStream,
toolCallTag,
toolCallEndTag,
}: {
doStream: () => ReturnType<LanguageModelV2["doStream"]>;
toolCallTag: string;
toolCallEndTag: string;
}) {
const { stream, ...rest } = await doStream();
let isFirstToolCall = true;
let isFirstText = true;
let afterSwitch = false;
let isToolCall = false;
let buffer = "";
let toolCallIndex = -1;
let toolCallBuffer: string[] = [];
// Track text chunks for start/delta/end pattern
let currentTextId: string | null = null;
let hasEmittedTextStart = false;
const transformStream = new TransformStream<
LanguageModelV2StreamPart,
LanguageModelV2StreamPart
>({
transform(chunk, controller) {
if (chunk.type === "finish") {
// Handle incomplete tool calls by restoring them as text
if (
isToolCall &&
(buffer.length > 0 ||
(toolCallIndex >= 0 && toolCallBuffer[toolCallIndex]))
) {
// Start a new text chunk if needed
if (!currentTextId) {
currentTextId = generateId();
controller.enqueue({
type: "text-start",
id: currentTextId,
});
hasEmittedTextStart = true;
}
// Add the incomplete tool call back as text (without end tag)
const incompleteContent =
(toolCallBuffer[toolCallIndex] || "") + buffer;
controller.enqueue({
type: "text-delta",
id: currentTextId,
delta: toolCallTag + incompleteContent,
});
// Clear the current incomplete tool call from the buffer
if (toolCallIndex >= 0) {
toolCallBuffer = toolCallBuffer.slice(0, toolCallIndex);
}
}
// End any active text chunk before processing tool calls
if (currentTextId && hasEmittedTextStart) {
controller.enqueue({
type: "text-end",
id: currentTextId,
});
currentTextId = null;
hasEmittedTextStart = false;
}
if (toolCallBuffer.length > 0) {
toolCallBuffer.forEach(toolCall => {
// Normalize: if toolCall includes surrounding tags, strip them
let raw = toolCall;
if (raw.startsWith(toolCallTag)) {
raw = raw.slice(toolCallTag.length);
}
if (raw.endsWith(toolCallEndTag)) {
raw = raw.slice(0, -toolCallEndTag.length);
}
// Parse KorinAI XML/text format (only)
try {
const text = raw.trim();
const lines = text
.split(/\r?\n/)
.map(l => l.trim())
.filter(Boolean);
if (lines.length > 0) {
const name = lines[0];
if (/^[A-Za-z0-9_-]+$/.test(name)) {
const args: Record<string, unknown> = {};
// Build remaining text after the first line to support multi-line tag contents
const restIndex = text.indexOf(name) + name.length;
const remainingText = text.slice(restIndex);
// Use a regex to capture <key>...</key> across multiple lines
const tagRe = /<([a-zA-Z0-9_-]+)>([\s\S]*?)<\/\1>/g;
let m;
while ((m = tagRe.exec(remainingText)) !== null) {
const k = m[1];
const v = m[2];
try {
args[k] = JSON.parse(v.trim());
} catch (e) {
args[k] = v.trim();
}
}
// If args only contains a top-level `arguments` key, unwrap it
let inputStr: string;
if (
Object.keys(args).length === 1 &&
Object.prototype.hasOwnProperty.call(args, "arguments")
) {
const v = (args as any)["arguments"];
inputStr = typeof v === "string" ? v : JSON.stringify(v);
} else {
inputStr = JSON.stringify(args);
}
controller.enqueue({
type: "tool-call",
toolCallId: generateId(),
toolName: name,
input: inputStr,
});
return;
}
}
} catch (e) {
// parsing failed
}
// If parsing failed or name invalid, restore original text
console.error(`Error parsing tool call: ${toolCall}`);
const errorId = generateId();
controller.enqueue({ type: "text-start", id: errorId });
controller.enqueue({
type: "text-delta",
id: errorId,
delta: `${toolCallTag}${toolCall}${toolCallEndTag}`,
});
controller.enqueue({ type: "text-end", id: errorId });
});
}
// stop token
controller.enqueue(chunk);
return;
} else if (chunk.type !== "text-delta") {
controller.enqueue(chunk);
return;
}
buffer += chunk.delta;
function publish(text: string) {
if (text.length > 0 || isToolCall) {
const prefix =
afterSwitch && (isToolCall ? !isFirstToolCall : !isFirstText)
? "\n" // separator
: "";
if (isToolCall) {
// End any active text chunk when switching to tool call
if (currentTextId && hasEmittedTextStart) {
controller.enqueue({
type: "text-end",
id: currentTextId,
});
currentTextId = null;
hasEmittedTextStart = false;
}
if (!toolCallBuffer[toolCallIndex]) {
toolCallBuffer[toolCallIndex] = "";
}
toolCallBuffer[toolCallIndex] += text;
} else if (text.length > 0) {
// Start a new text chunk if needed
if (!currentTextId) {
currentTextId = generateId();
controller.enqueue({
type: "text-start",
id: currentTextId,
});
hasEmittedTextStart = true;
}
controller.enqueue({
type: "text-delta",
id: currentTextId,
delta: prefix + text,
});
}
afterSwitch = false;
if (isToolCall) {
isFirstToolCall = false;
} else {
isFirstText = false;
}
}
}
do {
const nextTag = isToolCall ? toolCallEndTag : toolCallTag;
const startIndex = getPotentialStartIndex(buffer, nextTag);
// no opening or closing tag found, publish the buffer
if (startIndex == null) {
publish(buffer);
buffer = "";
break;
}
const foundFullMatch = startIndex + nextTag.length <= buffer.length;
if (foundFullMatch) {
// publish text before the tag
publish(buffer.slice(0, startIndex));
buffer = buffer.slice(startIndex + nextTag.length);
toolCallIndex++;
isToolCall = !isToolCall;
afterSwitch = true;
} else {
// Partial match found, wait for more data to complete the tag.
break;
}
} while (true);
},
});
return {
stream: stream?.pipeThrough(transformStream) ?? new ReadableStream(),
...rest,
};
}
// TODO: Modify tool calls to be streamed
export async function toolChoiceStream({
doGenerate,
}: {
doGenerate: () => ReturnType<LanguageModelV2["doGenerate"]>;
}) {
const result = await doGenerate();
// Assume result.content[0] contains tool-call information; try JSON or KorinAI XML/text
let toolName = "unknown";
let toolArgs: Record<string, unknown> = {};
if (result?.content && result.content.length > 0 && result.content[0]?.type === "text") {
const text = result.content[0].text;
const lines = text.split(/\r?\n/).map(l => l.trim()).filter(Boolean);
if (lines.length > 0 && /^[A-Za-z0-9_-]+$/.test(lines[0])) {
toolName = lines[0];
const tagRe = /<([a-zA-Z0-9_-]+)>([\s\S]*?)<\/\1>/g;
let m;
const restIndex = text.indexOf(lines[0]) + lines[0].length;
const remainingText = text.slice(restIndex);
while ((m = tagRe.exec(remainingText)) !== null) {
const k = m[1];
const v = m[2];
try {
toolArgs[k] = JSON.parse(v.trim());
} catch (e) {
toolArgs[k] = v.trim();
}
}
}
}
const toolCallChunk: LanguageModelV2StreamPart = {
type: "tool-call",
toolCallId: generateId(),
toolName,
input: JSON.stringify(toolArgs || {}),
};
const finishChunk: LanguageModelV2StreamPart = {
type: "finish",
usage:
result?.usage ||
// TODO: If possible, try to return a certain amount of LLM usage.
({
inputTokens: 0,
outputTokens: 0,
totalTokens: 0,
} as LanguageModelV2Usage),
finishReason: "tool-calls" as LanguageModelV2FinishReason,
};
const stream = new ReadableStream<LanguageModelV2StreamPart>({
start(controller) {
controller.enqueue(toolCallChunk);
controller.enqueue(finishChunk);
controller.close();
},
});
return {
request: result?.request || {},
response: result?.response || {},
stream,
};
}
index.ts
import { createToolMiddleware } from "./tool-call-middleware";
const korinaiToolMiddleware = createToolMiddleware({
toolSystemPromptTemplate(tools: string) {
return `You are KorinAI, a function-calling AI model.
You are provided with function signatures within <tools></tools> XML tags.
You may call one or more functions to assist with the user query.
Don't make assumptions about what values to plug into functions.
Here are the available tools: <tools>${tools}</tools>
For each function call return the call wrapped in <tool_call>...</tool_call> tags and nothing else.
Example KorinAI-style call (text form):
<tool_call>
get_weather
<location>
San Francisco
</location>
</tool_call>`;
},
toolCallTag: "<tool_call>",
toolCallEndTag: "</tool_call>",
toolResponseTag: "<tool_response>",
toolResponseEndTag: "</tool_response>",
});
export { korinaiToolMiddleware, createToolMiddleware };
I changed it to this, but the tool name is not being parsed. The tool-call part looks like:
{
"type": "tool-\nterminal_run\n<command>\necho \"Computer is active - $(date)\"\n</command>\n<runInBackground>\nfalse\n</runInBackground>",
"toolCallId": "call_202508271705064ae14a962b8d4a13_0",
"state": "output-error",
"rawInput": "{}",
"errorText": "Model tried to call unavailable tool '\nterminal_run\n<command>\necho \"Computer is active - $(date)\"\n</command>\n<runInBackground>\nfalse\n</runInBackground>'. Available tools: file_write, file_patch, file_read, file_list, file_search, file_download, terminal_run, terminal_get, terminal_send, routing_to_agent."
},
can you help? Thanks
@sijawara Please leave an issue at https://github.com/minpeter/ai-sdk-tool-call-middleware and I will be happy to help.
@lgrammel Should we merge it into the AI SDK now?