Feature Request: Support for tool calling in LLM adapters
Language models use JSON Schema, MCP uses JSON Schema, OpenAPI uses JSON Schema, but LiveKit uses Python functions. This creates a mismatch between the way tools actually interact with LLMs and the LiveKit API, which makes it challenging to support things like MCP, OpenAPI, etc. without mapping the schema to a Python function and back again, as is done in the MCP sample here:
https://github.com/livekit-examples/basic-mcp
While this sample gets the job done, it's full of complicated code that shouldn't need to be written in the first place. It only exists because of the constraint that tools must be Python functions, which is not an LLM-native requirement; it's a LiveKit-imposed one.
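For illustration, the bridging layer that sample has to write looks roughly like this (the names and details here are invented, not the sample's actual code): a JSON-Schema-described tool gets wrapped in a dynamically-signed Python function purely so a function-based framework can introspect it and regenerate the schema later.

```python
import inspect
from typing import Any


def schema_to_python_function(name: str, schema: dict[str, Any], call_tool):
    """Wrap a JSON-Schema-described tool in a Python function with a faked signature."""
    params = [
        # Every property collapses to `str` here; enums, nested objects and
        # descriptions are exactly the detail that gets lost in this round trip.
        inspect.Parameter(prop, inspect.Parameter.KEYWORD_ONLY, annotation=str)
        for prop in schema.get("properties", {})
    ]

    async def wrapper(**kwargs):
        return await call_tool(name, kwargs)

    # Fake the signature so downstream introspection sees the schema's fields.
    wrapper.__signature__ = inspect.Signature(params)
    wrapper.__name__ = name
    return wrapper
```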
With the release of the Responses API, OpenAI has also added an alternative tool calling model for built-in tools, and it's not clear how to plug those tools into a LiveKit voice agent due to the way tool calling is implemented.
In a perfect world, the tool calling process would be overridable / customizable by the LLM adapter itself or by a tool calling adapter, so someone could create an adapter that integrates better with native LLM capabilities or standards like MCP, instead of having to manage a lossy conversion to Python functions and back again. At the moment, tool calling is implemented in a central set of functions that the LLM adapter isn't even involved in, which makes it very hard to customize tool calling to use functionality LLMs already support without forking the entire SDK:
https://github.com/livekit/agents/blob/4c3d980d287b93d1fb4417f35098e33f178d2128/livekit-agents/livekit/agents/voice/agent_activity.py#L1315
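As a sketch of the seam I mean (none of these names exist in livekit-agents today; this is purely hypothetical), an overridable hook could be as small as:

```python
from typing import Any, Protocol


class ToolCallingAdapter(Protocol):
    """Hypothetical hook an LLM adapter could implement to own tool calling."""

    def format_tools(self, tools: list[dict[str, Any]]) -> Any:
        """Translate raw JSON Schema tool definitions into the provider's wire format."""
        ...

    async def execute_tool_call(self, name: str, arguments: dict[str, Any]) -> Any:
        """Run a requested tool call and return the result to feed back to the LLM."""
        ...
```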
I filed an issue yesterday (https://github.com/livekit/agents/issues/1955) that is a manifestation of tools being represented by Python functions in LiveKit. I'm in favor of any approach that allows us to specify the full breadth of JSON Schema.
We have a pretty large set of tools we've built, and we've been using a pydantic model to define the arguments. It's worked very well - you get built-in validation, field descriptions without parsing docstrings, etc. It may not be the perfect data model for defining functions, but I think it's definitely a closer representation of JSON Schema than a plain old Python function.
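For example, something along these lines (`BookFlightArgs` is just a made-up example, not one of our actual tools):

```python
from pydantic import BaseModel, Field


class BookFlightArgs(BaseModel):
    # Descriptions land directly in the generated JSON Schema, no docstring parsing.
    origin: str = Field(description="IATA code of the departure airport")
    passengers: int = Field(ge=1, le=9, description="Number of seats to book")


# The schema can be handed to an LLM as-is, and the same model validates the
# arguments the LLM sends back.
schema = BookFlightArgs.model_json_schema()
args = BookFlightArgs.model_validate({"origin": "LHR", "passengers": 2})
```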
For what it's worth, most of the other frameworks (OpenAI Agents, Pydantic AI, LangChain, etc.) all use the Python function model, but I haven't been able to track down "why".
The interesting case is https://github.com/livekit/agents/issues/1955#issuecomment-2795672931, and https://github.com/livekit/agents/blob/86017364d3cd50f08397311d66fc47788279820b/livekit-agents/livekit/agents/llm/utils.py#L172-L174 shows that internally the function is transformed into a pydantic model anyway, at least for the OpenAI and Anthropic LLMs.
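Conceptually the conversion is something like this (a simplified sketch, not the actual livekit code):

```python
import inspect

from pydantic import create_model


def function_to_json_schema(fn) -> dict:
    # Assumes every parameter is annotated; (annotation, default) tuples become
    # pydantic fields, with `...` marking required parameters.
    fields = {
        name: (param.annotation,
               ... if param.default is inspect.Parameter.empty else param.default)
        for name, param in inspect.signature(fn).parameters.items()
    }
    args_model = create_model(f"{fn.__name__}_args", **fields)
    # pydantic then emits the JSON Schema that goes over the wire to the provider.
    return args_model.model_json_schema()
```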
Proposing a couple of different directions:
- Update `FunctionTool` so it maintains an internal pydantic representation of the function (see existing here).
- Maybe add another decorator fn that builds a `FunctionTool` from a different signature (something like `ctx: ToolContext, args: PydanticModel`).
This is similar to what Vercel does, but with Zod instead of pydantic, and they also support optionally defining the JSON Schema manually.
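A rough sketch of what the second option could look like (every name here is invented, not an existing API):

```python
from typing import Callable, Type

from pydantic import BaseModel


def pydantic_tool(args_model: Type[BaseModel], *, name: str | None = None):
    """Hypothetical decorator: the args model is the source of truth for the schema."""
    def decorator(fn: Callable) -> Callable:
        fn.__tool_schema__ = {
            "name": name or fn.__name__,
            "description": fn.__doc__ or "",
            "parameters": args_model.model_json_schema(),
        }
        # Kept around so the framework can validate/parse the LLM's arguments
        # with args_model.model_validate(...) before invoking the function.
        fn.__args_model__ = args_model
        return fn
    return decorator


# usage would look something like:
# @pydantic_tool(MyArgs)
# async def my_tool(ctx: ToolContext, args: MyArgs): ...
```

An optional manual-schema override on the same decorator would also cover the "define the JSON Schema by hand" case that Vercel supports.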
Very happy to put together a PR
@theomonnom saw your PR 🙌
Any chance raw function tools could support context - similar to normal function tools? Happy to put up a quick PR iterating on
https://github.com/elyosenergy/agents/blob/c839ac2fcbdf1b80f159095e4ea22f676aa1f05e/livekit-agents/livekit/agents/voice/generation.py#L341-L350
Couldn't help myself https://github.com/livekit/agents/pull/2073
Thanks, I moved your PR here
@mnbbrown Yeah, pydantic is pretty decent. Not as good as having direct access to the lower levels in a lot of cases, but it's really hard to beat for the higher-level interface.
I just started implementing some tool calling, and I've noticed this pattern seems to work well so far:
```python
from pydantic import BaseModel, Field

from livekit.agents import Agent, RunContext, function_tool


class MyModel(BaseModel):
    field_1: list[str] = Field(description='...')
    field_2: SomeEnum = Field(description='...')  # SomeEnum: any plain Enum subclass


class MyAgent(Agent):
    @function_tool
    def my_tool(self, context: RunContext, special_object: MyModel): ...
```
Any downsides or gotchas I should be aware of when doing it this way, as opposed to specifying individual fields as tool function arguments?
There is a maximum depth of 5 levels on OpenAI schemas, so that's something to keep in mind if you start doing anything with deeper nesting.
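To make that concrete, a chain of nested models like the sketch below already approaches the limit once the wrapping objects and arrays are counted (exactly how levels are counted depends on the generated schema):

```python
from pydantic import BaseModel


class Leaf(BaseModel):
    value: str


class Branch(BaseModel):
    leaves: list[Leaf]  # object -> array -> object already costs several levels


class Trunk(BaseModel):
    branches: list[Branch]


class ToolArgs(BaseModel):  # top-level parameters object
    trunk: Trunk
```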