Built-In Multimodality, Structured Outputs, and Tool Calling Syntax for llm_generate (OpenAI & vLLM/SGLang)
Is your feature request related to a problem?
TLDR:
Current llm_generate usage in Daft is limited to flat string prompts and a chat interface with no way to override the message history. This limits Daft's ability to support:
- Multimodal inputs (images, audio, URIs).
- Structured outputs (Pydantic / JSON schema, regex, EBNF).
- Tool calling and assistant–tool role message flows.
- Provider-specific differences (OpenAI vs. vLLM/SGLang/TensorRT-LLM).
Without this, developers have to drop down into raw clients instead of composing Daft-native pipelines.
Details
This issue details the requirements for both OpenAI (via the OpenAI API) and OpenAI-compatible inference providers: the usage patterns for tool use, structured generation, and multimodal inputs.
Now that we have AsyncOpenAI client calls built into the llm_generate function, Daft is well on its way to providing all of the major features for LLM workloads. Adding support for structured generation is a powerful feature, but it requires nuanced handling of the syntax used by specific inference providers and servers. The implications of supporting structured generation are wide-reaching and meaningful for the Daft community.
Because I know people will ask: structured generation has matured considerably since Jason Liu's "Pydantic is all you need" talk at the AI Engineer World's Fair back in 2024. Structured output engines like Outlines, XGrammar, and Guidance have formalized the approach, and inference engines like SGLang and vLLM have added support for each of them. Outlines tends to be used by default. Each comes with its own strengths and weaknesses; as it currently stands, Guidance, with its core engine written in Rust, stands out as the performance leader.
The leading inference engines (vLLM, SGLang, and NVIDIA TensorRT-LLM) have been quick to support each of the major structured output engines, which can be toggled with parameters at the CLI. Luckily, all three use the same syntax for structured outputs when called through the OpenAI client, so this works cleanly with the llm_generate function's use of AsyncOpenAI.
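To make that concrete, here is a minimal sketch of what the shared OpenAI-compatible call looks like; the base_url, model name, and the exact extra-body key are placeholder assumptions, so check your server's documentation:

```python
import asyncio
from openai import AsyncOpenAI
from pydantic import BaseModel

class Movie(BaseModel):
    title: str
    year: int

async def main():
    # The same client talks to OpenAI or to a local OpenAI-compatible server
    # (base_url and model below are placeholders).
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Name a classic sci-fi movie as JSON."}],
        # vLLM/SGLang accept structured-output args as extra body fields;
        # vanilla OpenAI uses response_format={"type": "json_schema", ...} instead.
        extra_body={"guided_json": Movie.model_json_schema()},
    )
    print(resp.choices[0].message.content)

asyncio.run(main())
```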
MOST CRITICALLY, structured outputs on inference servers (like vLLM) have different argument requirements than vanilla OpenAI. The current llm_generate function strictly implements the chat interface with a chat history that cannot be overridden, which limits the flexibility and capability of the function.
I don't think it's a stretch to say that Daft can and should support both simple completions and full chat workloads, which brings us to the canonical shape of OpenAI messages:
Messages Array
Each element is a message object with at least a role and content.
Valid roles are:
"system""user""assistant""tool"(used to return tool outputs back into the thread)
Message object
```
{
  "role": "system" | "user" | "assistant" | "tool",
  "content": string | [ content_part, ... ],
  "name": string?,          // optional name (e.g. function/tool name)
  "tool_call_id": string?   // used when replying to a specific tool call
}
```
Content
- Simplest form: a single string, e.g. "Hello!".
- Rich form: a list of "content parts." This is how you do multimodal content.
Content part object
```
// text
{ "type": "text", "text": "hello world" }

// images
{ "type": "image_url", "image_url": { "url": "https://..." } }

// audio (via URL)
{ "type": "audio_url", "audio_url": { "url": "https://..." } }

// audio (inline base64, "input_audio")
{ "type": "input_audio",
  "input_audio": { "data": "<base64 string>", "format": "wav" } }
```
Currently documented content types:
"text""image_url""input_audio"(base64 + format)"audio_url"(URL, supported in some contexts)"tool_call"(returned by assistant when it wants to call a tool)
Tool calling (assistant message special case)
When the assistant wants to call a tool, the message has tool_calls instead of content:
```
{
  "role": "assistant",
  "tool_calls": [
    {
      "id": "call_abc123",
      "type": "function",
      "function": {
        "name": "get_weather",
        "arguments": "{ \"location\": \"Chicago\" }"   // JSON string
      }
    }
  ]
}
```
Tool role messages
When you return the tool's output, you push a message with `role: "tool"`:
```
{
  "role": "tool",
  "tool_call_id": "call_abc123",
  "content": "72°F and sunny"
}
```
This keeps the thread consistent.
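A minimal end-to-end sketch of that roundtrip through the OpenAI client is below; the model name and the get_weather tool are placeholders, and the tool result is hard-coded rather than computed:

```python
import json
from openai import OpenAI

client = OpenAI()  # or an OpenAI-compatible server via base_url=...

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Chicago?"}]
resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
call = resp.choices[0].message.tool_calls[0]

# Append the assistant's tool-call message, then the tool result, and ask again.
messages.append(resp.choices[0].message)
messages.append({
    "role": "tool",
    "tool_call_id": call.id,
    "content": json.dumps({"temp_f": 72, "conditions": "sunny"}),
})
final = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
print(final.choices[0].message.content)
```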
Putting it all together
A typical multimodal + tool call thread might look like:
```
[
  { "role": "system", "content": "You are a helpful assistant." },
  { "role": "user", "content": [
      { "type": "text", "text": "What's in this picture?" },
      { "type": "image_url", "image_url": { "url": "https://..." } }
    ]
  },
  { "role": "assistant", "content": "It looks like a golden retriever." },
  { "role": "assistant", "tool_calls": [
      { "id": "abc123", "type": "function",
        "function": { "name": "dog_facts", "arguments": "{ \"breed\": \"golden retriever\" }" }
      }
    ]
  },
  { "role": "tool", "tool_call_id": "abc123", "content": "Golden retrievers are friendly." }
]
```
The schema is deliberately loose:
- `role` is required.
- `content` can be a string or a list of parts.
- Extra fields (`name`, `tool_call_id`, `tool_calls`) appear only in special cases.
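A tiny illustration of how loose this is in practice; `to_user_message` is a hypothetical helper, not an existing Daft or OpenAI API:

```python
def to_user_message(prompt: str, image_url: str | None = None) -> dict:
    """Collapse a flat prompt (plus optional image) into an OpenAI-style message.

    Hypothetical helper: `content` may be a plain string or a list of content
    parts, matching the loose schema described above.
    """
    if image_url is None:
        return {"role": "user", "content": prompt}
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

# e.g. to_user_message("What's in this picture?", "https://example.com/dog.jpg")
```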
Finally, I want to acknowledge that as of GPT-5, OpenAI has introduced a new structured output format based on XML, called Harmony. Harmony will have its own set of requirements and should be addressed in another issue.
Describe the solution you'd like
I propose extending llm_generate to:
1. Unify inputs
   - Accept `Expression` for:
     - `prompt` (simple strings → completions)
     - `messages` (list[Struct] → chat)
     - `image`, `audio` (Daft `dt.Image`, `dt.Audio`, or URI strings)
   - Collapse all multimodal parts into OpenAI-compatible `messages[].content`.
2. Support structured outputs
   - Add `response_model: BaseModel | None` for Pydantic validation.
   - Pass structured generation args (`json_schema`, `guided_json`, `regex`, etc.) transparently to both OpenAI and inference servers.
3. Provider parity
   - For `provider="openai"`: route through `AsyncOpenAI.chat.completions.create`.
   - For `provider="vllm" | "sglang"`: ensure structured-output args are normalized to the shared OpenAI-compatible syntax (`guided_json`, `response_format`, etc.).
4. Tool calling
   - Allow messages with `tool_calls` and `role="tool"`.
   - Ensure outputs that include tool calls are preserved in full, not truncated to `message.content`.
5. Flat vs. chat APIs (usage sketch after this list)
   - Provide two thin entrypoints on top of `llm_generate`:
     - `llm_complete` → simple flat prompt string.
     - `llm_chat` → full chat history + multimodal content.
   - Both funnel down into the unified `llm_generate` core.
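For illustration, a hypothetical usage sketch of the proposed entrypoints is below; `llm_complete` and `llm_chat` do not exist yet, and their parameters are the proposal's, not current Daft API:

```python
import daft
from daft import col
from pydantic import BaseModel

class DogFact(BaseModel):
    breed: str
    fact: str

df = daft.from_pydict({"prompt": ["Tell me a golden retriever fact as JSON."]})

# 1) Flat, completion-style call over a string column (proposed):
#    df = df.with_column("answer", llm_complete(col("prompt"), provider="openai", model="gpt-4o-mini"))
#
# 2) Chat call over a list[Struct] `messages` column with structured output (proposed):
#    df = df.with_column(
#        "fact",
#        llm_chat(col("messages"), response_model=DogFact,
#                 provider="vllm", base_url="http://localhost:8000/v1"),
#    )
```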
Describe alternatives you've considered
- Keeping OpenAI-only: limits Daft’s ability to serve as a unifying pipeline layer.
- Separate functions per provider (
llm_generate_openai,llm_generate_vllm): hurts composability, users need to branch logic. - Third-party wrappers (Instructor, Guidance, Outlines): powerful but add more dependencies; Daft’s core should remain provider-agnostic and thin.
Additional Context
Testing strategy
- Unit tests
  - Prompt-only (`llm_complete`) → ensures backwards compatibility.
  - Messages with multimodal content → confirm proper OpenAI schema.
  - Structured output → JSON schema validation with a simple `pydantic.BaseModel` (see the sketch after this list).
  - Tool call messages → assistant tool call + tool reply roundtrip.
- Integration tests
  - Run against `provider="openai"` with a mock client.
  - Run against `provider="vllm"` with an OpenAI-compatible local server.
  - Validate that the same schema works across providers.
- Property tests
  - Ensure `messages[].content` always serializes to valid JSON accepted by the OpenAI schema.
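As a minimal sketch of the structured-output unit test, assuming the real test would patch whichever client llm_generate constructs internally (the stand-in below only mimics the response shape):

```python
import asyncio
from types import SimpleNamespace
from unittest.mock import AsyncMock

from pydantic import BaseModel

class Weather(BaseModel):
    temp_f: int
    conditions: str

def test_structured_output_parses_into_model():
    # Stand-in for AsyncOpenAI.chat.completions.create; a real test would patch
    # whatever client llm_generate builds internally.
    fake_create = AsyncMock(return_value=SimpleNamespace(
        choices=[SimpleNamespace(message=SimpleNamespace(
            content='{"temp_f": 72, "conditions": "sunny"}'
        ))]
    ))
    resp = asyncio.run(fake_create(model="mock", messages=[{"role": "user", "content": "hi"}]))
    parsed = Weather.model_validate_json(resp.choices[0].message.content)
    assert parsed.temp_f == 72
```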
OpenAI Structured Outputs
OpenAI API Compatible Structured Outputs via Inference Engines
- SGLang Structured Outputs, Tool Calling, Multimodal
- vLLM Structured Outputs, Tool Calling, Multimodal
- OpenRouter Structured Outputs, Tool Calling, Multimodal
Structured Outputs Engines
Related issues and discussions:
https://github.com/Eventual-Inc/Daft/issues/1885 https://github.com/Eventual-Inc/Daft/discussions/2774
Would you like to implement a fix?
Yes
Thank you for the detailed write-up!
Seems we're aligned 😉
I'm working on OpenAI embeddings at the moment, and can work on OpenAI (and others) structured output next!
```python
from typing import Literal
from pydantic import BaseModel, Field, conint
from daft import col
# `prompt` is the Daft function under discussion; its import path may vary by version.

class PromptResponse(BaseModel):
    rating: conint(ge=0, le=5) = Field(..., description="Movie rating 0-5 used to display 'stars' in a UI.")
    category: Literal["Action", "Drama", "Comedy"] = Field(..., description="A generic movie category for tags.")

df = df.with_column("response", prompt(
    messages=col("messages"),  # messages is a list[Message] type.
    return_format=PromptResponse,
    provider="openai",
    model="gpt-5-mini",
))
```
I've been playing with some utilities for message building and arg handling, and it's been pretty productive. I'm working on some tests right now.
Full disclosure: this targets the OpenAI API only for now, to get the overall structure down. We can add base_url-aware flags to adjust how structured generation gets fed into clients once we get to inference servers.
One thing I didn't cover above is Pydantic validation. Personally, I think this should happen optionally, as a separate step after inference.
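For what that optional post-inference step could look like, here is a rough sketch using a Daft UDF over the raw response column; it assumes the response column holds JSON strings and reuses the PromptResponse model from the snippet above:

```python
import daft
from daft import col, DataType

@daft.udf(return_dtype=DataType.bool())
def is_valid_response(responses):
    # Validate each raw JSON response against the Pydantic model after inference.
    results = []
    for raw in responses.to_pylist():
        try:
            PromptResponse.model_validate_json(raw)
            results.append(True)
        except Exception:
            results.append(False)
    return results

df = df.with_column("is_valid", is_valid_response(col("response")))
```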
Can't wait to see how you handle the Message DType!
@everettVT likely just DataType.struct since that's what we use for all record types. I should be wrapping up #4997 soon which sets up for this.
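A rough sketch of what that struct could look like; the field set is illustrative only and sidesteps the real blocker, which is that content is a variant (a plain string or a list of parts):

```python
from daft import DataType

# Illustrative only: a flat message struct. The hard part is that `content`
# can be either a plain string or a list of content parts, which a fixed
# struct like this can't express directly.
message_dtype = DataType.struct({
    "role": DataType.string(),
    "content": DataType.string(),
    "name": DataType.string(),
    "tool_call_id": DataType.string(),
})
```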
As of recently, a majority of the features covered in this issue have been addressed by prompt.
The only item that remains is multi-turn conversation history, which requires the variant message dtype; that is currently a lower priority.