
Built-In Multimodality, Structured Outputs, and Tool Calling Syntax for llm_generate (OpenAI & vLLM/SGLang)

everettVT opened this issue 4 months ago · 4 comments

Is your feature request related to a problem?

TL;DR:

Current llm_generate usage in Daft is limited to flat string prompts, and its chat interface offers no way to override the message history. This limits Daft’s ability to support:

  • Multimodal inputs (images, audio, URIs).
  • Structured outputs (Pydantic / JSON schema, regex, EBNF).
  • Tool calling and assistant–tool role message flows.
  • Provider-specific differences (OpenAI vs. vLLM/SGLang/TensorRT-LLM).

Without this, developers have to drop down into raw clients instead of composing Daft-native pipelines.


Details

This issue details the requirements for tool use, structured generation, and multimodal inputs across both the OpenAI API and OpenAI-compatible inference providers.

Now that AsyncOpenAI client calls are built into the llm_generate function, Daft is well on its way to providing all of the major features for LLM workloads. Adding support for structured generation is a powerful feature, but it requires nuanced support for the specific syntax of each inference provider and server. The implications of supporting structured generation are wide-reaching and meaningful for the Daft community.

Because I know people will ask: structured generation has matured considerably since Jason Liu's "Pydantic is all you need" talk at the AI Engineer World's Fair back in 2024. Structured output engines like Outlines, XGrammar, and Guidance have formalized the approach, and inference engines like SGLang and vLLM have added support for each of them, with Outlines typically used by default. Each comes with its own strengths and weaknesses; as it currently stands, Guidance, with its core engine written in Rust, stands out as the performance leader.

Leading inference engines (vLLM, SGLang, and NVIDIA TensorRT-LLM) have been quick to support each of the major structured output engines, which can easily be toggled with parameters at the CLI. Luckily, all three use the exact same syntax for structured outputs when accessed through the OpenAI client, which works perfectly with the llm_generate function's use of AsyncOpenAI.

MOST CRITICALLY, structured outputs on inference servers (like vLLM) have different argument requirements than vanilla OpenAI. The current llm_generate function strictly implements the chat interface with an unoverridable chat history, which limits the flexibility and capability of the function.
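To make the difference concrete, here is a rough sketch (not Daft code) of the same JSON-schema request issued through AsyncOpenAI against vanilla OpenAI and against a vLLM-style server. The model names, the local base_url, and the guided_json extra-body key are illustrative; exact parameter names vary by server and version.

import asyncio
import json

from openai import AsyncOpenAI

# Hand-written JSON schema for the structured response.
movie_schema = {
    "type": "object",
    "properties": {"title": {"type": "string"}, "year": {"type": "integer"}},
    "required": ["title", "year"],
    "additionalProperties": False,
}

async def main() -> None:
    # Vanilla OpenAI: structured outputs ride on response_format.
    openai_client = AsyncOpenAI()
    resp = await openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Name a classic sci-fi movie."}],
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "movie", "schema": movie_schema, "strict": True},
        },
    )
    print(json.loads(resp.choices[0].message.content))

    # OpenAI-compatible servers (e.g. vLLM): guided-decoding args travel in
    # extra_body, which the official OpenAI endpoint does not accept.
    vllm_client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = await vllm_client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Name a classic sci-fi movie."}],
        extra_body={"guided_json": movie_schema},
    )
    print(json.loads(resp.choices[0].message.content))

asyncio.run(main())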

I don't think it's a stretch to say that Daft can and should support both simple completions and full chat workloads, which brings us to the canonical shape of OpenAI messages:

Messages Array

Each element is a message object with at least a role and content. Valid roles are:

  • "system"
  • "user"
  • "assistant"
  • "tool" (used to return tool outputs back into the thread)

Message object

{
  "role": "system" | "user" | "assistant" | "tool",
  "content": string | [ content_part, ... ],
  "name": string?,         // optional name (e.g. function/tool name)
  "tool_call_id": string?  // used when replying to a specific tool call
}

Content

  • Simplest form: a single string, e.g. "Hello!".
  • Rich form: a list of “content parts.” This is how you do multimodal.

Content part object

// text
{ "type": "text", "text": "hello world" }

// images
{ "type": "image_url", "image_url": { "url": "https://..." } }

// audio (via URL)
{ "type": "audio_url", "audio_url": { "url": "https://..." } }

// audio (inline base64, “input_audio”)
{ "type": "input_audio",
  "input_audio": { "data": "<base64 string>", "format": "wav" } }

Currently documented content types:

  • "text"
  • "image_url"
  • "input_audio" (base64 + format)
  • "audio_url" (URL, supported in some contexts)
  • "tool_call" (returned by assistant when it wants to call a tool)

Tool calling (assistant message special case)

When the assistant wants to call a tool, the message has tool_calls instead of content:

{
  "role": "assistant",
  "tool_calls": [
    {
      "id": "call_abc123",
      "type": "function",
      "function": {
        "name": "get_weather",
        "arguments": "{ \"location\": \"Chicago\" }"   // JSON string
      }
    }
  ]
}

Tool role messages

When you return the tool’s output, you push a message with role:"tool":

{
  "role": "tool",
  "tool_call_id": "call_abc123",
  "content": "72°F and sunny"
}

This keeps the thread consistent.
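For reference, the full roundtrip through the OpenAI Python client looks roughly like the sketch below; the get_weather tool and its stubbed result are hypothetical.

import json

from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Chicago?"}]

# First turn: the assistant decides to call the tool.
first = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
assistant_msg = first.choices[0].message

if assistant_msg.tool_calls:
    call = assistant_msg.tool_calls[0]
    args = json.loads(call.function.arguments)

    # Run the tool (stubbed) and push both the assistant tool_calls message
    # and the role="tool" reply back onto the thread.
    result = f"72°F and sunny in {args['location']}"
    messages.append(assistant_msg)  # the SDK message object serializes as-is
    messages.append({"role": "tool", "tool_call_id": call.id, "content": result})

    # Second turn: the assistant answers using the tool output.
    second = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
    print(second.choices[0].message.content)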


Putting it all together

A typical multimodal + tool call thread might look like:

[
  { "role": "system", "content": "You are a helpful assistant." },
  { "role": "user", "content": [
      { "type": "text", "text": "What’s in this picture?" },
      { "type": "image_url", "image_url": { "url": "https://..." } }
    ]
  },
  { "role": "assistant", "content": "It looks like a golden retriever." },
  { "role": "assistant", "tool_calls": [
      { "id": "abc123", "type": "function",
        "function": { "name": "dog_facts", "arguments": "{ \"breed\": \"golden retriever\" }" }
      }
    ]
  },
  { "role": "tool", "tool_call_id": "abc123", "content": "Golden retrievers are friendly." }
]

The schema is deliberately loose:

  • role is required.
  • content can be a string or a list of parts.
  • Extra fields (name, tool_call_id, tool_calls) appear only in special cases.

Finally, I want to acknowledge that as of GPT-5, OpenAI has introduced a new structured output format called Harmony. Harmony will have its own set of requirements and should be addressed in a separate issue.

Describe the solution you'd like

I propose extending llm_generate to do the following (a sketch of the input-unification step follows the list):

  1. Unify inputs

    • Accept Expression for:

      • prompt (simple strings → completions).
      • messages (list[Struct] → chat).
      • image, audio (Daft dt.Image, dt.Audio, or URI strings).
    • Collapse all multimodal parts into OpenAI-compatible messages[].content.

  2. Support structured outputs

    • Add response_model: BaseModel | None for Pydantic validation.
    • Pass structured generation args (json_schema, guided_json, regex, etc.) transparently to both OpenAI and inference servers.
  3. Provider parity

    • For provider="openai": route through AsyncOpenAI.chat.completions.create.
    • For provider="vllm" | "sglang": ensure structured-output args are normalized to the shared OpenAI-compatible syntax (guided_json, response_format, etc.).
  4. Tool calling

    • Allow messages with tool_calls and role="tool".
    • Ensure outputs that include tool calls are preserved in full, not truncated to message.content.
  5. Flat vs. chat APIs

    • Provide two thin entrypoints on top of llm_generate:

      • llm_complete → simple flat prompt string.
      • llm_chat → full chat history + multimodal content.
    • Both funnel down into the unified llm_generate core.
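To illustrate what the "unify inputs" step might look like inside the shared core, here is a minimal, Daft-free sketch of a normalizer that collapses a flat prompt plus optional image/audio inputs into OpenAI-style messages. The helper name and its arguments are illustrative only, not a proposed signature.

import base64
from typing import Any


def to_openai_messages(
    prompt: str | None = None,
    messages: list[dict[str, Any]] | None = None,
    system: str | None = None,
    image_url: str | None = None,
    audio_bytes: bytes | None = None,
    audio_format: str = "wav",
) -> list[dict[str, Any]]:
    """Collapse flat-prompt and multimodal inputs into an OpenAI messages[] list."""
    if messages is not None:
        # Chat path: the caller supplied a full history; pass it through untouched.
        return messages

    parts: list[dict[str, Any]] = []
    if prompt:
        parts.append({"type": "text", "text": prompt})
    if image_url:
        parts.append({"type": "image_url", "image_url": {"url": image_url}})
    if audio_bytes:
        parts.append({
            "type": "input_audio",
            "input_audio": {
                "data": base64.b64encode(audio_bytes).decode("ascii"),
                "format": audio_format,
            },
        })

    content: Any = parts
    if prompt and len(parts) == 1:
        content = prompt  # text-only prompts stay a flat string (completions-style)

    out: list[dict[str, Any]] = []
    if system:
        out.append({"role": "system", "content": system})
    out.append({"role": "user", "content": content})
    return out


# Example: llm_chat-style multimodal input collapsing into a single user message.
msgs = to_openai_messages(
    prompt="What's in this picture?",
    image_url="https://example.com/dog.jpg",
    system="You are a helpful assistant.",
)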

Describe alternatives you've considered

  • Keeping OpenAI-only: limits Daft’s ability to serve as a unifying pipeline layer.
  • Separate functions per provider (llm_generate_openai, llm_generate_vllm): hurts composability, users need to branch logic.
  • Third-party wrappers (Instructor, Guidance, Outlines): powerful but add more dependencies; Daft’s core should remain provider-agnostic and thin.

Additional Context

Testing strategy

  1. Unit tests

    • Prompt-only (llm_complete) → ensures backwards compatibility.
    • Messages with multimodal content → confirm proper OpenAI schema.
    • Structured output → JSON schema validation with simple pydantic.BaseModel.
    • Tool call messages → assistant tool call + tool reply roundtrip.
  2. Integration tests

    • Run against provider="openai" with mock client.
    • Run against provider="vllm" with OpenAI-compatible local server.
    • Validate same schema works across providers.
  3. Property tests

    • Ensure messages[].content always serializes to valid JSON accepted by the OpenAI schema (a minimal sketch of such a check follows).
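As a concrete starting point, the schema-shape checks could be as simple as the following pytest sketch (purely structural, no provider calls):

import json


def test_multimodal_user_message_matches_openai_shape():
    message = {
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this picture?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/dog.jpg"}},
        ],
    }
    # role must be one of the valid roles; content is a string or a list of typed parts.
    assert message["role"] in {"system", "user", "assistant", "tool"}
    assert isinstance(message["content"], (str, list))
    for part in message["content"]:
        assert "type" in part
    # Must round-trip through JSON, since that's what the client ultimately sends.
    assert json.loads(json.dumps(message)) == message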

References:

  • OpenAI Structured Outputs
  • OpenAI API Compatible Structured Outputs via Inference Engines
  • Structured Outputs Engines

Related issues and discussions:

  • https://github.com/Eventual-Inc/Daft/issues/1885
  • https://github.com/Eventual-Inc/Daft/discussions/2774

Would you like to implement a fix?

Yes

everettVT avatar Aug 19 '25 19:08 everettVT

Thank you for the detailed write-up!

Seems we're aligned 😉

I'm working on OpenAI embeddings at the moment, and can work on OpenAI (and others) structured output next!

from typing import Literal

from pydantic import BaseModel, Field, conint


class PromptResponse(BaseModel):
    rating: conint(ge=0, le=5) = Field(..., description="Movie rating 0-5 used to display 'stars' in a UI.")
    category: Literal["Action", "Drama", "Comedy"] = Field(..., description="A generic movie category for tags.")


df = df.with_column("response", prompt(
    messages=col("messages"),  # messages is a list[Message] type
    return_format=PromptResponse,
    provider="openai",
    model="gpt-5-mini",
))
rchowell avatar Aug 19 '25 19:08 rchowell

I've been playing with some utilities for message building and arg handling, and it's been pretty productive. I'm working on some tests right now.

Full disclosure: this targets the OpenAI API only for now, to get the overall structure down. We can add base_url-aware flags to change how structured generation gets fed into clients once we get to inference servers.

One thing I didn't cover above is Pydantic validation. Personally, I think this should happen optionally as a separate step post-inference.
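Sketching what that optional post-inference step might look like (independent of any Daft API, with an illustrative response model):

from pydantic import BaseModel, ValidationError


class PromptResponse(BaseModel):
    rating: int
    category: str


def validate_response(raw: str) -> PromptResponse | None:
    """Parse the model's raw JSON string; return None (or dead-letter the row) on failure."""
    try:
        return PromptResponse.model_validate_json(raw)
    except ValidationError:
        return None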

everettVT avatar Aug 19 '25 19:08 everettVT

Can't wait to see how you handle the Message DType!

everettVT avatar Aug 21 '25 22:08 everettVT

@everettVT likely just DataType.struct since that's what we use for all record types. I should be wrapping up #4997 soon which sets up for this.
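For reference, a single-turn message might be expressed with today's API along these lines (a sketch only; whether content needs a variant/union type to carry multimodal parts is exactly the open question):

from daft import DataType

# One possible shape for a text-only message record.
message_dtype = DataType.struct({
    "role": DataType.string(),
    "content": DataType.string(),
})

# A conversation is then a list of such structs.
messages_dtype = DataType.list(message_dtype)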

rchowell avatar Aug 22 '25 01:08 rchowell

As of recently, the majority of the features covered in this issue have been addressed by prompt.

The only item that remains is multi-turn conversation history, which requires the variant message dtype and is currently a lower priority.

everettVT avatar Nov 21 '25 18:11 everettVT