JSON schema support for dynamic types

Open sxlijin opened this issue 1 year ago • 16 comments

As of right now, BAML doesn't have an official release for JSON schema support in TypeBuilder. This is because there are so many different ways to serialize a JSON schema into a type, and we don't want to be too opinionated here. Instead, we offer a reference implementation in this repository that you can use to leverage Python / Pydantic models with TypeBuilder:

https://github.com/BoundaryML/baml-examples/tree/main/json-schema-to-baml


PREVIOUS NOTES (not implemented - please use reference above!)

We have a working implementation of this in #655 that allows users to inject JSON schemas into TypeBuilder using the following syntaxes. We're currently holding off on merging this, though, because JSON schema is a very complex format and we don't have any users asking for this. If you're interested in trying this out, please let us know and we'll merge this in and make sure it works for your use case!

Python

import pydantic
from typing import Optional

class Person(pydantic.BaseModel):
    last_name: list[str]
    height: Optional[float] = pydantic.Field(description="Height in meters")

tb = TypeBuilder()
tb.unstable_features.add_json_schema(Person.model_json_schema())

res = await b.ExtractPeople(
    "My name is Harrison. My hair is black and I'm 6 feet tall. I'm pretty good around the hoop. I like giraffes.",
    {"tb": tb},
)

TypeScript

const personSchema = z.object({
  animalLiked: z.object({
    animal: z.string().describe('The animal mentioned, in singular form.'),
  }),
  hobbies: z.enum(['chess', 'sports', 'music', 'reading']).array(),
  height: z.union([z.string(), z.number().int()]).describe('Height in meters'),
})

let tb = new TypeBuilder()
tb.unstableFeatures.addJsonSchema(zodToJsonSchema(personSchema, 'Person'))

const res = await b.ExtractPeople(
  "My name is Harrison. My hair is black and I'm 6 feet tall. I'm pretty good around the hoop. I like giraffes.",
  { tb },
)

sxlijin · Jul 10 '24 17:07

We'd love to use this feature.

We use Pydantic extensively, including at our ORM layer. This would allow us to continue defining types ergonomically in Python, and have them available in BAML functions.

arunbahl · Aug 15 '24 20:08

Same! For those of us with a bunch of zod schemas already built out, it's a lot easier to adopt BAML going forward if we don't have to big-bang migrate everything over to BAML in one go, and can instead do it a little bit at a time and intermingle our existing zod schemas. This'd be great!

airhorns · Sep 04 '24 01:09

Here's a question about an alternative approach (sorry for the delayed response, @arunbahl !): would you be interested in something that could take your pydantic/zod schemas and generate BAML code from them?

Here are some of the reasons we haven't shipped this yet:

  • part of the value proposition of using BAML, we believe, is the developer experience

    • you get live previews of your prompts as you edit them;
    • you get type-checking in your prompt templates as you expand them; and
    • you can define tests for your prompts right next to them, without having to write a bunch of pytest boilerplate

    all because your prompts are written in BAML

  • TypeBuilder is meant for types that must be defined on-the-fly, whereas most Pydantic and Zod schemas that we've seen are just defined statically
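To make the "generate BAML code from your schemas" idea concrete, here is a minimal sketch, not an official tool. It assumes a JSON schema shaped like what Pydantic v2 emits for the `Person` model earlier in this thread (the dict is written out by hand so the sketch has no dependencies), and the helper names `baml_type` and `baml_class` are hypothetical. It only covers primitives, arrays, and nullable fields; anything else raises.

```python
import json

# JSON schema primitive names mapped to BAML primitive type names.
PRIMITIVES = {"string": "string", "integer": "int", "number": "float", "boolean": "bool"}


def baml_type(schema: dict) -> str:
    """Translate one JSON-schema node into a BAML type expression (hypothetical helper)."""
    if "anyOf" in schema:
        variants = schema["anyOf"]
        non_null = [s for s in variants if s.get("type") != "null"]
        # Only the "T or null" shape is handled, which becomes an optional type.
        if len(non_null) == 1 and len(variants) == 2:
            return baml_type(non_null[0]) + "?"
        raise NotImplementedError("general unions are out of scope for this sketch")
    kind = schema.get("type")
    if kind == "array":
        return baml_type(schema["items"]) + "[]"
    if kind in PRIMITIVES:
        return PRIMITIVES[kind]
    raise NotImplementedError(f"unsupported schema node: {schema!r}")


def baml_class(schema: dict) -> str:
    """Render a flat object schema as BAML class source text (hypothetical helper)."""
    lines = [f"class {schema['title']} {{"]
    for name, prop in schema["properties"].items():
        line = f"  {name} {baml_type(prop)}"
        if "description" in prop:
            # json.dumps gives a double-quoted, escaped string literal.
            line += f" @description({json.dumps(prop['description'])})"
        lines.append(line)
    lines.append("}")
    return "\n".join(lines)


# Hand-written stand-in for Person.model_json_schema() (an assumption about its shape).
PERSON_SCHEMA = {
    "title": "Person",
    "type": "object",
    "properties": {
        "last_name": {"type": "array", "items": {"type": "string"}},
        "height": {
            "anyOf": [{"type": "number"}, {"type": "null"}],
            "description": "Height in meters",
        },
    },
}

print(baml_class(PERSON_SCHEMA))
```

The output is plain text that could be pasted into a `.baml` file, which keeps the live previews and type-checking described above. Nested objects, enums, and `$ref` handling are deliberately left out.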

sxlijin · Sep 05 '24 02:09

I would also love to use this feature; it is more or less necessary for my use case if I am to fully embrace BAML. The alternative approach @sxlijin proposed would not work for my use cases. For context, I am working on an agentic framework and using BAML for my prompting backend. There are two key reasons I need this feature: (1) I am developing a Python library for other developers working with LLM agents, and part of this involves the developers providing their own schemas, which eventually get passed to BAML dynamic return outputs on my backend. Without this feature, this isn't really possible without inventing my own schema system that converts properly to the necessary dynamic BAML types with the TypeBuilder. With this feature, I could enable developers to provide schemas in their preferred form: JSON schema, Pydantic, or the BAML TypeBuilder if they wanted. (2) I need to be able to serialize my dynamic output schemas. This is much easier if they are represented by Pydantic objects or JSON schemas, and doesn't currently seem feasible with TypeBuilder.

The alternative approach would not work for me because: (1) It would complicate how I enable developers to provide JSON schemas - having to convert to BAML first as opposed to just passing it in when I make the BAML call. (2) Some of my schemas may be generated at runtime, e.g. as a derived result from other LLM calls - meaning I would not be able to create a corresponding Pydantic schema beforehand.
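To illustrate the intake and serialization points, here is a rough sketch of the normalization step a library like mine might run before handing anything to BAML. The helper names `to_json_schema` and `dump_schema` are hypothetical, and Pydantic models are detected by duck-typing on `model_json_schema()` (the Pydantic v2 method) so the snippet runs without Pydantic installed.

```python
import json
from typing import Any


def to_json_schema(schema: Any) -> dict:
    """Normalize the supported inputs to a plain JSON-schema dict (hypothetical helper)."""
    if isinstance(schema, dict):
        # Already a JSON schema, e.g. one derived at runtime from another LLM call.
        return schema
    if isinstance(schema, str):
        # A serialized JSON schema.
        return json.loads(schema)
    # Pydantic v2 model classes and instances expose model_json_schema().
    method = getattr(schema, "model_json_schema", None)
    if callable(method):
        return method()
    raise TypeError(f"unsupported schema input: {type(schema).__name__}")


def dump_schema(schema: Any) -> str:
    """Serialize any supported input to a stable JSON string for storage."""
    return json.dumps(to_json_schema(schema), sort_keys=True)


print(dump_schema({"type": "array", "items": {"type": "string"}}))
```

Everything downstream then only has to understand one representation, which is the part that currently has no direct TypeBuilder equivalent.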

anerli · Oct 23 '24 19:10

On a related note, having serde options for TypeBuilder and FieldType objects would be a partial solve for my use case. I think currently having the JSON support for dynamic types solves my use case better because of the first point I mentioned. However, being able to serialize and de-serialize TypeBuilder objects and field types would also add a lot of flexibility for me - not sure if other people would make use of this. In general my use case necessitates the use of a lot of dynamic types - so having maximum flexibility with how I work with them adds a lot of value to me.

anerli · Oct 23 '24 19:10

jfan via Discord wants this: https://discord.com/channels/1119368998161752075/1316873547078959155

sxlijin · Dec 12 '24 21:12

we have a community-contributed solution to this you can check out here: https://github.com/BoundaryML/baml-examples/tree/main/json-schema-to-baml

aaronvg · Dec 17 '24 20:12

I love BAML and this feature is absolutely needed for me to integrate with the product I am working on. Currently I rely on Tool Calling or instructor to do this.

ezhilvendhan · Feb 23 '25 06:02

Hey @ezhilvendhan the recommended approach here is to use the method here: https://github.com/BoundaryML/baml-examples/tree/main/json-schema-to-baml

hellovai · Feb 23 '25 06:02

Thank you @hellovai. How about this approach defined in the docs? Are they the same?

import pydantic
from baml_client import b

class Person(pydantic.BaseModel):
    last_name: list[str]
    height: Optional[float] = pydantic.Field(description="Height in meters")

tb = TypeBuilder()
tb.unstable_features.add_json_schema(Person.model_json_schema())

res = await b.ExtractPeople(
    "My name is Harrison. My hair is black and I'm 6 feet tall. I'm pretty good around the hoop. I like giraffes.",
    {"tb": tb},
)

ezhilvendhan · Feb 23 '25 06:02

i should update the docs! tb.unstable_features.add_json_schema isn't a thing!

the repo i linked, however, does have something equivalent to it!

see the definition of the parse_json_schema in that repo! you can just copy and paste the method over and things will "just work"

def parse(raw_text: str):
    tb = TypeBuilder()
    # parse_json_schema is the helper copied over from the linked repo
    res = parse_json_schema(Resume.model_json_schema(), tb)
    # DynamicContainer is the OutputType of the baml function ExtractDynamicTypes
    tb.DynamicContainer.add_property("data", res)
    response = b.ExtractDynamicTypes(raw_text, {"tb": tb})
    return response

hellovai · Feb 23 '25 07:02

My use case is that many of my structures / schemas are defined at runtime and constructed as part of the user workflow. I use the Schema class from Effect (which is more or less the same as Zod) as it can be serialized to/from json schema, provides type inference with support for Branded Types / runtime placeholders, and conforms to the Standard Schema Spec.

I'm on the fence about using BAML and justifying the tooling (nice for this stage 👏 , but forces vscode 😢 ), custom syntax, etc. But the first thing I would have to do is write a generator using TypeBuilder for those schemas.

awhiteside1 · Mar 20 '25 23:03

@awhiteside1 which IDE do you use? We have an LSP we are literally about to finish writing, which should work with other editors.

aaronvg · Mar 20 '25 23:03

👋 jetbrains, usually webstorm or idea. Happy to beta test!

awhiteside1 · Mar 21 '25 02:03

Are there plans to roll this feature into BAML directly? This is required for integration with, for example, MCP tools (along with some minor changes to the linked community code).

htxryan · Apr 22 '25 14:04

I’m working with a database containing several hundred thousand regex match patterns (possibly up to a million). To my understanding, loading all patterns at once in BAML would introduce performance bottlenecks.

To address this, I’ve designed a pipeline that narrows down a large set of patterns to a small, relevant subset for each input text. This filtered list is then used to extract type-safe data with higher precision.

My use-case involves parsing commands from a network router CLI. Commands are composed of multiple tokens/words, where some are static predefined text and others are variables matched via regex. For example:

show interfaces ethernet eth5 identify

Here, eth5 is a variable. A corresponding pattern might look like show interfaces ethernet {{ InterfaceName }} identify and the regex for the InterfaceName would be something like:

"InterfaceName": {
  "pattern": "^(eth|wg|tun|vtun|vti|br|bond|dum|lo|pppoe|peth|vxlan|geneve|erspan|gre|gretap|ip6gre|ip6gretap|ipip|sit|l2tp|pptp|sstp|wwan|macsec|pseudo-ethernet)[0-9]+(\\.[0-9]+)?$",
  "description": "VyOS interface name",
  "examples": ["eth0", "wg0", "br0", "eth0.100"]
}

The retrieval step may be ambiguous and return a few candidate patterns for a given command, not just one. As long as one pattern, or even a few, fully matches the command, it counts as a success.

With a much smaller, targeted list of patterns per command, I plan to dynamically generate schemas using TypeBuilder, instruct the LLM on the command list, and extract type-safe matches—ignoring cases where some patterns may not match.
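As a rough sketch of the matching step only (no BAML involved), the snippet below expands each `{{ Name }}` placeholder into a named regex group and keeps every candidate template that fully matches a command. The function names `compile_template` and `match_candidates` are hypothetical, the variable pattern is shortened from the one above, and each variable is assumed to appear at most once per template.

```python
import re

# Variable definitions shaped like the JSON fragment above (pattern shortened for the sketch).
VARIABLES = {
    "InterfaceName": {"pattern": r"^(eth|wg|br|bond|lo)[0-9]+(\.[0-9]+)?$"},
}

# Matches placeholders such as "{{ InterfaceName }}".
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def compile_template(template: str, variables: dict) -> "re.Pattern[str]":
    """Turn a command template into a regex with one named group per variable."""
    parts = []
    pos = 0
    for m in PLACEHOLDER.finditer(template):
        # Static text between placeholders is matched literally.
        parts.append(re.escape(template[pos:m.start()]))
        name = m.group(1)
        inner = variables[name]["pattern"]
        # Drop the ^ and $ anchors so the pattern can sit in the middle of a command.
        if inner.startswith("^"):
            inner = inner[1:]
        if inner.endswith("$"):
            inner = inner[:-1]
        parts.append(f"(?P<{name}>{inner})")
        pos = m.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("".join(parts))


def match_candidates(command: str, templates: list, variables: dict) -> list:
    """Return (template, captured variables) for every template that fully matches."""
    hits = []
    for template in templates:
        m = compile_template(template, variables).fullmatch(command)
        if m:
            hits.append((template, m.groupdict()))
    return hits


CANDIDATES = [
    "show interfaces ethernet {{ InterfaceName }} identify",
    "show interfaces bridge {{ InterfaceName }}",
]
print(match_candidates("show interfaces ethernet eth5 identify", CANDIDATES, VARIABLES))
```

Candidates that do not match are simply dropped, which mirrors the "ignore patterns that may not match" behavior; the surviving templates are the ones I would feed into TypeBuilder.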

qdrddr · Nov 06 '25 17:11