msgspec
Support json-schema generation
It would be helpful if JSON schema generation was supported
```python
from typing import Optional, Set
import msgspec

class User(msgspec.Struct):
    """A new type describing a User"""
    name: str
    groups: Set[str] = set()
    email: Optional[str] = None

schema = User.json_schema()  # the requested API
```
Similar to the functionality seen in https://github.com/s-knibbs/dataclasses-jsonschema
Thanks for opening this. This is definitely in scope, but isn't something I plan on spending much time on immediately.
Note that to avoid polluting the Struct method namespace, this should probably be a top-level method in the msgspec.json namespace instead. Something like msgspec.json.json_schema perhaps?
I think supporting json schema would be critical for msgspec to be an alternative to Pydantic
> to avoid polluting the Struct method namespace, this should probably be a top-level method in the msgspec.json namespace instead. Something like msgspec.json.json_schema perhaps?

👍 on this, I think having these methods be top-level is a much better idea. Ideally Struct would have no methods to avoid name collisions (e.g. someone wants a field called json_schema).
I'm not an active json-schema user - a couple questions/comments:
- We can fairly easily support generating a simple `json-schema` spec from a `msgspec` annotated type. I expect this to be no more than a couple hours work to get this functioning + tests + docs.
- We don't currently have a mechanism for adding extra info to the generated json-schema (`title`, `description`, ...). How important are those (or, more directly, if we release a version of `json_schema` that doesn't support those, will anyone find it still useful)? How do other systems handle this? What metadata fields like these should we support?
I think an initial implementation without the metadata would be a good place to start.
As to what metadata to include, I think it would be good to support title and description. There is also other metadata like "pattern" or "format", but that dovetails with constraints/validation. As to how to express that metadata in Python, I would recommend PEP593 (see #154).
> As to how to express that metadata in Python, I would recommend PEP593 (see https://github.com/jcrist/msgspec/issues/154).

In your mind, how would the non-functional annotations (title, description) be specified on the schema (do these go on fields? On struct types themselves?)? An example user session would be helpful here.
maybe something like:
```python
PhoneNumber = Annotated[str, Pattern(r"\+1\d{10}"), Description("Phone number. US only. No dashes")]

class _Person(Struct):
    cel: PhoneNumber

Person = Annotated[_Person, Description("A human")]
```
Users can choose how to handle the naming (_ prefix, Person and PersonSchema, etc).
Note: apologies for the long response
Hmmm, I don't love that (for the description bit). While it's important that whatever syntax is chosen is mypy/pyright compatible, I find it more important that it's easy to read, and doesn't require knowledge of complicated python features. Even though I'd consider myself an experienced python dev, I had to verify that Person was still callable, since it's no longer exactly a Struct class type.
I think it might be helpful to distinguish between json-schema metadata that affects runtime behavior (meaning constraints like Pattern) and metadata that's just for documentation/adding into the generated jsonschema. I think the former makes sense to handle using Annotated, since it feels more attached to a "type" to me. The latter feels more like documentation.
A few assumptions I'm making here about what's common based on reading a few example json schemas for apis I'm familiar with. Please correct me if I'm wrong on any of these:
- In a well-documented api, `description` annotations are likely wanted on every field in an object
- These descriptions are probably more tied to the field name than to field type. A struct may have several fields containing only primitive types (e.g. `int`), each with a different meaning and requiring a different description. As such, setting the descriptions as part of a type alias (to reduce indentation) will be unwieldy in practice.
- Descriptions are most likely to be associated with an object field, and less likely to be associated with a type further down the tree. By this I mean that you might attach a description to a `coordinates` field (containing an array of 2-tuples of floats), but not to the individual floats or tuples. Similar to docstring parameters, the description describes the full field, not a subcomponent.
Given this, it seems likely that we'll want to optimize the spelling for specifying json-schema metadata for:
- Adding metadata on a `Struct` type itself (title, description, examples, ...)
- Adding metadata on a field (title, description, ...)
Knowing very little about how people actually use json, I think some of the default behavior pydantic uses makes sense:
- `title` defaults to a `Struct` class name if not overridden in some way
- `description` defaults to a `Struct` docstring if not overridden in some way
- Otherwise, by default only generate constraint/type level schema information
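
A minimal sketch of those defaults, using a plain annotated class as a stand-in for a `Struct` so it runs without msgspec. The names `default_schema` and `_JSON_TYPES` are hypothetical, and only a few primitive types are handled:

```python
import inspect
from typing import Any, Dict, get_type_hints

# Hypothetical mapping from Python types to JSON Schema type names.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def default_schema(cls: type) -> Dict[str, Any]:
    """Build a minimal object schema; title and description come from the class."""
    schema: Dict[str, Any] = {"title": cls.__name__, "type": "object"}
    doc = cls.__dict__.get("__doc__")  # ignore docstrings inherited from a base
    if doc:
        schema["description"] = inspect.cleandoc(doc)
    # Only type-level information is generated for the fields themselves.
    schema["properties"] = {
        name: {"type": _JSON_TYPES[tp]} for name, tp in get_type_hints(cls).items()
    }
    return schema


class User:
    """A new type describing a User"""

    name: str
    age: int


schema = default_schema(User)
```

Here `schema["title"]` is taken from the class name and `schema["description"]` from the docstring; nothing else is documented unless the user asks for it.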
Playing around with a few ideas of how to specify the optional metadata:
1. Use a Field type, assigned as a field default
This mirrors the api of dataclasses or pydantic.
```python
from msgspec import Struct, Field

class User(Struct):
    name: str = Field(description="The user's name")
    age: int = Field(
        description="""
        The user's age.
        Multiline descriptions still are readable IMO.
        """
    )
```
Pros
- Familiar to users of dataclasses/pydantic/attrs
- Provides a consistent place to put other metadata for a field
- Doesn't clutter the type annotation
- Would also work for forwarding additional metadata to `defstruct` for dynamic type definitions
- Multiline descriptions would still be readable
Cons
- Requires some type trickery to make it work with mypy out of the box. pyright already knows how to do this.
- Metadata added to non-struct-field types not supported this way. We'd either not support that at all, or would need a different mechanism.
- Some other mechanism would be needed for modifying metadata at the struct level, since this only applies to fields
2. Use a meta annotation, as part of the type annotation
```python
from msgspec import Struct, Meta
from typing import Annotated

class User(Struct):
    name: Annotated[str, Meta(description="The user's name")]
    age: Annotated[
        int,
        Meta(
            description="""
            The user's age.
            Multiline description here.
            """
        ),
    ]
```
Pros
- Can be applied anywhere in a type definition, can be used consistently across all types msgspec supports
- Still works with `defstruct`, metadata attached to type can be passed in like normal
- At the type level this makes sense
Cons
- Clutters the type definition, less readable IMO. It's not terrible, but not ideal.
- Some other mechanism would be needed for modifying metadata at the struct level, since this only applies to fields
3. So much magic
We could make this work:
```python
import msgspec

class User(msgspec.Struct):
    name: str
    msgspec.doc("The user's name")  # not attached to `doc` as the name
    age: int
    msgspec.doc("The user's age")
```
Pros
- Detaches metadata from type annotations/default values. Metadata is for documentation, annotations affect runtime behavior.
- I think this is pretty readable
- Multiline descriptions would still be readable
- Works fine with mypy/pyright out of the box
Cons
- It's magical.
- Metadata added to non-struct-field types not supported this way. We'd either not support that at all, or would need a different mechanism.
- Wouldn't play well with `defstruct`, since this is highly coupled to the `StructMeta` metaclass.
- Some other mechanism would be needed for modifying metadata at the struct level, since this only applies to fields
4. Specify jsonschema metadata directly
Group jsonschema metadata together, and specify it on the struct type in a single call. This could be done a few ways:
4a. As a dict set on the class
```python
class User(msgspec.Struct):
    name: str
    age: int

    __json_schema__ = {
        "title": "The title for the type",
        "description": "The description for the type",
        "properties": {
            "name": {"description": "The user's name"},
            "age": {"description": "The user's age"},
        },
    }
```
4b. As a typed object set on the class
```python
class User(msgspec.Struct):
    name: str
    age: int

    __json_schema__ = msgspec.json.SchemaConfig(
        title="The title for the type",
        description="The description for the type",
        properties={
            "name": {"description": "The user's name"},
            "age": {"description": "The user's age"},
        }
    )
```
4c. Using a class decorator
```python
# this could also use keyword arguments instead, similar to 4b above.
@msgspec.json.with_schema(
    {
        "title": "The title for the type",
        "description": "The description for the type",
        "properties": {
            "name": {"description": "The user's name"},
            "age": {"description": "The user's age"},
        },
    }
)
class User(msgspec.Struct):
    name: str
    age: int
```
All of these would work similarly:
- When generating the schema, the manually specified object would be first deepcopied to form a base template
- The type information from the annotations/default would then be added to the template using `setdefault`. This means that anything manually specified would remain, and the schema generation would only fill in the blanks
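
The two steps above can be sketched roughly as follows. This is only an illustration of the deepcopy-then-`setdefault` idea; `merge_schema` and `fill_in` are hypothetical helper names, and the `inferred` dict stands in for whatever the schema generator would derive from the annotations:

```python
import copy
from typing import Any, Dict


def fill_in(template: Dict[str, Any], inferred: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively add inferred entries without overwriting manual ones."""
    for key, value in inferred.items():
        if isinstance(value, dict) and isinstance(template.get(key), dict):
            # Both sides have a nested object here: merge one level down.
            fill_in(template[key], value)
        else:
            # setdefault keeps anything the user specified by hand.
            template.setdefault(key, copy.deepcopy(value))
    return template


def merge_schema(manual: Dict[str, Any], inferred: Dict[str, Any]) -> Dict[str, Any]:
    """Deep-copy the manual schema as a base template, then fill in the blanks."""
    return fill_in(copy.deepcopy(manual), inferred)


manual = {
    "title": "The title for the type",
    "properties": {"name": {"description": "The user's name"}},
}
inferred = {
    "title": "User",
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
}
merged = merge_schema(manual, inferred)
```

The manual `title` wins over the inferred one, while `type` and the per-field types are filled in; `manual` itself is left untouched because of the deep copy.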
Pros
- Doesn't clutter type annotations
- Separates things that affect runtime behavior from documentation. Very long descriptions are fine for readability, since they aren't mixed in with things affecting runtime behavior.
- Gives full flexibility for specifying the json schema.
- Grouping the metadata all in one spot in a way that fully matches jsonschema syntax may be nicer for users familiar with jsonschema
- Straightforward to make work with `defstruct`
- Straightforward to make work with other typed-object-like-things (e.g. `TypedDict`)
- Can customize the annotations for deeply nested fields
Cons
- Grouping the metadata all in one spot may make it easier to lose alignment with the fields on the type. For example, if you add a new field you'd also have to remember to add a description in the schema bit. This would also happen in the other options, except that the descriptions are closer to the annotations visually, so there's more of a visual cue to do the right thing.
- Users would need to be more familiar with the syntax for jsonschema. Using a custom object or decorator would help mitigate that a bit, since keyword arguments + types would help document what fields are expected where.
I'm pretty mixed on all of these. I think option 3 is probably out since it's the least flexible/most magical. I kinda like option 4a or 4b. Option 1 or 2 are also fine, but less ideal in my mind since they clutter up the type definitions.
Thoughts?
> I think it might be helpful to distinguish between json-schema metadata that affects runtime behavior (meaning constraints like Pattern) and metadata that's just for documentation/adding into the generated jsonschema. I think the former makes sense to handle using Annotated, since it feels more attached to a "type" to me. The latter feels more like documentation.

I think this is a valid point. Tools like hypothesis should not depend on the title/description, just on the type constraints.
> Descriptions are most likely to be associated with an object field, and less likely to be associated with a type further down the tree. By this I mean that you might attach a description to a coordinates field (containing an array of 2-tuples of floats), but not to the individual floats or tuples. Similar to docstring parameters, the description describes the full field, not a subcomponent.

I think this is probably true in the example you gave, but not in general. For example:
```python
Latitude = Annotated[
    float,
    Ge(-90), Le(90),
    Description("Angular distance of a place north or south of the earth's equator")
]
Longitude = Annotated[
    float,
    Ge(-180), Le(180),
    Description(
        "Angular distance of a place east (positive) or"
        " west (negative) of the meridian at Greenwich, England"
    )
]
Position = Tuple[Latitude, Longitude]
```
Here I'm attaching important constraints and descriptions to raw types, not the tuple. I think the same thing applies to something like Annotated[str, Description("Date of birth in the DD/MM/YYYY format")].
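
To make the idea concrete, here is a rough sketch of how a generator could pick up constraints and descriptions attached at any depth of a type. `Ge`, `Le`, `Description`, and `type_schema` are hypothetical stand-ins defined here for illustration (not msgspec or annotated-types API), and only primitives and fixed-length tuples are handled:

```python
from dataclasses import dataclass
from typing import Annotated, Any, Dict, Tuple, get_args, get_origin


@dataclass(frozen=True)
class Ge:
    value: float


@dataclass(frozen=True)
class Le:
    value: float


@dataclass(frozen=True)
class Description:
    text: str


_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def type_schema(tp: Any) -> Dict[str, Any]:
    """Translate a (possibly Annotated, possibly nested) type into a schema."""
    if get_origin(tp) is Annotated:
        base, *extras = get_args(tp)
        schema = type_schema(base)
        for extra in extras:
            if isinstance(extra, Ge):
                schema["minimum"] = extra.value
            elif isinstance(extra, Le):
                schema["maximum"] = extra.value
            elif isinstance(extra, Description):
                schema["description"] = extra.text
        return schema
    if get_origin(tp) is tuple:
        # Fixed-length tuple: one sub-schema per position.
        items = [type_schema(arg) for arg in get_args(tp)]
        return {"type": "array", "prefixItems": items,
                "minItems": len(items), "maxItems": len(items)}
    return {"type": _JSON_TYPES[tp]}


Latitude = Annotated[float, Ge(-90), Le(90), Description("Degrees north of the equator")]
Longitude = Annotated[float, Ge(-180), Le(180), Description("Degrees east of Greenwich")]
Position = Tuple[Latitude, Longitude]

position_schema = type_schema(Position)
```

Because the metadata travels with the type alias, each element of `Position` keeps its own bounds and description without the tuple itself needing any annotation.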
> Knowing very little about how people actually use json, I think some of the default behavior pydantic uses makes sense:
> - `title` defaults to a `Struct` class name if not overridden in some way
> - `description` defaults to a `Struct` docstring if not overridden in some way
> - Otherwise, by default only generate constraint/type level schema information

I agree with using the struct class name and description by default. I imagine users won't even need to deal with the slight wonkiness of my example above (Annotated[_Person, ...]), which IMO also makes that less of an issue because it'll be infrequent and when it is used it does work, even if it is not super user friendly.
> 1. Use a `Field` type, assigned as a field default
>
> This mirrors the api of dataclasses or pydantic.
>
> - Familiar to users of dataclasses/pydantic/attrs

Pydantic does support PEP593 and will support annotated-types in the future.

> - Provides a consistent place to put other metadata for a field
> - Doesn't clutter the type annotation
> - Multiline descriptions would still be readable

I'm not sure how these are any different or better:
```python
class Position:
    latitude: float = Field(
        ge=-90,
        le=90,
        description=(
            "Angular distance of a place north or south of the earth's equator."
            " Foo bar baz, lorem ipsum dolor amet?"
        ),
    )
    longitude: float
```

```python
class Position:
    latitude: Annotated[
        float,
        Ge(-90),
        Le(90),
        Description(
            "Angular distance of a place north or south of the earth's equator."
            " Foo bar baz, lorem ipsum dolor amet?"
        ),
    ]
    longitude: float
```
I feel like these are about the same in terms of clutter and verbosity. Formatters (black) can handle both just fine.
And in both cases I would recommend doing what I did in the first example and breaking these out into their own types, or maybe breaking out just latitude_description = "..." or any other variation.
> Cons
> - Metadata added to non-struct-field types not supported this way. We'd either not support that at all, or would need a different mechanism.

I think this is a non-minor issue.
I'll add another one:
- Cannot be applied to complex types.
Since this can only happen at the field level, I think you can't apply it to unions. With the Annotated variant, for example, you can do something like:
```python
UnixTimestamp = Annotated[float, Ge(0)]
DateTime = Annotated[str, Pattern(...)]

class LogRecord:
    timestamp: DateTime | UnixTimestamp
```
I don't think you'd be able to express this using default values.
> 2. Use a `meta` annotation, as part of the type annotation
>
> Cons
> - Some other mechanism would be needed for modifying metadata at the struct level, since this only applies to fields

As per discussion above, I think most of the time the metadata at the struct level will be just title and description, which like you said can be captured from the class name and docstring. And for the more complex cases there is the option of wrapping the struct with Annotated[...] which I do agree is not ideal but is nice in the sense that it's not a special case but rather just more of the same.
With respect to using Annotated[..., Field(...)], I would recommend making Field() just return an iterable of the more granular constraints, which gets unpacked with Annotated[..., *Field(...)]. That avoids introducing complexity / new types that need to be parsed: once unpacked into the granular constraints, the same code that parses those can also handle Field().
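
A small sketch of what that could look like. `Ge`, `Le`, `Description`, and this `Field` are hypothetical definitions for illustration only. The star-unpacking inside a subscript needs Python 3.11+, so the sketch passes an explicit tuple, which subscripts the same way and also works on older versions:

```python
from dataclasses import dataclass
from typing import Annotated, Optional, Tuple, get_args


@dataclass(frozen=True)
class Ge:
    value: float


@dataclass(frozen=True)
class Le:
    value: float


@dataclass(frozen=True)
class Description:
    text: str


def Field(
    *,
    ge: Optional[float] = None,
    le: Optional[float] = None,
    description: Optional[str] = None,
) -> Tuple[object, ...]:
    """Return the granular constraint objects instead of a new wrapper type."""
    parts = []
    if ge is not None:
        parts.append(Ge(ge))
    if le is not None:
        parts.append(Le(le))
    if description is not None:
        parts.append(Description(description))
    return tuple(parts)


# Equivalent to Annotated[int, *Field(...)] on Python 3.11+.
Age = Annotated[(int, *Field(ge=0, description="Age in years"))]

# Anything that already understands Ge/Description sees them directly.
age_args = get_args(Age)
```

After unpacking, `Age` is indistinguishable from a hand-written `Annotated[int, Ge(0), Description("Age in years")]`, so no consumer needs to know that `Field` exists.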
> 4. Specify jsonschema metadata directly

I don't like these much.
- You'd miss any interoperability with Hypothesis, Pydantic, etc.
- It doesn't make things less cluttered; it's just moving the clutter somewhere else and away from the field/type it's impacting.
- This would be really hard to maintain for users because of drift between the JSON schema and fields/types.
This said, I do think it's good to have a fully flexible way to write or modify the schema. So maybe v0 of this feature can be that:
```python
class Struct:
    @classmethod
    def __modify_json_schema__(cls, schema: JsonSchema) -> JsonSchema:
        return schema
```
The API is that you get the `schema` that msgspec generated and modify or replace it. V0 would actually be called like `Person.__modify_json_schema__({})`. In other words, users would have to hardcode their schema:
```python
from msgspec import Struct, json_schema, JsonSchema

class Person(Struct):
    @classmethod
    def __modify_json_schema__(cls, schema: JsonSchema) -> JsonSchema:
        return {
            "title": "The title for the type",
            "description": "The description for the type",
            "properties": {
                "name": {"description": "The user's name"},
                "age": {"description": "The user's age"},
            },
        }

json_schema(Person)
```
Where the initial implementation of json_schema is:
```python
from typing import Any

def json_schema(__tp: Any) -> JsonSchema:
    assert issubclass(__tp, Struct), "Only Structs are supported for now"
    return __tp.__modify_json_schema__({})
```
This lets you punt on the other decisions while offering some basic support, with the ability to expand what and how json_schema introspects into the type to extract metadata from default values, type definitions, etc.
In summary what I would do is:
- Add the `__modify_json_schema__` method and the top level `json_schema` method.
- Make `json_schema` smart enough to pick up basic type information
- Make `json_schema` smart enough to understand constraints packed into `Annotated` (`age: Annotated[int, Ge(0)]`).
- Add conveniences like `Annotated[..., *Field(...)]`.
- Handle the trickier bits like nested `Annotated`s.
Sorry, to clarify - I'm only talking about different ways to represent the non-functional metadata (title, description, ...). I agree that there are use cases for adding constraints to complex/nested types, in which case Annotated types seem fine if not ideal. However, all the example json schemas I've found only have descriptions set for object fields.
If we rely on annotated types alone, that means that any user that wants to use msgspec to document the api will need to wrap every field in an Annotated annotation. Looking at existing specs it seems like constrained types are less common, but descriptions are common. This seems too verbose IMO, if possible I'd like to find a concise and readable api, if only for adding descriptions (and other documentation-style metadata) to struct fields.
```python
# just to document this
class User(msgspec.Struct):
    name: str
    age: int

# you'd have to transform it to this
class User(msgspec.Struct):
    name: Annotated[str, Description("...")]
    age: Annotated[int, Description("...")]
```
Description, examples, title, ... are all likely to be longer strings. Even if black can format it, it still will turn every struct field into a larger block. Moving the descriptions to a different location would IMO increase readability of the code bits that have actual runtime effects (the type/constraints and default value).
Perhaps adding a constraint to age would help:
```python
class User(msgspec.Struct):
    name: str
    age: Annotated[int, Ge(0)]

    __json_schema__ = {
        "title": "The title for the type",
        "description": "The description for the type",
        "properties": {
            "name": {"description": "The user's name"},
            "age": {"description": "The user's age. This might be a really long or multiline string."},
        },
    }
```
Type/constraints shouldn't go in `__json_schema__`, since those should be inferred from the field types. We could also make it an error to document properties on a struct that don't exist (e.g. a user deletes a struct field, but forgets to delete the bit that's separated in `__json_schema__`).
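
A sketch of what that check might look like, again with a plain annotated class standing in for a `Struct`. `check_documented_properties` is a hypothetical helper, and raising `TypeError` is just one possible choice:

```python
from typing import Any, Dict, get_type_hints


def check_documented_properties(cls: type, extra: Dict[str, Any]) -> None:
    """Raise if the manual schema documents a property that is not a field."""
    fields = set(get_type_hints(cls))
    unknown = sorted(set(extra.get("properties", {})) - fields)
    if unknown:
        raise TypeError(
            f"{cls.__name__}.__json_schema__ documents unknown fields: "
            + ", ".join(unknown)
        )


class User:
    name: str
    age: int

    # "email" was deleted from the fields but left behind here by mistake.
    __json_schema__ = {
        "properties": {
            "name": {"description": "The user's name"},
            "email": {"description": "The user's email"},
        },
    }


try:
    check_documented_properties(User, User.__json_schema__)
except TypeError as exc:
    error_message = str(exc)
else:
    error_message = None
```

In a real implementation this would presumably run once at class-definition time, so the stale entry is caught on import instead of when the schema is generated.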
I don't love this solution, but I also don't hate it. I'm mainly worried about trying to make the common case readable, which I'm guessing is:
- no constraints
- just adding a description to each field of the json object
Thoughts? Thanks for bouncing ideas here, your feedback on this issue is very welcome.
I see. If this is just for title/description, I am less concerned. While I still think there is potential for some ecosystem of "universal json schema generator" that would require a cross-library standard for where to put this information, I do think that's not as important as having type/validation interoperability for Hypothesis, sharing a type between msgspec and Pydantic, etc.
I guess let me ask this: could you support both usages? The Annotated[str, Description("...")] version for users that may want to interoperate with other libraries and don't care about the verbosity and then the other usage for other use cases?
What do you think of this as a way to (1) make sure you can't set a description for a field that doesn't exist and (2) do the modify schema thing I was talking about:
```python
class User(msgspec.Struct):
    name: str
    age: Annotated[int, Ge(0)]

    @classmethod
    def __modify_json_schema__(cls, schema: JsonSchema) -> JsonSchema:
        schema["title"] = "Override User's Title"
        schema["description"] = "The description for the type"
        schema["properties"]["name"]["description"] = "The user's name"
        return schema
```
msgspec would create the structure and the user would fill in the data. Or they could just return a new dict in which case they can set a description for a non-existent field, and it would look a lot like what you have above.
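
Roughly, the generator side of that contract could look like the sketch below. It uses a plain annotated class in place of a `Struct` and handles only primitive field types. `json_schema` and `JsonSchema` follow the names proposed in this thread but are defined locally here as assumptions, not as existing msgspec API:

```python
from typing import Any, Dict, get_type_hints

JsonSchema = Dict[str, Any]

_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def json_schema(tp: type) -> JsonSchema:
    """Generate the structure first, then let the type modify or replace it."""
    schema: JsonSchema = {
        "title": tp.__name__,
        "type": "object",
        "properties": {
            name: {"type": _JSON_TYPES[hint]}
            for name, hint in get_type_hints(tp).items()
        },
    }
    hook = getattr(tp, "__modify_json_schema__", None)
    if hook is not None:
        schema = hook(schema)
    return schema


class User:
    name: str
    age: int

    @classmethod
    def __modify_json_schema__(cls, schema: JsonSchema) -> JsonSchema:
        schema["title"] = "Override User's Title"
        # A typo in the field name raises KeyError, since the generator
        # only creates entries for fields that actually exist.
        schema["properties"]["name"]["description"] = "The user's name"
        return schema


user_schema = json_schema(User)
```

Since the hook receives a pre-built `properties` mapping, documenting a non-existent field fails loudly unless the user deliberately returns a brand-new dict.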