msgspec Support json-schema generation

It would be helpful if JSON schema generation were supported, for example:

from typing import Optional, Set

import msgspec


class User(msgspec.Struct):
    """A new type describing a User"""
    name: str
    groups: Set[str] = set()
    email: Optional[str] = None

# proposed API (does not exist yet)
schema = User.json_schema()

Similar to the functionality seen in https://github.com/s-knibbs/dataclasses-jsonschema

Jun 19 '22 20:06 old-ocean-creature

Thanks for opening this. This is definitely in scope, but isn't something I plan on spending much time on immediately.

Note that to avoid polluting the Struct method namespace, this should probably be a top-level method in the msgspec.json namespace instead. Something like msgspec.json.json_schema perhaps?

Jun 20 '22 18:06 jcrist

I think supporting json schema would be critical for msgspec to be an alternative to Pydantic

to avoid polluting the Struct method namespace, this should probably be a top-level method in the msgspec.json namespace instead. Something like msgspec.json.json_schema perhaps?

👍 on this, I think having these methods be top-level is a much better idea. Ideally Struct would have no methods to avoid name collisions (e.g. someone wants a field called json_schema).

Jul 23 '22 16:07 adriangb

I'm not an active json-schema user - a couple questions/comments:

We can fairly easily support generating a simple json-schema spec from a msgspec annotated type. I expect this to be no more than a couple hours work to get this functioning + tests + docs.
We don't currently have a mechanism for adding extra info to the generated json-schema (title, description, ...). How important are those (or, more directly, if we release a version of json_schema that doesn't support those, will anyone find it still useful)? How do other systems handle this? What metadata fields like these should we support?
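
To make the "simple json-schema spec from an annotated type" idea concrete, here is a minimal sketch using only the standard library. It uses a plain dataclass as a stand-in for a Struct, and `type_to_schema` / `object_schema` are hypothetical names chosen for illustration, not msgspec functions. It only covers a handful of types and emits no title or description.

```python
from dataclasses import MISSING, dataclass, field, fields
from typing import Any, Dict, Optional, Set, Union, get_args, get_origin, get_type_hints

# Mapping from Python scalar types to JSON Schema type names.
_SCALARS = {str: "string", int: "integer", float: "number", bool: "boolean", type(None): "null"}


def type_to_schema(tp: Any) -> Dict[str, Any]:
    """Translate a small subset of type annotations into JSON Schema."""
    if tp in _SCALARS:
        return {"type": _SCALARS[tp]}
    origin, args = get_origin(tp), get_args(tp)
    if origin in (list, set, frozenset):
        schema: Dict[str, Any] = {"type": "array", "items": type_to_schema(args[0])}
        if origin is not list:
            schema["uniqueItems"] = True  # sets cannot hold duplicates
        return schema
    if origin is Union:  # also covers Optional[X]
        return {"anyOf": [type_to_schema(arg) for arg in args]}
    raise TypeError(f"unsupported type: {tp!r}")


def object_schema(cls: type) -> Dict[str, Any]:
    """Build an object schema; fields without a default are required."""
    required = [
        f.name
        for f in fields(cls)
        if f.default is MISSING and f.default_factory is MISSING
    ]
    return {
        "type": "object",
        "properties": {name: type_to_schema(tp) for name, tp in get_type_hints(cls).items()},
        "required": required,
    }


@dataclass
class User:
    name: str
    groups: Set[str] = field(default_factory=set)
    email: Optional[str] = None


schema = object_schema(User)
```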

Jul 23 '22 17:07 jcrist

I think an initial implementation without the metadata would be a good place to start.

As to what metadata to include, I think it would be good to support title and description. There is also other metadata like "pattern" or "format", but that dovetails with constraints/validation. As to how to express that metadata in Python, I would recommend PEP593 (see #154).

Jul 23 '22 17:07 adriangb

As to how to express that metadata in Python, I would recommend PEP593 (see https://github.com/jcrist/msgspec/issues/154).

In your mind, how would the non-functional annotations (title, description) be specified on the schema (do these go on fields? On struct types themselves?)? An example user session would be helpful here.

Jul 23 '22 17:07 jcrist

maybe something like:

PhoneNumber = Annotated[str, Pattern(r"\+1\d{10}"), Description("Phone number. US only. No dashes")]

class _Person(Struct):
    cel: PhoneNumber

Person = Annotated[_Person, Description("A human")]

Users can choose how to handle the naming (_ prefix, Person and PersonSchema, etc).

Jul 23 '22 18:07 adriangb

Note: apologies for the long response

Hmmm, I don't love that (for the description bit). While it's important that whatever syntax is chosen is mypy/pyright compatible, I find it more important that it's easy to read, and doesn't require knowledge of complicated python features. Even though I'd consider myself an experienced python dev, I had to verify that Person was still callable, since it's no longer exactly a Struct class type.

I think it might be helpful to distinguish between json-schema metadata that affects runtime behavior (meaning constraints like Pattern) and metadata that's just for documentation/adding into the generated jsonschema. I think the former makes sense to handle using Annotated, since it feels more attached to a "type" to me. The latter feels more like documentation.

A few assumptions I'm making here about what's common based on reading a few example json schemas for apis I'm familiar with. Please correct me if I'm wrong on any of these:

In a well-documented api, description annotations are likely wanted on every field in an object
These descriptions are probably more tied to the field name than to field type. A struct may have several fields containing only primitive types (e.g. int), each with a different meaning and requiring a different description. As such, setting the descriptions as part of a type alias (to reduce indentation) will be unwieldy in practice.
Descriptions are most likely to be associated with an object field, and less likely to be associated with a type further down the tree. By this I mean that you might attach a description to a coordinates field (containing an array of 2-tuples of floats), but not to the individual floats or tuples. Similar to docstring parameters, the description describes the full field, not a subcomponent.

Given this, it seems likely that we'll want to optimize the spelling for specifying json-schema metadata for:

Adding metadata on a Struct type itself (title, description, examples, ...)
Adding metadata on a field (title, description, ...)

Knowing very little about how people actually use json, I think some of the default behavior pydantic uses makes sense:

title defaults to a Struct class name if not overridden in some way
description defaults to a Struct docstring if not overridden in some way
Otherwise, by default only generate constraint/type level schema information
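
A rough sketch of those defaults, assuming a hypothetical helper named `default_metadata` (not an existing msgspec function) and using a plain class in place of a Struct:

```python
import inspect
from typing import Any, Dict, Optional


def default_metadata(cls: type, overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Title from the class name, description from the docstring, unless overridden."""
    meta: Dict[str, Any] = {"title": cls.__name__}
    doc = cls.__doc__  # the class's own docstring; None when it has none
    if doc:
        meta["description"] = inspect.cleandoc(doc)
    meta.update(overrides or {})  # explicit values win over the defaults
    return meta


class User:
    """A new type describing a User"""


defaults = default_metadata(User)
overridden = default_metadata(User, {"title": "Account"})
```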

Playing around with a few ideas of how to specify the optional metadata:

1. Use a `Field` type, assigned as a field default

This mirrors the api of dataclasses or pydantic.

from msgspec import Struct, Field


class User(Struct):
    name: str = Field(description="The user's name")
    age: int = Field(
        description="""
            The user's age.

            Multiline descriptions still are readable IMO.
        """
    )

Pros

Familiar to users of dataclasses/pydantic/attrs
Provides a consistent place to put other metadata for a field
Doesn't clutter the type annotation
Would also work for forwarding additional metadata to defstruct for dynamic type definitions
Multiline descriptions would still be readable

Cons

Requires some type trickery to make it work with mypy out of the box. pyright already knows how to do this.
Metadata added to non-struct-field types not supported this way. We'd either not support that at all, or would need a different mechanism.
Some other mechanism would be needed for modifying metadata at the struct level, since this only applies to fields

2. Use a `meta` annotation, as part of the type annotation

from msgspec import Struct, Meta
from typing import Annotated


class User(Struct):
    name: Annotated[str, Meta(description="The user's name")]
    age: Annotated[
        int,
        Meta(
            description="""
            The user's age.

            Multiline description here.
            """
        ),
    ]

Pros

Can be applied anywhere in a type definition, can be used consistently across all types msgspec supports
Still works with defstruct, metadata attached to type can be passed in like normal
At the type level this makes sense

Cons

Clutters the type definition, less readable IMO. It's not terrible, but not ideal.
Some other mechanism would be needed for modifying metadata at the struct level, since this only applies to fields
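
For reference, a sketch of how a schema generator could read this kind of annotation back out. `Meta` below is a stand-in dataclass for the proposed annotation and `field_metadata` is a hypothetical helper; only standard-library typing calls are used.

```python
from dataclasses import dataclass
from typing import Annotated, Any, Dict, Optional, get_args, get_origin, get_type_hints


@dataclass(frozen=True)
class Meta:
    """Stand-in for the proposed Meta annotation (hypothetical)."""
    title: Optional[str] = None
    description: Optional[str] = None


def field_metadata(cls: type) -> Dict[str, Dict[str, Any]]:
    """Collect the Meta(...) values attached to each annotated field."""
    out: Dict[str, Dict[str, Any]] = {}
    # include_extras=True keeps the Annotated wrapper instead of stripping it.
    for name, tp in get_type_hints(cls, include_extras=True).items():
        info: Dict[str, Any] = {}
        if get_origin(tp) is Annotated:
            for extra in get_args(tp)[1:]:
                if isinstance(extra, Meta):
                    info.update({k: v for k, v in vars(extra).items() if v is not None})
        out[name] = info
    return out


class User:
    name: Annotated[str, Meta(description="The user's name")]
    age: Annotated[int, Meta(title="Age", description="The user's age")]
    email: str


collected = field_metadata(User)
```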

3. So much magic

We could make this work:

import msgspec

class User(msgspec.Struct):
    name: str
    msgspec.doc("The user's name")  # not attached to `doc` as the name

    age: int
    msgspec.doc("The user's age")

Pros

Detaches metadata from type annotations/default values. Metadata is for documentation, annotations affect runtime behavior.
I think this is pretty readable
Multiline descriptions would still be readable
Works fine with mypy/pyright out of the box

Cons

It's magical.
Metadata added to non-struct-field types not supported this way. We'd either not support that at all, or would need a different mechanism.
Wouldn't play well with defstruct, since this is highly coupled to the StructMeta metaclass.
Some other mechanism would be needed for modifying metadata at the struct level, since this only applies to fields

4. Specify jsonschema metadata directly

Group jsonschema metadata together, and specify it on the struct type in a single call. This could be done a few ways:

4a. As a dict set on the class

class User(msgspec.Struct):
    name: str
    age: int

    __json_schema__ = {
        "title": "The title for the type",
        "description": "The description for the type",
        "properties": {
            "name": {"description": "The user's name"},
            "age": {"description": "The user's age"},
        },
    }

4b. As a typed object set on the class

class User(msgspec.Struct):
    name: str
    age: int

    __json_schema__ = msgspec.json.SchemaConfig(
        title="The title for the type",
        description="The description for the type",
        properties={
            "name": {"description": "The user's name"},
            "age": {"description": "The user's age"},
        }
    )

4c. Using a class decorator

# this could also use keyword arguments instead, similar to 4b above.
@msgspec.json.with_schema(
    {
        "title": "The title for the type",
        "description": "The description for the type",
        "properties": {
            "name": {"description": "The user's name"},
            "age": {"description": "The user's age"},
        },
    }
)
class User(msgspec.Struct):
    name: str
    age: int

All of these would work similarly:

When generating the schema, the manually specified object would be first deepcopied to form a base template
The type information from the annotations/default would then be added to the template using setdefault. This means that anything manually specified would remain, and the schema generation would only fill in the blanks
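
The two steps above can be sketched as follows. `fill_defaults` is a hypothetical name, and the `generated` dict is written by hand here to stand in for whatever the type-driven generator would produce.

```python
import copy
from typing import Any, Dict


def _merge(target: Dict[str, Any], source: Dict[str, Any]) -> None:
    """Fill missing keys in `target` from `source`, recursing into nested dicts."""
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            _merge(target[key], value)
        else:
            # setdefault leaves anything the user specified untouched.
            target.setdefault(key, copy.deepcopy(value))


def fill_defaults(template: Dict[str, Any], generated: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `template` with the blanks filled in from `generated`."""
    result = copy.deepcopy(template)  # never mutate the user's object
    _merge(result, generated)
    return result


manual = {
    "title": "The title for the type",
    "properties": {"name": {"description": "The user's name"}},
}
generated = {
    "title": "User",
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}
merged = fill_defaults(manual, generated)
```

The manual title survives, while `type`, `required`, and the per-field types come from the generated side.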

Pros

Doesn't clutter type annotations
Separates things that affect runtime behavior from documentation. Very long descriptions are fine for readability, since they aren't mixed in with things affecting runtime behavior.
Gives full flexibility for specifying the json schema.
Grouping the metadata all in one spot in a way that fully matches jsonschema syntax may be nicer for users familiar with jsonschema
Straightforward to make work with defstruct
Straightforward to make work other typed-object-like-things (e.g. TypedDict)
Can customize the annotations for deeply nested fields

Cons

Grouping the metadata all in one spot may make it easier to lose alignment with the fields on the type. For example, if you add a new field you'd also have to remember to add a description in the schema bit. This would also happen in the other options, except that the descriptions are closer to the annotations visually, so there's more of a visual cue to do the right thing.
Users would need to be more familiar with the syntax for jsonschema. Using a custom object or decorator would help mitigate that a bit, since keyword arguments + types would help document what fields are expected where.

I'm pretty mixed on all of these. I think option 3 is probably out since it's the least flexible/most magical. I kinda like option 4a or 4b. Option 1 or 2 are also fine, but less ideal in my mind since they clutter up the type definitions.

Thoughts?

Jul 24 '22 18:07 jcrist

I think it might be helpful to distinguish between json-schema metadata that affects runtime behavior (meaning constraints like Pattern) and metadata that's just for documentation/adding into the generated jsonschema. I think the former makes sense to handle using Annotated, since it feels more attached to a "type" to me. The latter feels more like documentation.

I think this is a valid point. Tools like hypothesis should not depend on the title/description, just on the type constraints.

Descriptions are most likely to be associated with an object field, and less likely to be associated with a type further down the tree. By this I mean that you might attach a description to a coordinates field (containing an array of 2-tuples of floats), but not to the individual floats or tuples. Similar to docstring parameters, the description describes the full field, not a subcomponent.

I think this is probably true in the example you gave, but not in general. For example:

Latitude = Annotated[
    float,
    Ge(-90), Le(90),
    Description("Angular distance of a place north or south of the earth's equator")
]
Longitude = Annotated[
    float,
    Ge(-180), Le(180),
    Description(
        "Angular distance of a place east (positive) or"
        " west (negative) of the meridian at Greenwich, England"
     )
]
Position = Tuple[Latitude, Longitude]

Here I'm attaching important constraints and descriptions to raw types, not the tuple. I think the same thing applies to something like Annotated[str, Description("Date of birth in the DD/MM/YYYY format")].
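
A sketch of how metadata on the element types could flow into the generated schema. `Ge`, `Le`, and `Description` are hypothetical marker classes, `to_schema` is an illustrative helper, and the fixed-length tuple is spelled with `prefixItems`, which assumes JSON Schema draft 2020-12.

```python
from dataclasses import dataclass
from typing import Annotated, Any, Dict, Tuple, get_args, get_origin


@dataclass(frozen=True)
class Ge:
    value: float


@dataclass(frozen=True)
class Le:
    value: float


@dataclass(frozen=True)
class Description:
    text: str


def to_schema(tp: Any) -> Dict[str, Any]:
    """Convert float, Annotated[...], and fixed-length Tuple[...] to JSON Schema."""
    origin = get_origin(tp)
    if origin is Annotated:
        base, *extras = get_args(tp)
        schema = to_schema(base)
        for extra in extras:
            if isinstance(extra, Ge):
                schema["minimum"] = extra.value
            elif isinstance(extra, Le):
                schema["maximum"] = extra.value
            elif isinstance(extra, Description):
                schema["description"] = extra.text
        return schema
    if origin is tuple:
        items = [to_schema(arg) for arg in get_args(tp)]
        return {"type": "array", "prefixItems": items, "minItems": len(items), "maxItems": len(items)}
    if tp is float:
        return {"type": "number"}
    raise TypeError(f"unsupported type: {tp!r}")


Latitude = Annotated[float, Ge(-90), Le(90), Description("Degrees north or south of the equator")]
Longitude = Annotated[float, Ge(-180), Le(180), Description("Degrees east or west of Greenwich")]
Position = Tuple[Latitude, Longitude]

position_schema = to_schema(Position)
```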

Knowing very little about how people actually use json, I think some of the default behavior pydantic uses makes sense:

title defaults to a Struct class name if not overridden in some way

description defaults to a Struct docstring if not overridden in some way

Otherwise, by default only generate constraint/type level schema information

I agree with using the struct class name and description by default. I imagine users won't even need to deal with the slight wonkiness of my example above (Annotated[_Person, ...]), which IMO also makes that less of an issue because it'll be infrequent and when it is used it does work, even if it is not super user friendly.

1. Use a Field type, assigned as a field default

This mirrors the api of dataclasses or pydantic.

Familiar to users of dataclasses/pydantic/attrs

Pydantic does support PEP593 and will support annotated-types in the future.

Provides a consistent place to put other metadata for a field

Doesn't clutter the type annotation

Multiline descriptions would still be readable

I'm not sure how these are any different or better:

class Position:
    latitude: float = Field(
        ge=-90,
        le=90,
        description=(
             "Angular distance of a place north or south of the earth's equator."
             " Foo bar baz, lorem ipsum dolor amet?"
        ),
    )
    longitude: float

class Position:
    latitude: Annotated[
        float,
        Ge(-90),
        Le(90),
        Description(
             "Angular distance of a place north or south of the earth's equator."
             " Foo bar baz, lorem ipsum dolor amet?"
        ),
    ]
    longitude: float

I feel like these are about the same in terms of clutter and verbosity. Formatters (black) can handle both just fine. And in both cases I would recommend doing what I did in the first example and breaking these out into their own types, or maybe breaking out just latitude_description = "..." or any other variation.

Cons

Metadata added to non-struct-field types not supported this way. We'd either not support that at all, or would need a different mechanism.

I think this is a non-minor issue.

I'll add another one:

Cannot be applied to complex types.

Since this can only happen at the field level, I don't think you can apply it to unions. With the Annotated variant, for example, you can do something like:

UnixTimestamp = Annotated[float, Ge(0)]
DateTime = Annotated[str, Pattern(...)]

class LogRecord:
    timestamp: DateTime | UnixTimestamp

I don't think you'd be able to express this using default values.
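
To show why per-member annotations matter here, a sketch that turns such a union into an `anyOf` with a different constraint on each branch. `Ge`, `Pattern`, and `to_schema` are hypothetical, and the regex is a placeholder since the pattern above is left unspecified.

```python
from dataclasses import dataclass
from typing import Annotated, Any, Dict, Union, get_args, get_origin


@dataclass(frozen=True)
class Ge:
    value: float


@dataclass(frozen=True)
class Pattern:
    regex: str


_SCALARS = {str: "string", float: "number"}


def to_schema(tp: Any) -> Dict[str, Any]:
    """Handle str, float, Annotated[...], and Union[...] only."""
    origin = get_origin(tp)
    if origin is Union:
        # Each member keeps its own Annotated metadata.
        return {"anyOf": [to_schema(arg) for arg in get_args(tp)]}
    if origin is Annotated:
        base, *extras = get_args(tp)
        schema = to_schema(base)
        for extra in extras:
            if isinstance(extra, Ge):
                schema["minimum"] = extra.value
            elif isinstance(extra, Pattern):
                schema["pattern"] = extra.regex
        return schema
    return {"type": _SCALARS[tp]}


UnixTimestamp = Annotated[float, Ge(0)]
DateTime = Annotated[str, Pattern(r"^\d{4}-\d{2}-\d{2}T")]  # placeholder regex, for illustration only

timestamp_schema = to_schema(Union[DateTime, UnixTimestamp])
```

A single field-level default value has nowhere to say "the pattern applies to the string branch and the minimum to the number branch".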

2. Use a meta annotation, as part of the type annotation

Cons

Some other mechanism would be needed for modifying metadata at the struct level, since this only applies to fields

As per discussion above, I think most of the time the metadata at the struct level will be just title and description, which like you said can be captured from the class name and docstring. And for the more complex cases there is the option of wrapping the struct with Annotated[...] which I do agree is not ideal but is nice in the sense that it's not a special case but rather just more of the same.

With respect to using Annotated[..., Field(...)], I would recommend making Field() just return an iterable of the more granular constraints, destructured with Annotated[..., *Field(...)]. This avoids introducing complexity / new types that need to be parsed: once unpacked into the more granular constraints, the same code that parses those can parse Field().
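
A sketch of that idea with hypothetical `Ge`, `Le`, and `Description` markers and a hypothetical `Field` helper. The star-unpacking inside the subscript needs Python 3.11+; subscripting with an explicit tuple is equivalent and is used below so the sketch also runs on older versions.

```python
from dataclasses import dataclass
from typing import Annotated, Optional, Tuple, get_args


@dataclass(frozen=True)
class Ge:
    value: float


@dataclass(frozen=True)
class Le:
    value: float


@dataclass(frozen=True)
class Description:
    text: str


def Field(
    *,
    ge: Optional[float] = None,
    le: Optional[float] = None,
    description: Optional[str] = None,
) -> Tuple[object, ...]:
    """Expand keyword arguments into the granular markers (hypothetical helper)."""
    parts = []
    if ge is not None:
        parts.append(Ge(ge))
    if le is not None:
        parts.append(Le(le))
    if description is not None:
        parts.append(Description(description))
    return tuple(parts)


# Same as Annotated[float, *Field(...)] on Python 3.11+.
Latitude = Annotated[(float, *Field(ge=-90, le=90, description="Degrees from the equator"))]

# Downstream code only ever sees the granular markers, never a Field object.
extras = get_args(Latitude)[1:]
```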

4. Specify jsonschema metadata directly

I don't like these much.

You'd miss any interoperability with Hypothesis, Pydantic, etc.
It doesn't make things less cluttered; it just moves the clutter somewhere else, away from the field/type it affects.
This would be really hard to maintain for users because of drift between the JSON schema and fields/types.

This said, I do think it's good to have a fully flexible way to write or modify the schema. So maybe v0 of this feature can be that:

```python
class Struct:
    @classmethod
    def __modify_json_schema__(cls, schema: JsonSchema) -> JsonSchema:
        return schema
```

The API is that you get the `schema` that msgspec generated and modify or replace it. V0 would actually be called like `Person.__modify_json_schema__({})`. In other words, users would have to hardcode their schema:

```python
# json_schema and JsonSchema are the names proposed in this comment
from msgspec import Struct, json_schema, JsonSchema

class Person(Struct):
    @classmethod
    def __modify_json_schema__(cls, schema: JsonSchema) -> JsonSchema:
        return {
            "title": "The title for the type",
            "description": "The description for the type",
            "properties": {
                "name": {"description": "The user's name"},
                "age": {"description": "The user's age"},
            },
        }

json_schema(Person)
```

Where the initial implementation of json_schema is:

def json_schema(__tp: Any) -> JsonSchema:
    assert issubclass(__tp, Struct), "Only Structs are supported for now"
    return __tp.__modify_json_schema__({})

This lets you punt on the other decisions while offering some basic support, with the ability to expand what and how json_schema introspects into the type to extract metadata from default values, type definitions, etc.

In summary what I would do is:

Add the __modify_json_schema__ method and the top level json_schema method.
Make json_schema smart enough to pick up basic type information
Make json_schema smart enough to understand constraints packed into Annotated (age: Annotated[int, Ge(0)]).
Add conveniences like Annotated[..., *Field(...)].
Handle the trickier bits like nested Annotateds.

Jul 24 '22 23:07 adriangb

Sorry, to clarify - I'm only talking about different ways to represent the non-functional metadata (title, description, ...). I agree that there are use cases for adding constraints to complex/nested types, in which case Annotated types seem fine if not ideal. However, all the example json schemas I've found only have descriptions set for object fields.

If we rely on annotated types alone, that means that any user that wants to use msgspec to document the api will need to wrap every field in an Annotated annotation. Looking at existing specs it seems like constrained types are less common, but descriptions are common. This seems too verbose IMO. If possible, I'd like to find a concise and readable api, if only for adding descriptions (and other documentation-style metadata) to struct fields.

# just to document this
class User(msgspec.Struct):
    name: str
    age: int

# you'd have to transform it to this
class User(msgspec.Struct):
    name: Annotated[str, Description("...")]
    age: Annotated[int, Description("...")]

Description, examples, title, ... are all likely to be longer strings. Even if black can format it, it still will turn every struct field into a larger block. Moving the descriptions to a different location would IMO increase readability of the code bits that have actual runtime effects (the type/constraints and default value).

Perhaps adding a constraint to age would help:

class User(msgspec.Struct):
    name: str
    age: Annotated[int, Ge(0)]

    __json_schema__ = {
        "title": "The title for the type",
        "description": "The description for the type",
        "properties": {
            "name": {"description": "The user's name"},
            "age": {"description": "The user's age. This might be a really long or multiline string."},
        },
    }

Type/constraints shouldn't go in __json_schema__, since those should be inferred from the field types. We could also make it an error to document properties on a struct that don't exist (e.g. user deletes a struct field, but forgets to delete the bit that's separated in __json_schema__).
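
A sketch of what that error check could look like, using plain classes in place of Structs. `check_documented_properties` is a hypothetical name, and in practice the check would presumably run at class-definition or schema-generation time.

```python
from typing import Any, Dict, get_type_hints


def check_documented_properties(cls: type) -> None:
    """Raise if __json_schema__ documents a property that is not a field."""
    schema: Dict[str, Any] = getattr(cls, "__json_schema__", {})
    documented = set(schema.get("properties", {}))
    declared = set(get_type_hints(cls))  # only annotated names count as fields
    unknown = sorted(documented - declared)
    if unknown:
        raise ValueError(f"{cls.__name__}: documented but undefined fields: {unknown}")


class User:
    name: str
    age: int

    __json_schema__ = {
        "properties": {
            "name": {"description": "The user's name"},
            "age": {"description": "The user's age"},
        },
    }


class Stale:
    name: str  # the "email" field was deleted, but its description was not

    __json_schema__ = {
        "properties": {
            "name": {"description": "The user's name"},
            "email": {"description": "The user's email"},
        },
    }


check_documented_properties(User)  # passes silently
```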

I don't love this solution, but I also don't hate it. I'm mainly worried about trying to make the common case readable, which I'm guessing is:

no constraints
just adding a description to each field of the json object

Thoughts? Thanks for bouncing ideas here, your feedback on this issue is very welcome.

Jul 25 '22 00:07 jcrist

I see. If this is just for title/description, I am less concerned. While I still think there is potential for some ecosystem of "universal json schema generator" that would require a cross-library standard for where to put this information, I do think that's not as important as having type/validation interoperability for Hypothesis, sharing a type between msgspec and Pydantic, etc.

I guess let me ask this: could you support both usages? The Annotated[str, Description("...")] version for users that may want to interoperate with other libraries and don't care about the verbosity and then the other usage for other use cases?

What do you think of this as a way to (1) make sure you can't set a description for a field that doesn't exist and (2) do the modify schema thing I was talking about:

class User(msgspec.Struct):
    name: str
    age: Annotated[int, Ge(0)]

    @classmethod
    def __modify_json_schema__(cls, schema: JsonSchema) -> JsonSchema:
        schema["title"] = "Override User's Title"
        schema["description"] = "The description for the type"
        schema["properties"]["name"]["description"] = "The user's name"
        return schema

msgspec would create the structure and the user would fill in the data. Or they could just return a new dict in which case they can set a description for a non-existent field, and it would look a lot like what you have above.
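
End to end, the flow could look roughly like this. It is a standard-library sketch: `json_schema` is a simplified stand-in for the proposed top-level function, `JsonSchema` is just a dict alias, and a plain class replaces the Struct.

```python
from typing import Any, Dict, get_type_hints

JsonSchema = Dict[str, Any]
_SCALARS = {str: "string", int: "integer", float: "number", bool: "boolean"}


def json_schema(cls: type) -> JsonSchema:
    """Generate the skeleton from the annotations, then let the class adjust it."""
    schema: JsonSchema = {
        "title": cls.__name__,
        "type": "object",
        "properties": {
            name: {"type": _SCALARS[tp]} for name, tp in get_type_hints(cls).items()
        },
    }
    hook = getattr(cls, "__modify_json_schema__", None)
    return hook(schema) if hook is not None else schema


class User:
    name: str
    age: int

    @classmethod
    def __modify_json_schema__(cls, schema: JsonSchema) -> JsonSchema:
        schema["title"] = "Override User's Title"
        # Indexing a field that does not exist would raise KeyError here.
        schema["properties"]["name"]["description"] = "The user's name"
        return schema


user_schema = json_schema(User)
```

Because the hook indexes into a structure the library already built, a typo in a field name fails loudly instead of silently adding a stray property.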

Jul 25 '22 00:07 adriangb
