Support annotated-types for metadata and constraint specification
My understanding is that msgspec currently only supports basic type/schema validation, but not more complex validations like a regex pattern or arbitrary user functions, which Pydantic and other validators do support.
I think it would be really interesting if msgspec adopted annotated-types (maybe as an optional dependency), which would not only enable an API for these sorts of constraints but also:
- Improve interoperability with other libraries (Pydantic V2 will support annotated-types, so users could go back and forth or share types between the two libraries)
- Get automatic support for other parts of the ecosystem like Hypothesis
Thanks for opening this, it highlights an important distinction between the libraries.
msgspec.Struct constructors do no runtime type validation (unlike those of pydantic.BaseModel). The argument for this is that type annotations + static analysis tools (or tests) should be sufficient for ensuring the correctness of your own code; you shouldn't have to pay the cost of runtime type checking for code you already know to be correct. As such, msgspec (currently) only supports type-level constraints that can be statically validated by tools like mypy or pyright. A constraint like >= 0 can't be statically validated, so it's not clear what msgspec should do here.
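To make the current behavior concrete, a small example against msgspec's existing API:

import msgspec

class Point(msgspec.Struct):
    x: int
    y: int

# the constructor does no runtime type checking: this succeeds even though
# "a" is not an int (mypy or pyright would flag it statically instead)
p = Point("a", 2)

# decoding external data does enforce the annotated types: this raises an error
msgspec.json.decode(b'{"x": "a", "y": 2}', type=Point)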
Right now I have a few ideas:
- Don't support these features. I personally don't need them. If we think of msgspec as comparable to Rust's serde or Go's encoding/json, then we're already feature complete without them; those tools also don't handle value validation.
- We do add support, but the checks are only applied during decoding, not in the constructors. I'm pretty adamant that constructors should always be fast. Instead, we'd add a new function for running validators on existing data. This might look like:
from typing import Annotated

import msgspec
from annotated_types import Ge  # the constraint types under discussion

class Point(msgspec.Struct):
    x: Annotated[int, Ge(0)]
    y: Annotated[int, Ge(0)]

# no value validation happens here
p = Point(1, 2)

# this means that this won't error either
bad = Point(-1, 0)

# value validation checks must be explicitly run; this raises a ValidationError
msgspec.validate(bad)

points = [Point(1, 2), Point(3, 4), Point(5, 6)]

# for non-struct types, a type can be passed in explicitly
msgspec.validate(points, type=list[Point])

# validators are also automatically run during decoding; this raises a ValidationError
msgspec.json.decode(b'{"x": -1, "y": 0}', type=Point)
This lets the user control when (potentially expensive) validation checks are run, and by default only runs them when processing external inputs. If you know your own internal code is correct, there's no need to pay the runtime cost.
As for how to spell the validation checks, I'm -0.1 on relying on an external library here, unless that library was commonly used in the ecosystem (I recognize this is a chicken <-> egg problem). I also personally find the annotated type syntax to be a bit verbose, and if we want to support other metadata on fields (e.g. title, description, ...), we might need an additional Field type anyway to encode that. Idk, there doesn't seem to be a concise way of spelling field annotations in python currently.
Thoughts?
I think supporting these sorts of validations natively is important both for user experience and performance. I think providing validation and JSON Schema support is what made Pydantic successful, so there are obviously a lot of users that want these features in their data parsing/validation library of choice.
I 100% agree not validating constructors by default is the right move. I like your example where validation only happens explicitly or during decoding.
> I'm -0.1 on relying on an external library here, unless that library was commonly used in the ecosystem (I recognize this is a chicken <-> egg problem)
Yup totally a chicken <-> egg problem. That said, Hypothesis and Pydantic v2 already plan on supporting annotated-types. That means FastAPI, etc. will support it, hence there will soon be a relatively large ecosystem. And you could always make it an optional feature: only support it if installed, if an extra is selected, etc.
> I also personally find the annotated type syntax to be a bit verbose
One way to mitigate this is type aliases, which work great for many common constraints:
from annotated_types import IsDigit
PhoneNumber = IsDigit[str]
The same could be done for Positive (YearsOfExperience = Positive[int]).
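For comparison, here's a sketch of the same aliases spelled with plain Annotated and constraint classes that exist in annotated-types today (Predicate and Gt), without assuming any subscript sugar:

from typing import Annotated
from annotated_types import Gt, Predicate

# equivalent aliases via plain Annotated; more verbose, but needs no
# special support beyond PEP 593
PhoneNumber = Annotated[str, Predicate(str.isdigit)]
YearsOfExperience = Annotated[int, Gt(0)]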
But yes, as soon as users need to input data (Ge(0)) it can get a bit verbose. PEP 593 explicitly calls this out and I hope that at some point they'll add support for easier parametrization (Ge[int, 0], so that you don't have to import and use Annotated everywhere).
> if we want to support other metadata on fields (e.g. title, description, ...), we might need an additional Field type anyway to encode that
One thing you could do is make a Field object with an __iter__ method that can be unpacked into multiple constraints:
from typing import Annotated

from msgspec import Field  # hypothetical: a Field object with __iter__

PhoneNumber = Annotated[str, *Field(pattern=r"", description="cel", min_len=123)]
(see https://github.com/annotated-types/annotated-types/blob/main/annotated_types/__init__.py#L132-L154)
This is only valid in Python 3.11+ though, since unpacking inside a subscript was only allowed by PEP 646.
In summary, yes, this isn't perfect, but I think it's the best option we have right now that doesn't do wonky stuff like mess with default values or create a library-specific way to attach metadata. And I believe the situation will get better in the future (if there is a future PEP to make it less verbose, if an ecosystem of common metadata arises, once everyone is on 3.11+ and can unpack metadata, etc.)
One thing I'll note about the annotated-types design that might not be immediately obvious: the validation implementation is not encoded into the metadata, which would allow msgspec to compile Ge(0), Predicate(str.islower) or Pattern(r"Foo") into some fast C code. And if users keep a Decoder object around, I think it should be possible to "compile" a decoder ahead of time so that there is zero overhead if no constraints are included.
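A rough sketch of that idea in pure Python (msgspec would do the equivalent in C; compile_checks is illustrative, not a real API):

import typing
import annotated_types as at

def compile_checks(tp):
    # inspect Annotated metadata once, up front, and build a list of plain
    # check functions; a decoder built from these pays zero per-call cost
    # when a type carries no constraints
    checks = []
    if typing.get_origin(tp) is typing.Annotated:
        for meta in typing.get_args(tp)[1:]:
            if isinstance(meta, at.Ge):
                checks.append(lambda v, bound=meta.ge: v >= bound)
            elif isinstance(meta, at.Predicate):
                checks.append(meta.func)
    return checks

checks = compile_checks(typing.Annotated[int, at.Ge(0)])
assert all(check(5) for check in checks)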
Hmmm, ok. A few questions:
- Is supporting only the builtin constraints that are part of jsonschema (or similar) sufficient? Can we avoid supporting arbitrary python callables and still have most users be happy?
- What constraints are most common in your experience? How often does a field need a constraint as opposed to just a type?
- Can we get by without the validate function for now, only supporting validating constraints on decode?
> Is supporting only the builtin constraints that are part of jsonschema (or similar) sufficient? Can we avoid supporting arbitrary python callables and still have most users be happy?
I think it would be good for a lot of use cases, but certainly not all. Why do you want to avoid supporting arbitrary python callables?
> What constraints are most common in your experience? How often does a field need a constraint as opposed to just a type?
I don't have a survey or anything, but I'd say Len, Ge/Le/Gt/Lt and Pattern are probably the most common.
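For illustration, the common shapes spelled with annotated-types' built-in classes (the aliases themselves are made up):

from typing import Annotated
from annotated_types import Ge, Le, MaxLen, MinLen

# made-up aliases showing the common constraint shapes
Username = Annotated[str, MinLen(3), MaxLen(32)]
Percentage = Annotated[int, Ge(0), Le(100)]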
> Can we get by without the validate function for now, only supporting validating constraints on decode?
I'd say so, yes. I think that's how it will be used most of the time.
> Why do you want to avoid supporting arbitrary python callables?
Nothing against it in particular. However, the current way the internal type tree is represented makes supporting these tricky. All the builtin constraints can mostly only be applied to one type (and cases where they can apply to multiple are already explicitly forbidden in msgspec due to ambiguity). This makes our internal representation of these constraints easy.
If we support arbitrary user-defined constraints, then we'd need to modify our internals much more to be able to represent e.g.
class Example:
    a: Annotated[int, Predicate(custom_1)] | Annotated[str, Predicate(custom_2)]
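A rough sketch of the extra bookkeeping this implies: after parsing, the decoder has to know which union branch matched in order to run the right predicate (custom_1 and custom_2 here are stand-ins):

def custom_1(v: int) -> bool:  # stand-in user predicate for the int branch
    return v >= 0

def custom_2(s: str) -> bool:  # stand-in user predicate for the str branch
    return s.islower()

def check_a(value):
    # the check to run depends on which union member the value matched,
    # not just on the parsed value itself
    if isinstance(value, int):
        return custom_1(value)
    if isinstance(value, str):
        return custom_2(value)
    raise TypeError("expected int | str")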
Not sure I completely follow, but it sounds like it's internal complexity, which is a fair reason. Maybe custom predicates can be saved for later? I think they are a super valuable feature, but I do think the basic Ge ones are an easier place to start.
Yeah, it'd mess with the internals in a way that's doable but would require more thought. I don't think implementing any of the builtin constraints will require much of a refactor - the main issue there is how users spell them (tying into #125).
Posting here because this is specific to annotated-types although it is also related to the discussion in #125.
I think I at least found a solution for the discussion about Annotated[..., Field(...)] vs. Annotated[..., *Field(...)]: https://github.com/annotated-types/annotated-types/commit/f59cf6d1b5255a0fe359b93896759a180bec30ae
With that change anything that is compatible with annotated-types would be able to interpret Annotated[..., Field(...)]. So, for example, you could define a type Annotated[int, msgspec.Field(ge=0)] and Hypothesis or even Pydantic would be able to interpret it.
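A minimal sketch of what that change enables; Field here is hypothetical, but GroupedMetadata and the base constraint classes are real annotated-types APIs:

from dataclasses import dataclass
from typing import Annotated, Iterator, Optional

import annotated_types as at

# a hypothetical msgspec-style Field following the GroupedMetadata pattern:
# iterating it yields base constraints, so any annotated-types consumer
# (Pydantic, Hypothesis, ...) can expand and interpret it
@dataclass
class Field(at.GroupedMetadata):
    ge: Optional[int] = None
    min_len: Optional[int] = None
    description: Optional[str] = None  # extra metadata; consumers may skip it

    def __iter__(self) -> Iterator[at.BaseMetadata]:
        if self.ge is not None:
            yield at.Ge(self.ge)
        if self.min_len is not None:
            yield at.MinLen(self.min_len)

# no * unpacking needed, so this also works on Python < 3.11
UserId = Annotated[int, Field(ge=0, description="a database id")]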
You could also choose to use the same object for the foo: int = Field(ge=0) syntax (that's what Pydantic does). Hypothesis and the rest of the annotated-types ecosystem would not recognize that spelling, but you'd give users the option to choose their preferred syntax.