Enforcing runtime type validation for Structs (and extending validation to functions)
Description
Hi there!
Would it be possible to add an optional flag that enforces type validation for Structs at runtime? While I understand your rationale for the current behavior (static type checker + msgspec for external data is more performant), not all teams have the same level of maturity in terms of adopting best practices. More importantly, not all packages are 100% type-annotated, and even if they were, unlike in a statically typed language, there are no guarantees that the upstream implementation is actually accurate/adheres to those type hints.
In addition, having an optional feature like this could also allow msgspec to potentially serve as a replacement for something like typeguard by having a function decorator like Pydantic's validate_call.
Having the option for msgspec to serve as a fully-featured, more performant alternative to Pydantic (+ typeguard) would also allow the project to become more popular -- it would certainly encourage me to make the switch if these features were built-in, even under an optional flag somewhere.
Given that msgspec already has all the logic for type validation built in, it feels like a missed opportunity that something relatively easy to add (correct me if I'm wrong here), which could expand the user base/"target market" of the library quite significantly, is not currently supported.
PS: I have also reviewed the only issue I could find with a similar request (which offers a great solution, but it's always nicer when something's built-in).
Thanks for opening this. I think we're at the point now where this is something we could support, but I'd definitely want it to be off by default.
I also think I'd avoid doing any conversion in the runtime type-checked __init__ - we'd match the behavior that mypy would accept but would check it at runtime.
Example user session:
import datetime
import msgspec
# Enable runtime type checking in init (not 100% happy with this keyword name,
# suggestions welcome).
class Example(msgspec.Struct, typecheck_init=True):
name: str
birthday: datetime.date
# This would pass
Example("alice", datetime.date(1990, 1, 2))
# This would error since we don't coerce types like convert, just type check
Example("alice", "1990-01-02")
Would this work for you? Can you comment more on your use case and how msgspec.convert is insufficient for your needs?
Thank you for considering the request! And the example you showed above (a Struct that is validated at initialization) is exactly the workflow I'm looking for.
I wasn't aware of the msgspec.convert function previously. Having read through the docs for it, I suppose it also allows for the same functionality, but I'd still prefer to have a flow like the one in your example, where I can directly call the Struct and pass in the relevant args, instead of first defining the args as a dict and then doing msgspec.convert(args_dict, SomeStruct), as shown in the docs here. Quality-of-life improvements definitely matter.
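For concreteness, here is a minimal sketch contrasting the two flows, reusing the `Example` struct from the session above (the note about coercion reflects `msgspec.convert`'s behaviour; the `typecheck_init` flag is still only a proposal):

```python
import datetime

import msgspec


class Example(msgspec.Struct):
    name: str
    birthday: datetime.date


# Current flow: build a dict first, then validate/convert it into the Struct.
# Note that convert parses "1990-01-02" into a date, whereas the proposed
# typecheck_init flag would only type-check, not coerce.
args = {"name": "alice", "birthday": "1990-01-02"}
example = msgspec.convert(args, type=Example)

# Requested flow: call the Struct directly and have __init__ do the checking.
direct = Example(name="alice", birthday=datetime.date(1990, 1, 2))
```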
Finally, a decorator for enforcing function arguments that automatically creates the corresponding struct and checks that the types match (like validate_call in Pydantic) would also be pretty amazing.
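A very rough sketch of what such a decorator could look like if layered on top of `msgspec.convert`; the name `validate_call`, the signature handling, and all behaviour here are hypothetical illustrations, not an existing msgspec API:

```python
import inspect
from functools import wraps

import msgspec


def validate_call(func):
    """Sketch of a validate_call-style decorator built on msgspec.convert.

    Hypothetical and simplified: return values, *args/**kwargs, and
    string annotations are not handled.
    """
    sig = inspect.signature(func)
    hints = {
        name: param.annotation
        for name, param in sig.parameters.items()
        if param.annotation is not inspect.Parameter.empty
        and param.kind not in (param.VAR_POSITIONAL, param.VAR_KEYWORD)
    }

    @wraps(func)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        bound.apply_defaults()
        for name, value in bound.arguments.items():
            if name in hints:
                # strict=True type-checks without lenient coercion and
                # raises msgspec.ValidationError on mismatch.
                msgspec.convert(value, type=hints[name], strict=True)
        return func(*args, **kwargs)

    return wrapper


@validate_call
def greet(name: str, age: int) -> str:
    return f"{name} is {age}"


greet("alice", 30)    # ok
greet("alice", "30")  # raises msgspec.ValidationError
```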
Makes sense. I think both an option in msgspec.Struct types and a function decorator are valid and useful features we could implement here.
One last question - what are your thoughts on sampling when validating collections (as beartype and typeguard do)? When type checking a collection (list, set, dict, ...), these libraries only validate one element by default (this is applied recursively throughout the type). The argument being that validating a sample is likely enough to catch common mistakes while being O(1) in datasize. I think we'd make this configurable if added, but I'm not sure what the default should be.
Pros of sampling:
- It's much less overhead, making it more reasonable to enable runtime checking
- Users are unlikely to change default settings, so we might want to make the fast option the default
- Most collections are homogeneously typed; if one value has an incorrect type, all values probably have an incorrect type
Cons of sampling:
- It's only a sample and may miss a bug. Users may expect the data to be fully checked, not probabilistically checked. If most errors are caught by the decorator, the user may be unprepared for the rare occasion one slips through.
Given the above, I'm leaning towards checking the complete values by default, but:
- Emphasizing heavily in the documentation that the sampling approach is more efficient and will catch most internal bugs
- Perhaps having the default value configured via an environment variable so users can set `MSGSPEC_VALIDATE=sample` or something to enable sampling instead of the default of complete checking. This could also support defining `MSGSPEC_VALIDATE=0` to disable all runtime validation checking, which may be useful in a production setting (idk). A loose sketch of this idea follows below.
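As a loose illustration of that idea (the `MSGSPEC_VALIDATE` variable and the strategy names are purely hypothetical; msgspec reads no such variable today):

```python
import os

# Hypothetical sketch only: MSGSPEC_VALIDATE is not something msgspec reads
# today, and the strategy names are made up for illustration.
_value = os.environ.get("MSGSPEC_VALIDATE", "complete")

if _value == "0":
    DEFAULT_STRATEGY = None        # runtime validation disabled entirely
elif _value in ("complete", "sample"):
    DEFAULT_STRATEGY = _value      # validate everything vs. spot-check a sample
else:
    raise ValueError(f"Unsupported MSGSPEC_VALIDATE value: {_value!r}")
```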
> One last question - what are your thoughts on sampling when validating collections (as beartype and typeguard do)?
Great question! I hadn't really considered the long tail of validating 1M-long lists when I created the issue. Your question did encourage me to go on a detour through the beartype repo, and it seems that its author has been considering an O(log n) alternative that randomly samples more than one element. I definitely like that idea, since relying on a single sample feels a little finicky to me.
Ultimately, though, it should definitely be something that's configurable from day one (unlike beartype) and left to the user's control. I don't think a validation library that doesn't let you validate (all of) the data makes sense.
Regarding the default setting, I have the following thoughts:
- If I created a JSON file with a 1M-long list, `msgspec` would presumably validate the type of each element in the list when decoding the file? If so, defaulting to sampling would bifurcate the behaviour of `Struct` based on whether we're validating from an `__init__` call or a `.decode` call. Alternatively, if `strategy` controls behaviour for both cases and users set `strategy="sample"` because they mostly expect runtime validation, we leave open the possibility of incorrect values if they do, even occasionally, decode external data. As an aside, the fact that one may want different behaviour while deserializing and as part of an `__init__` call probably requires some thought into what the final UX and docs might look like, to avoid a lot of confusion there.
- In general, I think users mostly expect everything to be validated. beartype's insistence on `O(1)` guarantees definitely came off as very unusual to me. If I was concerned with utmost performance at the cost of correctness, then I probably wouldn't be using a validation library to begin with.
- How often do people actually pass in a 1M-long list as an argument to anything? The only use cases I can think of are dealing with data frames or ML training data, but in both those cases, there are more specialized tools and libraries (`pandas`, `polars`, etc.) that already have built-in validation to ensure the types fit. These tools also have their own serialization/deserialization methods to directly read/write data, so I can't imagine anyone would want to use `msgspec` as part of that workflow.
- One other option, which the author of beartype also mentioned in an issue, is to have some heuristic based on the length of the collection/container. I suppose a very simple fix in this regard would be to check for any `max_length` constraints before doing a potentially very expensive type-check on all elements (though this, of course, doesn't apply to the `.decode` flow because the `len` is not known beforehand). Beyond that, if one could somehow specify the `strategy` based on the `len` of the container, that'd be a nice-to-have, at the cost of making the API more complex. A rough sketch of such a heuristic follows this list.
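Here is a rough sketch of that length-based heuristic (the threshold, sample size, and names are all illustrative assumptions, not msgspec behaviour):

```python
import random


def check_collection(values, check_item, threshold=10_000, sample_size=64):
    """Fully validate small collections; spot-check a random subset of large ones."""
    if len(values) <= threshold:
        for item in values:
            check_item(item)
    else:
        # Large collection: trade completeness for O(sample_size) cost.
        for item in random.sample(list(values), k=sample_size):
            check_item(item)
```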
So, with all that said, I think defaulting to checking all elements in a collection, while also giving the user the ability to choose otherwise, makes the most sense.
Also wanted to mention this video from pydantic's author on the move to Rust for v2.0. In it, he talks about some of the same concerns (validating elements in collections) in passing. Someone also asked if this is currently parallelized and he answered no; parallelizing type-checking for collections could be a decent performance uplift.
+1 for (optional, off by default) validation on __init__. If this gets implemented, a natural next step would be to support assignment validation, something like:
class Example(msgspec.Struct, validate_init=True, validate_assignment=True):
name: str
birthday: datetime.date
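For reference, here is a minimal msgspec-free sketch of what "assignment validation" means, emulated with `__setattr__` on a plain class; this is purely illustrative and only handles simple, non-generic annotations:

```python
import datetime
from typing import get_type_hints


class ValidatedExample:
    """Plain-Python illustration of init + assignment validation."""

    name: str
    birthday: datetime.date

    def __init__(self, name: str, birthday: datetime.date) -> None:
        self.name = name          # goes through __setattr__ below
        self.birthday = birthday

    def __setattr__(self, attr: str, value: object) -> None:
        expected = get_type_hints(type(self)).get(attr)
        if expected is not None and not isinstance(value, expected):
            raise TypeError(
                f"{attr} must be {expected.__name__}, got {type(value).__name__}"
            )
        super().__setattr__(attr, value)


ex = ValidatedExample("alice", datetime.date(1990, 1, 2))  # ok
ex.birthday = "1990-01-02"  # raises TypeError
```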
Any update about this? I'm writing an SDK for a web platform, like this:
async def get_info() -> InfoStruct:
data: dict[str, Any] = await get_data()
return InfoStruct(
name=process_name(data["user_name"])
)
I need to write the data dict -> Struct logic myself, instead of directly defining the struct fields and converting the data to the struct, because:
- The `process_name` function should be used to validate & convert the name field from the API, and `msgspec` does not support custom validate or convert functions right now.
- Users may use the `asdict` function to convert the struct into a Python dict and store it in MongoDB etc., so we need to get rid of the bad API field names and use our own names.
We decided to manually create the InfoStruct object because we get autocompletion as well as type checking there; if we define a dict and convert it to the struct, the DX will be bad.
Without a validate_init option, we have to validate all the structs by converting them to a dict and back, like this:
import msgspec
from msgspec import Struct

class MyStruct(Struct):
    def validate(self) -> None:
        msgspec.convert(msgspec.to_builtins(self), type=type(self))
This is annoying and leads to a performance hit, since we are doing useless work just for validation.
So I'm thinking about a new option like this:
class MyStruct(Struct, validate_init=True):
pass
Just like others referred to a few months ago.
By the way, my workaround about this issue:
from typing import Annotated
from msgspec import Struct, convert, to_builtins, Meta

class MyStruct(Struct):
    def _validate(self) -> "MyStruct":
        # Round-trip through builtins to re-run msgspec's validation.
        return convert(to_builtins(self), type=self.__class__)

class TestStruct(MyStruct):
    test: Annotated[int, Meta(ge=0)]

print("Should be fine:", TestStruct(test=1)._validate())
print("The following one should raise ValidationError")
print(TestStruct(test=-1)._validate())
You can get validation on init by simply adding `._validate()` after your struct creation; it will return the struct back.
Adding `self._validate()` in `__post_init__` won't work because it will create an infinite loop: `convert` constructs a new instance, which runs `__post_init__` again, and so on. See the sketch below for one way around that.
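One possible way around that loop, sketched under the assumption that a simple module-level guard is acceptable (untested here; the class and variable names are illustrative):

```python
import threading

from msgspec import Struct, convert, to_builtins

# Guard so the nested convert() call below doesn't re-trigger validation.
_guard = threading.local()


class AutoValidated(Struct):
    def __post_init__(self) -> None:
        if getattr(_guard, "active", False):
            return  # we're inside the convert() below; skip re-validation
        _guard.active = True
        try:
            convert(to_builtins(self), type=type(self))
        finally:
            _guard.active = False
```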
About performance:
I created a micro-benchmark for this, using a struct with 10 fields, each one an int with Meta(gt=0).
Validating it 100,000 times costs about 110ms on my machine, so it shouldn't be a big problem, but it would still be nice to see this feature become a built-in one.
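For reference, a rough reconstruction of the kind of micro-benchmark described above (the exact struct layout is a guess, and timings will vary by machine):

```python
import time
from typing import Annotated

from msgspec import Meta, Struct, convert, to_builtins

PosInt = Annotated[int, Meta(gt=0)]


class Bench(Struct):
    f0: PosInt
    f1: PosInt
    f2: PosInt
    f3: PosInt
    f4: PosInt
    f5: PosInt
    f6: PosInt
    f7: PosInt
    f8: PosInt
    f9: PosInt


obj = Bench(*range(1, 11))

start = time.perf_counter()
for _ in range(100_000):
    convert(to_builtins(obj), type=Bench)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"100,000 validations took {elapsed_ms:.1f} ms")
```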
* How often do people actually pass in a 1M-long list as an argument to anything? The only use cases I can think of are dealing with data frames or ML training data, but in both those cases, there are more specialized tools and libraries (`pandas`, `polars`, etc.) that already have built-in validation to ensure the types fit. These tools also have their own serialization/deserialization methods to directly read/write data, so I can't imagine anyone would want to use `msgspec` as part of that workflow.
@mjkanji I'm actually doing this. Using custom (tabular) types with a hierarchical data model and custom serialization is one of the use cases for a general serialization framework. Consequently, it should then also allow for data validation. IMO full coverage validation should be part of the serialization and deserialization procedures and bound to the type+constraints specification.
I'd love to have this option as well. I find myself using msgspec across the whole stack, which means I'm building plenty of objects at runtime, and I have been using dictionaries less and less. Having this as an option would ensure type safety basically everywhere and not only at the edges.
Has there been any progress on this?