Enforcing runtime type validation for Structs (and extending validation to functions)

Open mjkanji opened this issue 2 years ago • 12 comments

Description

Hi there!

Would it be possible to add an optional flag that enforces type validation for Structs at runtime? While I understand your rationale for the current behavior (static type checker + msgspec for external data is more performant), not all teams have the same level of maturity in terms of adopting best practices. More importantly, not all packages are 100% type-annotated, and even if they were, unlike in a statically typed language, there are no guarantees that the upstream implementation is actually accurate/adheres to those type hints.

In addition, having an optional feature like this could also allow msgspec to potentially serve as a replacement for something like typeguard by having a function decorator like Pydantic's validate_call.

Having the option for msgspec to serve as a fully-featured, more performant alternative to Pydantic (+ typeguard) would also allow the project to become more popular -- it would certainly encourage me to make the switch if these features were built-in, even under an optional flag somewhere.

Given that msgspec already has all the logic for type validation built-in, it just feels like a missed opportunity that something that would be relatively easy to add (correct me if I'm wrong here) and could expand the user base/"target market" of the library quite significantly is not currently supported.

PS: I have also reviewed the only issue I could find with a similar request (which offers a great solution, but it's always nicer when something's built-in).

mjkanji avatar Aug 11 '23 22:08 mjkanji

Thanks for opening this. I think we're at the point now where this is something we could support, but I'd definitely want it to be off by default.

I also think I'd avoid doing any conversion in the runtime type-checked __init__ - we'd accept the same values that mypy would accept, but check them at runtime.

Example user session:

import datetime

import msgspec


# Enable runtime type checking in init (not 100% happy with this keyword name,
# suggestions welcome).
class Example(msgspec.Struct, typecheck_init=True):
    name: str
    birthday: datetime.date


# This would pass
Example("alice", datetime.date(1990, 1, 2))

# This would error since we don't coerce types like convert, just type check
Example("alice", "1990-01-02")

Would this work for you? Can you comment more on your use case and how msgspec.convert is insufficient for your needs?

jcrist avatar Aug 15 '23 01:08 jcrist

Thank you for considering the request! And the example you showed above (a Struct that is validated at initialization) is exactly the workflow I'm looking for.

I wasn't aware of the msgspec.convert function previously. Having read through the docs for it, I suppose it also allows for the same functionality, but I'd still prefer to have a flow like the one in your example, where I can directly call the Struct and pass in the relevant args, instead of first defining the args as a dict and then doing msgspec.convert(args_dict, SomeStruct), as shown in the docs here. Quality-of-life improvements definitely matter.
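For concreteness, a minimal sketch of the two flows being compared (the User struct and its fields here are just made up for illustration):

import msgspec


class User(msgspec.Struct):
    name: str
    age: int


# The convert-based flow: build a dict first, then validate it against the type.
args = {"name": "alice", "age": 30}
user = msgspec.convert(args, type=User)

# The flow requested here: construct the Struct directly and have __init__
# type-check the arguments (today this succeeds without any checking).
user = User(name="alice", age=30)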

Finally, a decorator for enforcing function arguments that automatically creates the corresponding struct and checks that the types match (like validate_call in Pydantic) would also be pretty amazing.
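For illustration, a rough sketch of what such a decorator could look like if built on top of msgspec.convert (the validate_args name and its implementation are hypothetical, not anything msgspec ships today):

import functools
import inspect

import msgspec


def validate_args(func):
    # Hypothetical sketch, not msgspec API: validate each bound argument
    # against its annotation before calling the function.
    sig = inspect.signature(func)
    hints = {
        name: p.annotation
        for name, p in sig.parameters.items()
        if p.annotation is not inspect.Parameter.empty
    }

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        bound.apply_defaults()
        for name, value in bound.arguments.items():
            if name in hints:
                # Note: convert applies the same coercions as decoding
                # (e.g. str -> datetime.date), so this only approximates the
                # stricter "no coercion" check proposed above.
                bound.arguments[name] = msgspec.convert(value, type=hints[name])
        return func(*bound.args, **bound.kwargs)

    return wrapper


@validate_args
def greet(name: str, times: int = 1) -> str:
    return ", ".join([f"hello {name}"] * times)


greet("alice", 2)    # fine
greet("alice", "2")  # raises msgspec.ValidationError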

mjkanji avatar Aug 16 '23 01:08 mjkanji

Makes sense. I think both an option in msgspec.Struct types and a function decorator are valid and useful features we could implement here.

One last question - what are your thoughts on sampling when validating collections (as beartype and typeguard do)? When type checking a collection (list, set, dict, ...), these libraries only validate one element by default (this is applied recursively throughout the type). The argument being that validating a sample is likely enough to catch common mistakes while being O(1) in datasize. I think we'd make this configurable if added, but I'm not sure what the default should be.

Pros of sampling:

  • It's much less overhead, making enabling runtime checking more reasonable
  • Users are unlikely to change default settings, so we might want to make the fast option the default
  • Most collections are homogeneously typed; if one value has an incorrect type, all values probably have an incorrect type

Cons of sampling:

  • It's only a sample and may miss a bug. Users may expect the data to be fully checked, not probabilistically checked. If most errors are caught by the decorator, the user may be unprepared for the rare occasion one slips through.

Given the above, I'm leaning towards checking the complete values by default, but:

  • Emphasizing heavily in the documentation that the sampling approach is more efficient and will catch most internal bugs
  • Perhaps having the default value configured via an environment variable so users can set MSGSPEC_VALIDATE=sample or something to enable sampling instead of the default of complete checking. This could also support defining MSGSPEC_VALIDATE=0 to disable all runtime validation checking, which may be useful in a production setting (idk).
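As a rough sketch of that environment-variable idea (nothing here is implemented; the variable name and values are just the ones floated above):

import os

# Hypothetical: read the proposed MSGSPEC_VALIDATE setting once at import time.
#   "complete" -> check every element of every collection (the proposed default)
#   "sample"   -> check only a sample of each collection
#   "0"        -> skip runtime validation entirely (e.g. in production)
_RUNTIME_VALIDATION = os.environ.get("MSGSPEC_VALIDATE", "complete").lower()
if _RUNTIME_VALIDATION not in ("complete", "sample", "0"):
    raise ValueError(f"unsupported MSGSPEC_VALIDATE value: {_RUNTIME_VALIDATION!r}")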

jcrist avatar Aug 16 '23 04:08 jcrist

One last question - what are your thoughts on sampling when validating collections (as beartype and typeguard do)?

Great question! I hadn't really considered the long tail of validating 1M-long lists when I created the issue. Your question did encourage me to go on a detour through the beartype repo, and it seems that its author has been considering an O(log n) alternative, which randomly samples more than one element. I definitely like that idea, since relying on a single sample feels a little finicky to me.

Ultimately, though, it should definitely be something that's configurable from day one (unlike beartype) and left to the user's control. I don't think a validation library that doesn't let you validate (all) the data makes sense.

Regarding the default setting, I have the following thoughts:

  • If I created a JSON file with a 1M-long list, msgspec would presumably validate the type of each element in the list when decoding the file? If so, defaulting to sampling would bifurcate the behaviour of Struct based on whether we're validating from an __init__ call or a .decode call. Alternatively, if strategy controlled behaviour for both cases and a user set strategy="sample" because they mostly expect runtime validation, we'd leave open the possibility of incorrect values if they do, even occasionally, decode external data. As an aside, the fact that one may want different behaviour while deserializing versus as part of an __init__ call probably requires some thought about what the final UX and docs might look like to avoid a lot of confusion there.

  • In general, I think users mostly expect everything to be validated. beartype's insistence on O(1) guarantees definitely came off as very unusual to me. If I were concerned with utmost performance at the cost of correctness, then I probably wouldn't be using a validation library to begin with.

  • How often do people actually pass in a 1M-long list as an argument to anything? The only use cases I can think of are dealing with data frames or ML training data, but in both those cases, there are more specialized tools and libraries (pandas, polars, etc.) that already have built-in validation to ensure the types fit. These tools also have their own serialization/deserialization methods to directly read/write data, so I can't imagine anyone would want to use msgspec as part of that workflow.

  • One other option, which the author of beartype also mentioned in an issue, is to have some heuristic based on the length of the collection/container. I suppose a very simple fix in this regard would be to check for any max_length constraints before doing a potentially very expensive type-check on all elements (though this, of course, doesn't apply to the .decode flow because the len is not known beforehand). Beyond that, if one could somehow specify the strategy based on the len of the container, that'd be a nice-to-have, at the cost of making the API more complex (a toy sketch of such a heuristic is at the end of this comment).

So, with all that said, I think defaulting to checking all elements in a collection, while also giving the user the ability to choose otherwise, makes the most sense.
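To make that length-based heuristic concrete, a toy sketch (none of this is msgspec API; the cutoff and sample size are invented):

import random

SAMPLE_THRESHOLD = 10_000  # hypothetical cutoff for switching to sampling


def check_elements(values: list, check_one) -> None:
    """Validate all elements of small lists; sample large ones."""
    if len(values) <= SAMPLE_THRESHOLD:
        for v in values:
            check_one(v)
    else:
        # Roughly O(log n) random sampling, in the spirit of the beartype
        # issue mentioned above.
        k = max(1, len(values).bit_length())
        for v in random.sample(values, k=k):
            check_one(v)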

mjkanji avatar Aug 16 '23 06:08 mjkanji

Also wanted to mention this video from pydantic's author on the move to Rust for v2.0. In it, he talks about some of the same concerns (validating elements in collections) in passing. Someone also asked if this is currently parallelized and he answered no; that could be a decent performance uplift, if type-checking for collections was parallelized.

mjkanji avatar Aug 16 '23 06:08 mjkanji

+1 for (optional, off by default) validation on __init__. If this gets implemented, a natural next step would be to support assignment validation, something like:

class Example(msgspec.Struct, validate_init=True, validate_assignment=True):
    name: str
    birthday: datetime.date
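If implemented, the intended semantics would presumably be something like this (hypothetical; neither keyword exists today):

ex = Example("alice", datetime.date(1990, 1, 2))

ex.name = "bob"             # ok: the value matches the `str` annotation
ex.birthday = "1990-01-02"  # would raise a ValidationError under validate_assignment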

gsakkis avatar Aug 20 '23 12:08 gsakkis

Any update on this? I'm writing an SDK for a web platform, like this:

from typing import Any


async def get_info() -> InfoStruct:
    # InfoStruct, get_data, and process_name are defined elsewhere in the SDK
    data: dict[str, Any] = await get_data()

    return InfoStruct(
        name=process_name(data["user_name"])
    )

I need to write the data dict -> Struct mapping logic by hand instead of defining the struct with the API's field names and converting the data into the struct directly, because:

  1. The process_name function is needed to validate and convert the name field coming from the API, and msgspec does not currently support custom validate or convert functions.

  2. Users may use the asdict function to convert the struct into a Python dict and store it in MongoDB etc., so we should get rid of the awkward API field names and use our own names.

We decided to manually create the InfoStruct object because we get autocomplete and type checking that way; if we defined a dict and converted it to the struct, the DX would be bad.

Without a validate_init option, we have to validate the structs by converting them to a dict and back, like this:

import msgspec

class MyStruct(msgspec.Struct):
    def validate(self) -> None:
        msgspec.convert(msgspec.to_builtins(self), type=type(self))

This is annoying and leads to some performance issues, since we're doing useless work just for validation.

So I'm thinking about a new option like this:

class MyStruct(Struct, validate_init=True):
    pass

Just like others referred to a few months ago.

By the way, my workaround for this issue:

from typing import Annotated
from msgspec import Struct, convert, to_builtins, Meta

class MyStruct(Struct):
    def _validate(self) -> "MyStruct":
        return convert(to_builtins(self), type=self.__class__)

class TestStruct(MyStruct):
    test: Annotated[int, Meta(ge=0)]

print("Should be fine:", TestStruct(test=1)._validate())

print("The following one should raise ValidationError")
print(TestStruct(test=-1)._validate())

You can get validation on init by simply adding ._validate() after your struct creation; it will return the struct back.

Adding self._validate() in __post_init__ won't work because it will create an infinite loop.

About performance:

I created a micro benchmark for this, using a struct with 10 fields, each an int with Meta(gt=0).

Validating it 100,000 times took about 110ms on my machine, so it shouldn't be a big problem, but it would still be nice to see this feature become a built-in one.
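For reference, a rough reconstruction of what that benchmark could look like (the field names, values, and exact setup are guesses, not the original script):

import timeit
from typing import Annotated

from msgspec import Meta, Struct, convert, to_builtins

PositiveInt = Annotated[int, Meta(gt=0)]


class Bench(Struct):
    # 10 fields, each a positive int, mirroring the setup described above
    f0: PositiveInt
    f1: PositiveInt
    f2: PositiveInt
    f3: PositiveInt
    f4: PositiveInt
    f5: PositiveInt
    f6: PositiveInt
    f7: PositiveInt
    f8: PositiveInt
    f9: PositiveInt


obj = Bench(*range(1, 11))

# Round-trip the struct through builtins and back 100,000 times, i.e. the
# same work the _validate() workaround above performs.
print(timeit.timeit(lambda: convert(to_builtins(obj), type=Bench), number=100_000))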

FHU-yezi avatar Dec 13 '23 13:12 FHU-yezi

* How often do people actually pass in a 1M-long list as an argument to anything? The only use cases I can think of are dealing with data frames or ML training data, but in both those cases, there are more specialized tools and libraries (`pandas`, `polars`, etc.) that already have built-in validation to ensure the types fit. These tools also have their own serialization/deserialization methods to directly read/write data, so I can't imagine anyone would want to use `msgspec` as part of that workflow.

@mjkanji I'm actually doing this. Using custom (tabular) types with a hierarchical data model and custom serialization is one of the use cases for a general serialization framework. Consequently, it should then also allow for data validation. IMO full coverage validation should be part of the serialization and deserialization procedures and bound to the type+constraints specification.

fungs avatar Dec 15 '23 10:12 fungs

I'd love to have this option as well. I find myself using msgspec across the whole stack, which means I'm building plenty of objects at runtime, and have been using dictionaries less and less. Having this as an option would ensure type safety basically everywhere and not only at the edges.

ccrvlh avatar Feb 17 '24 05:02 ccrvlh

Has there been any progress on this?

ofek avatar Jul 17 '24 15:07 ofek