Some way to extract unknown fields

Open DeadWisdom opened this issue 2 years ago • 5 comments

Description

My use-case is that I am receiving a bunch of JSON documents. I want to parse, validate, possibly make some changes, and pass them downstream. I don't know all of the data; some of it is opaque. I probably just want to treat the extra fields as msgspec.Raw for this, but maybe bring them in as builtins in a similar case.

The best way I can figure to do this now is to decode the same JSON document twice: once with my struct (dict=True), and then again with dict[str, msgspec.Raw], then copy over the fields that are not in my struct. That is kind of awkward and seems redundant.

For background, these are ActivityPub messages (JSON-LD documents). They have a lot of fields, and extensions might add arbitrary extra ones.

Thanks for the consideration.

DeadWisdom avatar Jul 25 '23 04:07 DeadWisdom

Thanks for opening this! This is the first really motivating use case I've seen for storing unknown fields on structs. I think something like this would be a good feature to add to msgspec.

Here's a tentative design:

By default structs ignore unknown fields unless the user configures forbid_unknown_fields=True on a struct definition. This is how things work today.

To add support for storing unknown fields for later consumption/reserialization, we'd add a new way to mark a single field on a struct as intended for storing any unknown fields. There are a few ways this could be spelled (still waffling about this), but it'll likely be like one of the following:

class Option1(msgspec.Struct):
    x: int
    y: float
    extra: msgspec.ExtraFieldsDict = {}    # using an annotated type alias for dict[str, Any]?

class Option2(msgspec.Struct):
    x: int
    y: float
    extra: msgspec.ExtraFields[dict[str, Any]] = {}    # using an annotated type wrapper?

class Option3(msgspec.Struct):
    x: int
    y: float
    extra: dict[str, Any] = msgspec.field(default={}, collect_extra_fields=True)    # using a field kwarg?

Right now I'm leaning towards option 1, but they all have their own pros/cons. In any case:

  • The python attribute name for storing extra unknown fields is configurable. Here we use extra, but since it's reusing the existing type annotation mechanism you can name it whatever you want.
  • If we go with option 2 or 3, the corresponding type must be a dict[str, T] where T can be any type. An error would be raised if this wasn't correct. Option 1 would have this be an annotated alias to dict[str, T], so there's no way for the user to mess this up. To decode field values into Raw instead of Any you'd specify ExtraFieldsDict[Raw] (using the option 1 syntax).
  • You likely want to specify a default value of {} for this field, but this isn't necessarily required.
  • Any struct may have at most one field designated for collecting unknown fields.

The execution semantics would be:

# to python code `extra` is like any other field. It can be passed to the `__init__`,
# accessed as an attribute, etc...
# This makes it play well with existing dev tools like mypy/pyright, and should
# hopefully match users' intuition
a = Option1(x=1, y=2.5)    # extra defaults to its default of {}
b = Option1(x=1, y=2.5, extra={"unknown": 3, "fields": 4})

# The `extra` field is just like any other field to python, supporting attribute access
b.extra
#> {"unknown": 3, "fields": 4}

# Unlike pydantic's `__pydantic_extra__` mechanism, unknown kwargs passed to `__init__`
# are treated as an error (not forwarded to `extra`). Likewise unknown attributes are
# also an error (also not forwarded to `extra`). This plays well with existing type checkers.
# Reiterating - to python code `extra` is like any other field. It's only different when
# encoding/decoding.

# When serialized, the fields in `extra` are flattened into the containing object.
msgspec.json.encode(a)
#> b'{"x":1,"y":2.5}'
msgspec.json.encode(b)
#> b'{"x":1,"y":2.5,"unknown":3,"fields":4}'

# When decoding, any unknown fields will be stored in `extra`
msgspec.json.decode(b'{"x": 1, "y": 2.5, "unknown": 3}', type=Option1)
#> Option1(x=1, y=2.5, extra={"unknown": 3})

Given this design, #199 would be a precursor to implement this (since the implementation mechanism would be the same).

Thoughts?

jcrist avatar Jul 26 '23 04:07 jcrist

Yeah, this is great!

Of your options, I think extra: msgspec.ExtraFieldsDict[T] = {} makes the most sense to me, if keys really can only be strings. I'm not sure a default really makes sense; it should just get an empty dict.

DeadWisdom avatar Jul 27 '23 03:07 DeadWisdom

While waiting for this, I happened upon a pretty interesting pattern that I thought I would share.

I've created a basic, underlying, unparsed "Document" class that looks roughly like this:

from typing import Type, TypeVar

import msgspec

T = TypeVar("T")

class Document:
    source: bytes
    id: str

    def __init__(self, source: bytes):
        self.source = source
        self.id = self.decode(Head).id  # Head is defined below

    def decode(self, t: Type[T]) -> T:
        # return this source decoded as the given "Facet"
        return msgspec.json.decode(self.source, type=t)

    def with_changes(self, changes: dict | None = None) -> 'Document':
        ... # return a new document with the given property updates

Now we create a bunch of Structs:

class Head(Struct, ...):
    id: str
    type: str

class Node(Head, ...):
    name: str | None = None
    summary: str | None = None
    owner: Head | None = None

class Activity(Node, ...):
    actor: Head
    object: Head | None = None

class Actor(Node, ...):
    account: str
    ...

Now, when we have a Document and need to do something with it, we quickly decode it to just the read-only representation that we need. For instance, to get its type, we decode it to a Head. Downstream we might need it as a Node or an Activity, so we simply decode it to that representation at the time of need. It is naturally polymorphic: if we tried to decode an Activity and the Document had no actor, we would get a decode error, because the document doesn't support that representation.

This is a "Facet Pattern", one of the lesser-known design patterns out there. It is normally used in security contexts, since it reduces the interface to a minimal representation for the given context. Here we use it to great effect because msgspec is so fast. For my purposes, I realized it's actually faster to decode as needed rather than decoding a giant robust object ahead of time.

Further, the objects are effectively read-only: you change the document by requesting a copy with changes, which is naturally functional and safe. This again leverages the fact that in many cases it's actually faster to decode and re-encode with msgspec than to build methods/processes that change the objects in place.

Of course we can also apply memoization, though at this point I wouldn't be surprised if msgspec encoding/decoding ended up being faster than a cache lookup. 🤣

What I really like about this pattern is that we get to have a Document blob that can take any value AND well-defined, minimal schemas. It's like the gradual typing of data storage. And the backwards compatibility / organic schema evolution ramifications of it are tremendous. Hey clients, you want "application/vnd.com.example.user.v2+json", and you want "application/vnd.com.example.user+json"? No problem!

So, I guess, now I don't need this feature. Thanks!

DeadWisdom avatar Aug 03 '23 21:08 DeadWisdom

I think this feature is still relevant. For example, Kubernetes CRDs allow keeping unknown attributes using x-kubernetes-preserve-unknown-fields: true. So for a msgspec struct to be able to handle custom resources, msgspec would need to implement this feature.

The same would be nice for defining a schema to read/write unknown k8s resources. There is a set of attributes that are always there (like apiVersion, metadata, kind), but others may differ depending on the resource.

gtsystem avatar Oct 31 '23 10:10 gtsystem

This would fit nicely with JSON Schema's additionalProperties.

And BTW, patternProperties could be similarly supported:

class SchemaClass(msgspec.Struct):
    countries: Annotated[Mapping[str, CountryDetails], msgspec.PropertyPattern(r'^\w\w$')] = {}

countries would be a field in the model that doesn't exist in the serialized documents; instead, it would catch all properties of the owning object that match the pattern.

rafalkrupinski avatar Feb 03 '24 20:02 rafalkrupinski