desert icon indicating copy to clipboard operation
desert copied to clipboard

Explicit type specification for unions

Open altendky opened this issue 5 years ago • 54 comments

Sometimes it would be good to store a type identifier in union fields. This would be especially true for sparsely populated data and unions containing str (described below).

Consider:

import typing

import attr
import desert


@attr.dataclass
class A:
    color: str


@attr.dataclass
class B:
    color: str


@attr.dataclass
class C:
    things: typing.List[typing.Union[A, B]]

We then create some instances:

instances = C(
    things=[
        A(color='red'),
        A(color='green'),
        B(color='blue'),
    ],
)

Theoretically that would serialize to (it doesn't seem to so there will presumably be another issue report):

{"things": [{"color": "red"}, {"color": "green"}, {"color": "blue"}]}

Which leaves us wondering what any of the things are.

On the flip side, if we take an example in the tests where there is a field t.Union[int, str] then adjust it to include a custom class after str such as t.Union[int, str, MyClass] then str will 'always' work and MyClass instances will be serialized to a str such as 'MyClass()' (well, for an attrs class with such a repr() anyways).

tl;dr, marshmallow-union has some pretty significant limitations. I got myself confused and started creating a solution over in marshmallow-polyfield https://github.com/Bachmann1234/marshmallow-polyfield/pull/34 which would introduce an extra layer to the serialization and include a string to designate the type of that element. No more guessing.

{
    "things": [
        {
            "type": "A",
            "value": {
                "color": "red",
            }
        },
        {
            "type": "A",
            "value": {
                "color": "green",
            }
        },
        {
            "type": "B",
            "value": {
                "color": "blue",
            }
        }
    ]
}

Perhaps desert is interested in using marshmallow-polyfield for that feature? Perhaps desert would rather the feature be added to marshmallow-union?

In the mean time I'll try to find time to work my way through reporting and possibly fixing some other issues.

altendky avatar Jan 20 '20 04:01 altendky

Polyfield looks interesting. I think it'd be worthwhile to make sure we're not duplicating ideas that may have already been well worked out in, say, marshmallow-jsonschema.

python-desert avatar Jan 20 '20 05:01 python-desert

To get it on the record and out of my tab list...

https://github.com/OAI/OpenAPI-Specification/issues/57#issuecomment-56224153

altendky avatar Jan 23 '20 20:01 altendky

OpenAPI 3.0 uses an extended subset of JSON Schema Specification Wright Draft 00 (aka Draft 5) to describe the data formats. “Extended subset” means that some keywords are supported and some are not, some keywords have slightly different usage than in JSON Schema, and additional keywords are introduced.

https://swagger.io/docs/specification/data-models/keywords/

python-desert avatar Jan 23 '20 21:01 python-desert

As I think about looking at my own files I do really like the lack of an extra layer and the explicit type specification of every item. So I'm thinking that perhaps both options would be good. Of course, two ways has plenty of cost so...

I suspect the wrapper-layer approach is only going to be of significant interest to people that do not want types indicated in the serialized data but that also want functional unions. The wrapper layer is the least intrusive to the serialized form. This is a property of a field. Just union fields? Maybe, I haven't thought about any others. The identifer/type relationship is local.

The integrated form is nice when you do it everywhere. It is only one extra line per object and no indentation. This is more of a property of the class to be serialized. The identifier/type relationship is more 'global'. It doesn't have to literally be global but it isn't part of a field or a class.

I think the two may be able to cleanly coexist not only as features in desert but even in single serialized data blobs. I suppose there could be some silliness if you used different names for the two methods.

{
    "type": "MyClass",
    "value": {
        "_type": "color_stuff",
        "red": 0,
        "green": 1,
        "blue": 0.5
    }
}

So... meh. Maybe if all classes to be serialized in the union are internally typed then the union automatically drops the external typing layer? Too implicit? Maybe there are hazards in some scenarios such as serializing with one set of classes then adding a new option that isn't internally typed and the layers are expected but not present? The developer must choose but we give warnings or otherwise provide a means of testing that if their choice isn't obviously sensible?

There's also the matter of subclasses that I put in the WIP list over in polyfield but haven't thought through yet. IIRC, the types registered with the union are expecting to dict-lookup match the types of the actual objects.

No conclusions here I suppose, just some thoughts to revisit later.

altendky avatar Jan 25 '20 03:01 altendky

Where's the indication that 1 is a float? Do we only want that if the defining schema has a union field for "green" (which it presumably doesn't here)?

python-desert avatar Jan 26 '20 01:01 python-desert

Correct, no union for green. It was meant to just be a float.

The indication is in the schema (unless marshmallow does 1.0, I don't know). For people that want the integrated type form for native JSON (or more general) types, we could provide fields that have integrated types and a way to trigger them. Or I suppose the same for wrapped-layer.

For reference, I use decimal.Decimals as strings for all my 'floats'.

altendky avatar Jan 26 '20 01:01 altendky

If we are assuming the loader's schema is the same as the dumpers, and key ordering is canonicalized or maintained, there could be a List[int] as a top-level value where the ith element of the list gives the index of the Union[...] entry of the ith union-typed value in the data.

For example,

{
   "data": {"things": [{"color": "red"}, {"color": "green"}, {"color": "blue"}]},
   "union_indexes": [0,0,1],
   "schema_hash": "01abef8" # or at least "schema_name"
}

python-desert avatar Jan 27 '20 00:01 python-desert

Hi all, I'm new to this project 👋Very interested in seeing this issue (and #89) get resolved. I've tried both dataclass-json and marshmallow-dataclass and gotten pretty disappointed, so I'm hoping desert can be the production-grade dataclass serialization library I need for the projects I maintain. Thank you for all the hard work you all have put into this library 🙏

I happened to come across a good writeup of the various options for explicit type specification for unions, courtesy of the Serde library that handles serialization in the Rust programming language: https://serde.rs/enum-representations.html

I think supporting a choice of explicit representation (and if possible, compatibility with Serde for cross-language applications) would really set this library apart from the competition, and if that's the direction chosen by the maintainers, I'd love to be part of that journey.

obi1kenobi avatar Mar 09 '20 20:03 obi1kenobi

Hey @obi1kenobi, welcome!

That Serde link gives a great rundown of our options. @altendky's polyfield link above looks like what Serde calls "adjacently tagged", and marshmallow-union looks like the "untagged" option.

I'm +1 on providing built-in support for all four options discussed there.

python-desert avatar Mar 09 '20 22:03 python-desert

So let's see if my original example can still be used to show the four options here.

Class Definitions

import typing

import attr
import desert


@attr.dataclass
class A:
    color: str


@attr.dataclass
class B:
    color: str


@attr.dataclass
class C:
    things: typing.List[typing.Union[A, B]]

Sample Data

instances = C(
    things=[
        A(color='red'),
        A(color='green'),
        B(color='blue'),
    ],
)

I'll use some 'non-standard' formatting and hopefully it'll end up more readable including considerations of fitting on the screen. I am also only applying it to the members of the attribute list. I think that may be an orthogonal question. 1) How to tag when being explicit and 2) whether to tag even when there isn't any ambiguity.

Externally Tagged

{
    "things": [
        {"A": {"color": "red"}},
        {"A": {"color": "green"}},
        {"B": {"color": "blue"}}
    ]
}

Internally Tagged

{
    "things": [
        {"type": "A", "color": "red"},
        {"type": "A", "color": "green"},
        {"type": "B", "color": "blue"}
    ]
}

Adjacently Tagged

{
    "things": [
        {"t": "A", "c": {"color": "red"}},
        {"t": "A", "c": {"color": "green"}},
        {"t": "B", "c": {"color": "blue"}}
    ]
}

Untagged

{
    "things": [
        {"color": "red"},
        {"color": "green"},
        {"color": "blue"}
    ]
}

If something looks wrong I think it will be worth updating in place here so just let me know and I'll tweak it.

Are the specific strings "type", "t", and "c" significant? For the internally tagged it seems maybe useful to default to something that can't be an attribute (normally anyways) like "@type" or whatever. For adjacent tagging I would have (well did :]) use full words.

As noted in the provided link (thanks), internal tagging doesn't apply to things serialized as JSON arrays rather than objects. If we wanted we could make the first element be the tag I suppose.

Side note, should we talk more about marshalling to Python dict/list/int/float/str/bool/None rather than serializing to JSON object/array/number/true/false/null? If we go big (instead of going home) and try to allow for multiple backends etc this seems maybe more relevant? I dunno, I don't have this big architecture plan in my head yet. Going big might also suggest it would be useful to implement the disambiguation in desert itself rather than delegating to a new implementation specific to each backend. Maybe...

altendky avatar Mar 10 '20 02:03 altendky

@altendky

Are the specific strings "type", "t", and "c" significant?

I'm guessing they come from the #[serde(tag = "t", content = "c")] declaration just above the enum.

python-desert avatar Mar 10 '20 02:03 python-desert

@altendky

Side note, should we talk more about marshalling to Python dict/list/int/float/str/bool/None rather than serializing to JSON object/array/number/true/false/null?

I think that's what we're doing already, since schema.dump(obj) returns a dict (or other appropriate type). Or do you mean something else?

python-desert avatar Mar 10 '20 02:03 python-desert

I may just mean that I've barely used desert yet, yeah. :| Though the terminology question still stands albeit only at 8 search hits to 1 for serialize vs. marshal.

altendky avatar Mar 10 '20 02:03 altendky

@altendky

  1. whether to tag even when there isn't any ambiguity.

Edit: moved this idea into a new issue.

python-desert avatar Mar 10 '20 02:03 python-desert

Will it be easiest for us to use polyfield as a dependency to get this feature or should we do it independently?

python-desert avatar Mar 10 '20 03:03 python-desert

After a brief look, I think polyfield is focused on unioning over schemas, not other types. So while that'd work fine in the example above, it'd be awkward in the case of int, float, or other non-struct types.

python-desert avatar Mar 10 '20 04:03 python-desert

If I'm thinking about this right we need

class AdjacentlyTaggedUnion(marshmallow.fields.Field): ...
class ExternallyTaggedUnion(marshmallow.fields.Field): ...
class InternallyTaggedUnion(marshmallow.fields.Field): ...
class UntaggedUnion(marshmallow.fields.Field): ...

and to pick a default for what typing.Union should go to. Since "internally tagged" doesn't work in arrays, "adjacent" and "external" ought to have higher development priority.

python-desert avatar Mar 10 '20 04:03 python-desert

I think this is a feature which is missing from desert, so I really like where this is going, but I can't help thinking that this is blurring the lines between desert and marshmallow. Or to put it another way: Shouldn't really these fields belong in marshmallow or a variant thereof? My impression is that desert exists as a shim on top of marshmallow to allow for dataclass and attr, but rely on marshmallow for the actual serialization. Won't putting type fields on unions into desert make desert and marshmallow incompatible? What is the intended relationship between desert and marshmallow?

sveinse avatar Mar 10 '20 09:03 sveinse

I think this is a feature which is missing from desert, so I really like where this is going, but I can't help thinking that this is blurring the lines between desert and marshmallow. Or to put it another way: Shouldn't really these fields belong in marshmallow or a variant thereof? My impression is that desert exists as a shim on top of marshmallow to allow for dataclass and attr, but rely on marshmallow for the actual serialization. Won't putting type fields on unions into desert make desert and marshmallow incompatible? What is the intended relationship between desert and marshmallow?

I'm agnostic to the final decision on where this code should live — I'd be happy with anything that makes for working code. I just wanted to share this link, where as of about a year ago it seems to say that marshmallow has no plans to introduce a union field of any kind in the near future: https://github.com/marshmallow-code/marshmallow/issues/1191

I think @python-desert's suggested approach would produce field types that are compatible with marshmallow, and in principle could even be contributed upstream (or extracted into a separate marshmallow extension package) if everyone involved is amenable to that. So I think the first priority is getting to a working and stable implementation, since the paths from there onward don't seem to be too bad.

If I'm thinking about this right we need

class AdjacentlyTaggedUnion(marshmallow.fields.Field): ...
class ExternallyTaggedUnion(marshmallow.fields.Field): ...
class InternallyTaggedUnion(marshmallow.fields.Field): ...
class UntaggedUnion(marshmallow.fields.Field): ...

and to pick a default for what typing.Union should go to. Since "internally tagged" doesn't work in arrays, "adjacent" and "external" ought to have higher development priority.

This looks great to me, and I've given the choice of default some thought. I'd like to make a case for adjacently-tagged to be the default.

I think the default choice should always work, with minimum tweaking of configuration and other "if"s and "but"s.

  • This rules out untagged, since it's impossible to differentiate the string "123" from the string serialization of Decimal(123).
  • It also rules out interally-tagged, because we have to name the internal key that holds the name of the type — picking anything, including something like @type, means that we're betting that our users can't construct an object that has a value with that as its key, and that's not a bet I'd be willing to make (see this link for a situation where someone bet that users can't make a certain value occur, and lost).
  • That leaves adjacently-tagged and externally-tagged. Both of these are good options, and externally-tagged is Serde's default so it has that going for it as well. But externally-tagged unions have to cope with validating the serialized input in one additional way compared to adjacently-tagged — to reuse the example setup from above:
{
    "things": [
        {
            "A": {"color": "green"}, 
            "B": {"color": "blue"}
        },
    ]
}

The object inside things is invalid: it is supposed to be a single union field, but holds two union variants, an A and a B, within. This input wasn't generated by valid code, but people deserialize invalid things all the time — they might have been erroneously hand-written, or generated by another library. It's impossible for the deserializer to figure out which of the two variants should be the output here, and I'd argue this needs to result in an exception rather than a "best-effort" parse of some sort.

This situation isn't possible with adjacently-tagged unions. There, the "type" key can only have one value, and the "value" key only holds the data for one value. While there may be additional, unexpected keys in the dict, this is also true of externally-tagged unions, so the same error-handling code has to exist in both cases. But the "multiple-variants" error case is impossible by design, and I'd argue that's a win we'd like for our default choice.

So long as we pick an always-safe choice for the default, and allow users to consciously change that choice (if the new choice is also safe for their dataclasses) to get advantages like more compact representation, I think that will result in happy users. My team and I definitely will be happy 😄

obi1kenobi avatar Mar 10 '20 14:03 obi1kenobi

It seemed that if I finished it up that Bachmann1234/marshmallow-polyfield#34, which added adjacently tagged as an explicit union option, would get accepted. I would expect they would accept the rest. adamboche/python-marshmallow-union#27 refers over to marshmallow-polyfield. Or we could make our own third library for this... I have some hesitation if there are plans to support several backends that the number of libraries would grow annoying to tack on the features we need for each backend, but I may well be looking way too far ahead. And yeah, the code could be developed here in a PR, compared with marshmallow-polyfield, and then a decision made where the code should actually live. It would probably be nice to just work in one repo while developing unless we think that the probable final destination is polyfield. I'll have to look back at that, I don't really remember the structure over there but iirc I would have done it a bit differently if working from scratch.

altendky avatar Mar 10 '20 15:03 altendky

@sveinse

Shouldn't really these fields belong in marshmallow or a variant thereof?

It could make sense to put these changes in a more narrowly marshmallow-focused library. marshmallow-union could be a good fit. Though as @altendky says, working in one library is easier than coordinating changes across multiple (at least while we're still settling an API).

Won't putting type fields on unions into desert make desert and marshmallow incompatible?

Nothing particularly jumps out at me. What's the incompatibility you're thinking of?

python-desert avatar Mar 10 '20 19:03 python-desert

Let's talk interface.

thing: Union[A, B]

will be treated like

thing: Union[A, B] = AdjacentlyTaggedUnion(...)

What's the signature?

Edit: removed comments that didn't make sense.

python-desert avatar Mar 10 '20 20:03 python-desert

So yeah, there's the super flexible 'callable that decides how to deserialize' but that feels possibly excessively flexible to me. Though perhaps it is just a case of giving a default implementation that can be passed that doesn't involve that flexibility. I suspect that most of the time the user should have a registry of classes and their tags that are used to go back and forth between the two. That could be specific to the particular union but I would think that often it could just be 'global'. In graham I specify that in the class decorator. The registry could even allow for colliding names and only have an issue if the colliding classes end up used in an ambiguous place. Might want to have that be an opt-in loosening of the rules and error by default though.

So signature? Let's see if refreshing myself as to what I did over in polyfield ends up creating a useful summary for others. You need (with nominal defaults as reference):

  • class -> schema
    • desert.schema(cls)
  • class (or schema) -> tag
    • cls.__name__
  • tag -> class (or schema)
    • reverse above...
  • schema -> class
    • reverse above...

I reduced this to a class to schema mapping and a class to tag mapping that acted as an override for the default cls.__name__ behavior. Perhaps this 'override' ought to be brought out and handled separately. Another function that just returns a default class to name dict that you can .update(). Or maybe this is just another argument for an explicit registry class. I'll note that I did make the adjacent tagged schema a parameter as well so that it was somewhat configurable. Though I think not starting with that would be good. Once we have separate implementations we can identify overlap and ways to avoid repetition.

So if there's a registry, a parameter with it would simply be the signature for the fields? Or maybe leave it as a class-to-tag and a tag-to-class callable where normally people would specify my_registry.to_tag and my_registry.to_class. Except normally they wouldn't even do that and would just accept some global state holding a global registry? Or am I getting too evil here... either way, having the global registry is just a matter of registry = desert.Registry() and defaulting to it or whatever so it doesn't make a huge difference to design of the rest I don't think.

altendky avatar Mar 11 '20 02:03 altendky

There may be some room for reasonable defaults and avoiding registries (especially global ones) in many common cases. While classes' __name__ values may collide, I suspect this is rather rare — and definitely something that can be checked for at schema construction time. So perhaps by default we could check the union variants and if their names are distinct, simply use them? And obviously allow (and if conflicting names are present, require) the user to provide their own variant-to-name mapping.

I have a mild preference for using dict() rather than callables to map (e.g.) between variants and tags, for the sake of easier validation and testing. Inverting a dict is relatively straightforward and easily checkable at schema construction time, while inverting general functions programmatically is beyond my skill level :)

obi1kenobi avatar Mar 11 '20 02:03 obi1kenobi

As I mentioned above, I think we want to diverge from polyfield in that all of those "schema"s in @altendky's bullet list should be fields for us. That way we can support non-dataclass union entries.

Do we just trust that python type<->field, python type <-> tag, field <-> tag are all one-to-one correspondences (bijective)? If they are, then it doesn't matter whether the tag comes from the python type or the field, but otherwise maybe it does. There are a bunch of "what if there are two _ for a given _" cases which I haven't thought through. Is it the type's job to determine the tag or the field's job?

python-desert avatar Mar 11 '20 02:03 python-desert

If my understanding of the bijective correspondences question is correct, it seems to me that it may be possible to locally (i.e. on a per-union-field basis) at schema construction time verify the bijective property for the default (type-provided) mappings, or request that the user provide a custom (i.e. field-provided + also verified) one otherwise.

Does that sound reasonable, or am I missing some subtlety in the question? I admit I'm a newbie when it comes to marshmallow and especially desert — apologies for my ignorance.

obi1kenobi avatar Mar 11 '20 03:03 obi1kenobi

Suppose we're doing this for marshmallow independent of the dataclass logic. Do we allow

AdjacentlyTaggedUnion([AField(), BField()], ...)

or only

AdjacentlyTaggedUnion({'A': AField(), 'B': BField()}, ...)

python-desert avatar Mar 11 '20 03:03 python-desert

Eh let's start with the latter, we can always add more stuff if we want to.

python-desert avatar Mar 11 '20 03:03 python-desert

A concrete implementation of AdjacentlyTaggedUnion will probably give us more to talk about. Anybody interested in taking a crack at it?

python-desert avatar Mar 11 '20 04:03 python-desert

So we do have one over in the polyfield PR, but yes, I was considering needing some code to help me think. If someone else wants to take a stab, go for it. I don't know offhand if it will be hours or days or... until I get to it. Otherwise, yeah, I'll probably get to it at some point.

As to the registry and having a default, a piece of that idea is keeping all of the control in one place. If you can pass in a thing that has some logic and then the union field also provides some default logic then you have to go two places to figure out what is going on. If you put the default logic in a default registry then you only look at the registry. Also, the 'this set of "mappings" is consistent' check would be the responsibility of the registry object so the two callables passed to the field wouldn't be a thing the field would validate. It either works well or doesn't.

altendky avatar Mar 11 '20 14:03 altendky