Support more generic unions

Open schani opened this issue 7 years ago • 3 comments

Right now unions have these restrictions:

  • No more than one class type per union
  • No more than one array type per union
  • No more than one map type per union
  • No more than one enum type per union
  • Unions cannot contain both a class and a map type
  • Unions cannot contain both an enum and the string primitive type

Some of these are more easily lifted than others. The enum restrictions, for example, are easy to lift, provided that the enums don't overlap. If they do, we could bail, or just go by whichever is first in the list. If the string primitive type is in the mix as well, we use it if the string we got isn't part of any of the enums.
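A minimal sketch of the enum rule described above, using the "first in the list wins" option for overlapping enums. The enum names, their values, and the function `decodeEnumUnion` are hypothetical and only serve as illustration; none of them are part of quicktype.

```typescript
// Hypothetical enum cases of one union, in declaration order.
const enumCases = {
    Color: ["red", "green", "blue"],
    Size: ["small", "medium", "large"],
} as const;

type EnumName = keyof typeof enumCases;

type EnumOrString =
    | { kind: "enum"; enumName: EnumName; value: string }
    | { kind: "string"; value: string };

// Returns the first enum (in declaration order) that contains the value.
// If no enum matches, falls back to the plain string case when the union has one.
function decodeEnumUnion(raw: string, allowString: boolean): EnumOrString {
    for (const enumName of Object.keys(enumCases) as EnumName[]) {
        const values: readonly string[] = enumCases[enumName];
        if (values.indexOf(raw) !== -1) {
            return { kind: "enum", enumName, value: raw };
        }
    }
    if (allowString) {
        return { kind: "string", value: raw };
    }
    throw new Error(`"${raw}" matches no case of the union`);
}
```

Bailing on overlapping enums instead would mean checking the value lists against each other once, when the code is generated, rather than at decode time.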

The one-array and one-map restrictions can be solved by deserializing the first array element and checking which type it is, i.e. if we have to handle A[] | B[] we parse the first element, and if it's an A we have an A[]. There are cases where things are more complicated, in particular:

  • An int[] vs. a double[] can be disambiguated by any element, i.e. [1,2,3.1415] looks like an int[] at first, until we get to the last element.

  • An empty array or map is undecidable. Do we add a separate union case for that? Do we pick the first array or map type in the oneOf sequence?
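The two complications above can be sketched as follows. This is an illustration under assumptions, not quicktype's implementation: the type `NumberArrayCase` and the function `classifyNumberArray` are hypothetical, and the empty array is given its own case, which is only one of the options raised above.

```typescript
type NumberArrayCase =
    | { kind: "int[]"; items: number[] }
    | { kind: "double[]"; items: number[] }
    | { kind: "empty" };

function classifyNumberArray(items: number[]): NumberArrayCase {
    // An empty array gives no evidence either way, so it gets a separate case here.
    if (items.length === 0) {
        return { kind: "empty" };
    }
    // Every element has to be inspected: one non-integer anywhere rules out int[].
    for (const item of items) {
        if (Math.floor(item) !== item) {
            return { kind: "double[]", items };
        }
    }
    // Caveat: JavaScript has a single number type, so 3.0 and 3 are the same value
    // after JSON.parse. A target language with separate int and double tokens
    // could distinguish them; this sketch cannot.
    return { kind: "int[]", items };
}
```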

The class type case is the most complicated one. Here are some difficulties:

  • There might be cases where the data matches more than one type, such as when empty arrays or maps, optional properties, or enums and strings are involved.

  • The element that breaks the ambiguity between types can come late. Take, for example, classes A: { foo: string } and B: { foo: string, x: bool }. Until we have seen x, or the end of the object, we don't know whether we're dealing with an A or B.

  • There can be ambiguous elements in addition to late ambiguity-breaking types. For example: A: { arr: int[] } and B: { arr: double[], x: bool }. If we see arr first we have to deserialize it into a double[] and then, if we don't see x, convert it into an int[]. Similar cases apply to maps, enums, and even class types that allow ambiguity (imagine, for example, that the arr properties, instead of being arrays, are two classes that differ in a single property, which is an enum in one and a string in the other).

  • There could be more than one disambiguating element. For example: A: { x: int, y: double } and B: { x: double, y: int }. You can't decide between these two cases until you've seen both x and y.

In two of our target languages, Elm and TypeScript, none of this is an issue. TypeScript only does validation of types, so when it handles a union it just checks that the type satisfies one of the cases. Elm is somewhat similar, in that it adopts the inefficient but very generic deserialization strategy of just trying all possible cases one by one, and using the first one that works.
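The "try every case, use the first one that works" strategy can be sketched in TypeScript as below. The `Decoder` type, the `oneOf` combinator, and the two decoders are hypothetical names for illustration; the classes A and B are the ones from the example above. The sketch assumes the JSON is already available as a dynamically typed value, which is what makes retrying possible.

```typescript
// A decoder returns a value of type T or throws on a mismatch.
type Decoder<T> = (json: unknown) => T;

// Tries each decoder in order and returns the first result that succeeds.
function oneOf<T>(decoders: Decoder<T>[]): Decoder<T> {
    return (json) => {
        for (const decode of decoders) {
            try {
                return decode(json);
            } catch (e) {
                // This case did not match; fall through to the next one.
            }
        }
        throw new Error("no union case matched");
    };
}

interface A { foo: string }
interface B { foo: string; x: boolean }

function isRecord(json: unknown): json is Record<string, unknown> {
    return typeof json === "object" && json !== null && !Array.isArray(json);
}

const decodeA: Decoder<A> = (json) => {
    if (isRecord(json)) {
        const foo = json["foo"];
        if (typeof foo === "string") return { foo };
    }
    throw new Error("not an A");
};

const decodeB: Decoder<B> = (json) => {
    if (isRecord(json)) {
        const foo = json["foo"];
        const x = json["x"];
        if (typeof foo === "string" && typeof x === "boolean") return { foo, x };
    }
    throw new Error("not a B");
};

// Order matters: B is tried first because every B would also decode as an A.
const decodeAorB = oneOf<A | B>([decodeB, decodeA]);
```

Each failed case costs a full pass over the value, which is the inefficiency mentioned above.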

In many of the other frameworks we don't really have that choice - they don't let us restart reading the JSON for each case (C++ does, because, like Elm, it deserializes the JSON into an intermediate, dynamically typed, form first). In addition to that, I'm not fond of that kind of inefficiency.

I don't think writing a code generator for class types that's super smart and able to disambiguate at the earliest point is feasible. A pretty generic and feasible way to do this is to generate two types for each union type:

  1. The user-facing one that's nice and exactly what you expect.
  2. A messy internal one that is the element-wise union of the classes, i.e. what you get right now. We can already deserialize this, and once we have the result, we can run a simple disambiguation over it and construct the correct nice user-facing type out of it.
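A rough sketch of this two-type approach, using the A: { arr: int[] } and B: { arr: double[], x: bool } example from earlier. All names here (`AOrBInternal`, `parseInternal`, `disambiguate`, the `kind` tags) are hypothetical, and the disambiguation rules are one plausible choice rather than a settled design.

```typescript
// 1. User-facing types (hypothetical). TypeScript has no separate int type,
//    so the cases are told apart by a tag added for this sketch.
interface A { kind: "A"; arr: number[] }              // arr holds integers only
interface B { kind: "B"; arr: number[]; x: boolean }  // arr may hold any doubles

// 2. Internal element-wise union of A and B: arr widened to double[], x optional.
interface AOrBInternal { arr: number[]; x?: boolean }

// Stand-in for the deserializer that already exists for the merged type.
// It performs no validation here; that is assumed to happen elsewhere.
const parseInternal = (text: string): AOrBInternal => JSON.parse(text);

// Runs after deserialization is complete, so late or multiple
// disambiguating properties are no longer a problem.
function disambiguate(raw: AOrBInternal): A | B {
    if (raw.x !== undefined) {
        return { kind: "B", arr: raw.arr, x: raw.x };
    }
    if (raw.arr.every((n) => Math.floor(n) === n)) {
        return { kind: "A", arr: raw.arr };
    }
    throw new Error("value matches neither A nor B");
}
```

For instance, `disambiguate(parseInternal('{"arr":[1,2]}'))` yields the A case, because x is absent and every element is an integer.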

schani, Feb 03 '18 20:02

I don't think there's a way to make deserializing perfectly unambiguous without some constraints. If all union types always have a clearly defined assignment name, then there can be no ambiguity. An example is { a: double[] }, { b: int[] }. Assuming there is a schema that defines that the object's a property is always of type double[], then there is no ambiguity (b would be another type). Only when trying to infer does it become impossible. I understand that right now the property name isn't taken into account when determining the type of the child JSON node, but perhaps including the property name is necessary. This is effectively how protobuf differentiates: using numerically tagged fields, which are roughly equivalent to unique names.
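One way to read the property-name idea, as a sketch: a schema fixes the element type for each property name, so the decoder never has to look at the elements. The `schema` object, the `TypedArray` type, and `decodeByPropertyName` are hypothetical and exist only for this illustration.

```typescript
// Hypothetical schema: each property name maps to exactly one element type.
const schema = { a: "double", b: "int" } as const;

type PropertyName = keyof typeof schema;

interface TypedArray {
    property: PropertyName;
    elementType: "double" | "int";
    items: number[];
}

function decodeByPropertyName(json: Record<string, unknown>): TypedArray {
    for (const property of Object.keys(schema) as PropertyName[]) {
        const value = json[property];
        if (Array.isArray(value)) {
            // The type comes from the property name alone. The elements are
            // not inspected, and this sketch does not validate them either.
            return { property, elementType: schema[property], items: value as number[] };
        }
    }
    throw new Error("no known property found");
}
```

With this rule, { a: [1, 2] } is a double[] even though every element happens to be an integer.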

Another possibility is to disallow mixing primitives and maps/classes/arrays in a union. So { a: true } and { a: { b: "something" } } would never be considered the same type, but { a: { b: "something" } } and { a: { c: false } } would be unionized. In both cases a points at a map, so there can be a unified type.

> An empty array or map is undecidable.

Maybe if you encounter an empty map or array, consider it null and require the property that contains the empty thing to be optional (so the programmer would know to check). This could even be something that QuickType enforces when reading in the schema, responding with a warning or, more helpfully, an error.

> The class type case is the most complicated one.

I didn't know this was even possible! In my experience, hydrating a true class is really complicated, usually because of internal state or work embedded in the constructor. Why even try to make this work when a consumer will probably wrap the plain structures QuickType generates into a stateful class (if they need that) anyway? Maybe I'm misunderstanding what this means though.

kirbysayshi, Feb 05 '18 19:02

> I didn't know this was even possible! In my experience, hydrating a true class is really complicated, usually because of internal state or work embedded in the constructor. Why even try to make this work when a consumer will probably wrap the plain structures QuickType generates into a stateful class (if they need that) anyway? Maybe I'm misunderstanding what this means though.

Let me know if I understand you correctly: What you would expect as the output from a union of two or more class types is not two classes in the target language, but one single class that contains a union of all the properties of the class types? And, I assume, an additional field that tells you which one of the union cases applies to the deserialized JSON? And you'd require that properties with the same name can't have different types in different union cases?

> Maybe if you encounter an empty map or array, consider it null and require the property that contains the empty thing to be optional (so the programmer would know to check). This could even be something that QuickType enforces when reading in the schema, responding with a warning or, more helpfully, an error.

I don't like this solution: it changes the semantics of array deserialization depending on whether the class is in a union with another class or not. But a simple change makes it better: we can just initialize all non-optional/nullable arrays as empty. If we don't see them, they remain empty, but that's ok, since they're in a union case that doesn't apply anyway.

> Another possibility is to disallow mixing primitives and maps/classes/arrays in a union. So { a: true } and { a: { b: "something" } } would never be considered the same type, but { a: { b: "something" } } and { a: { c: false } } would be unionized. In both cases a points at a map, so there can be a unified type.

I'm sorry, I don't get this point. When you're saying "would never be considered the same type" do you mean that we wouldn't support it? I'm also confused about the map thing - why is this important here? And would you want unions of map types to be supported?

schani, Feb 05 '18 20:02

Here is something we can do pretty easily for languages that only verify objects rather than construct them (right now TypeScript is the only one where we do that):

We can keep unions arbitrarily complex and emit the corresponding types if the language supports it. So if we have a union of class types A and B, instead of unifying them into one class, we keep them in the union and then make the TypeScript renderer emit

type ClassUnion = A | B

For languages that don't support that we would still unify the classes.
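For a verification-only target, checking a value against such a union could look roughly like the sketch below. The shapes of A and B and the guard functions are made up for illustration; this is not the code the TypeScript renderer emits.

```typescript
// Hypothetical class types kept separate instead of being unified.
interface A { foo: string }
interface B { bar: number }

type ClassUnion = A | B;

function isObject(v: unknown): v is Record<string, unknown> {
    return typeof v === "object" && v !== null && !Array.isArray(v);
}

function isA(v: unknown): v is A {
    return isObject(v) && typeof v["foo"] === "string";
}

function isB(v: unknown): v is B {
    return isObject(v) && typeof v["bar"] === "number";
}

// Verification only: the value is accepted if it satisfies any case.
// Nothing is constructed, so it does not matter which case matched.
function isClassUnion(v: unknown): v is ClassUnion {
    return isA(v) || isB(v);
}
```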

schani, Feb 16 '18 19:02