Missings.jl icon indicating copy to clipboard operation
Missings.jl copied to clipboard

Union{T,Null} inference in structs

Open davidanthoff opened this issue 8 years ago • 54 comments

Say we have a struct like this:

struct Foo{T}
    v::T
end

And a function like this:

function bar(x)
    x = rand() > 0.5 : 0. : null
    return Foo(x)
end

Then the return type of this is currently inferred as Union{Foo{Float64}, Foo{Nulls.Null}}. This won't work for the whole Query.jl design, I would need the inferred return type here to be Foo{Union{Float64,Null}}.

But here is the twist: I'm not convinced that this would be the right thing to do in general, i.e. that changing the current julia behavior across the board to something that squares with the needs of Query.jl would be a good idea. Note that this is not an issue with the current approach taken in Query.jl where I use the DataValue type from the DataValues.jl package to represent missing values.

davidanthoff avatar May 14 '17 17:05 davidanthoff

Can you give a concrete example where this behavior is a problem?

nalimilan avatar May 14 '17 17:05 nalimilan

So I've been experimenting w/ porting DataStreams/CSV/DataFrames over to use Nulls.jl over the weekend and actually ran into this a little. My issue was passing the parsed value from CSV to be stored in the DataFrame; at first, I was running into issues because I was allocating the DataFrame vectors as Vector{Int}, Vector{Float64}, Vector{Date}, etc. but then trying to pass a null and there were type errors ("can't assign ::Null to ::Date", etc.). The key here is to declare the field types to be Union{T, Null}, so in your case

using Nulls

struct Foo3{T}
    v::T
end

function bar3(x)
    x = rand() > 0.5 ? 0. : null
    return Foo3{?(Float64)}(x)
end

In this case now, @code_warntype bar3(1) correctly has Foo3{Union{Float64, Nulls.Null}} as the return type, and both floats and nulls can be passed as the inner value w/o issue.

So all that to say that I recognize the issue you're talking about here and ran into it myself, but the work-around isn't actually a problem once you stop to think about it. It just takes a little more pre-declaring container types instead of purely relying on inferring the type parameters.

quinnj avatar May 15 '17 14:05 quinnj

For a query like this:

@from i in df begin
    @select {i.a, i.b}
end

we will end up with an anonymous function like this somewhere deep inside Query.jl: i -> Foo(i.a, i.b). The type of Foo is

struct Foo{T1,T2}
    a::T1
    b::T2
end

This anonymous function will be called with instances of another type, say Bar, that represents one row from the source:

struct Bar{T1,T2}
    a::T1
    b::T2
end

The specific concrete type that would be passed into this anonymous function might be Foo{Union{Int,Null},Union{String,Null}}.

@quinnj suggestion to explicitly declare the fields to be Union{T,Null} would work from a technical point of view, but it would essentially destroy usability, because folks now would have to write this query as:

@from i in df begin
    @select @NT(a::?(Int) = i.a, b::?(String) = i.b)
end

davidanthoff avatar May 15 '17 17:05 davidanthoff

@quinnj suggestion to explicitly declare the fields to be Union{T,Null} would work from a technical point of view, but it would essentially destroy usability, because folks now would have to write this query as:

Couldn't the query macros do that work automatically?

nalimilan avatar May 15 '17 17:05 nalimilan

@quinnj suggestion to explicitly declare the fields to be Union{T,Null} would work from a technical point of view, but it would essentially destroy usability, because folks now would have to write this query as:

Couldn't the query macros do that work automatically?

No, the macro doesn't even know that the is will be Bars. You can for example be in a situation where the is will be Foo{Int,String,}, and in that case the anonymous function should return Bar{Int,String} instances, so if the macro rewrote the anonymous function it would break that case.

davidanthoff avatar May 15 '17 18:05 davidanthoff

I think this problem is actually just an instance of two issues that were raised and discussed in the original julep (but never resolved, as far as I can tell).

The core of the problem is this: if you have a higher-order function and pass a procedural parameter that returns a composite type and it operates over one or more collections with element type Union{T,Null}, things get very messy. This affects a large number of use cases beyond Query.jl: dictionary generators, list comprehensions, map and broadcast (and potentially more, I haven't gone through this in a systematic way).

The two issues from the julep (and the discussion in the PR) that are relevant here are 1) John's rational for preferring Union{Some{T},Null} over Union{T,Null} and 2) @andyferris's arguments for using Union{T,Null{T}}. Neither approach would solve the problem, but each points to one of the two problems I'm discussing in this issue.

John's reasoning for preferring Union{Some{T},Null} over Union{T,Null} was that otherwise an f is bound to always return exactly the same thing whether it is called with a T or in a Union{T,Null} setting. But when f returns a composite type, we might actually want to create a composite type that has a Union field if it is called in a context where a missing value can be passed. That is fundamentally not possible with the Union{T,Null} approach, but it bites us with these higher-order functions: we would like them to typically create say an array or a Dict that has elements of the composite type that have Union{T,Null} fields, but instead it will create arrays where the element types are unions of different composite types. Note that this really starts to break once you look at composite types with many fields, you would essentially end up with an array who's element type is a union of 2^number_of_fields different composite types.

@andyferris's reason for preferring Union{T,Null{T}} was that there are cases where you need to return something from a function that is different depending on the type of the missing value. This problem also shows up in the problem of this issue: we need to be able to have an f that returns a composite type that has a Union{T,Null} field. But if we are just called with a Null, without any info about T, we can't do that, we can only return a composite type where the field is of type Null. So higher order functions like map will again return a stream of elements that all have different types, and we would need an array that holds unions of these composite types, instead of an array that holds composite types with union fields. Note also that the main problem with parameterizing the type of a null is the counterfactual return type problem. But if we go with a white listing approach, I think that problem essentially goes away.

The only way to solve both problems with the Union approach that I can see right now would be to use Union{Some{T}, Null{T}}. It would still require the kind of rewrite of a say @select statement in a query that @nalimilan suggested above, but I think with that type this might work (but mainly I haven't thought this one through, so who knows). It won't work easily for any of the other cases I mentioned above (dictionary generators, list comprehensions etc.).

Here is a simple examples that illustrate this problem:

using Nulls

A = Union{Int,Null}[1,2,null,4,null]
B = Union{Int,Null}[1,null,3,4,null]

C = map(Pair, A, B)

Note how this array will have elements with four different concrete types: Pair{Int,Int}, Pair{Int,Null}, Pair{Null,Int} and Pair{Null,Null}. What we don't get is Pair{Union{T,Null},Union{T,Null}}. For types with more fields this can go from 2^2 cases to 2^number_of_fields.

So I think this issue is essentially about two problems with Union{T,Null} that were identified in the julep and its discussion, and at least one of them seemed to be the reason why John actually recommended to not use Union{T,Null}. Was there any other discussion somewhere about these two problems where this was resolved?

davidanthoff avatar May 17 '17 23:05 davidanthoff

Yes, this is still the same original problem here, but I'm also not sure why the same solution doesn't apply

julia> C = map((x,y)->Pair{?(Int),?(Int)}(x, y), A, B)
5-element Array{Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}},1}:
 Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(1, 1)
 Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(2, null)
 Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(null, 3)
 Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(4, 4)
 Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(null, null)

julia>  Pair{?Int, ?Int}[Pair(x, y) for (x, y) in zip(A, B)]
5-element Array{Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}},1}:
 Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(1, 1)
 Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(2, null)
 Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(null, 3)
 Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(4, 4)
 Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(null, null)

i.e. the container just has to be declared/tagged w/ the right ?(T) type parameters upon construction. In my experience porting DataStreams over so far, this was a trivial update and results in no other issues; when streaming, we just ensure the Sink allocates its internal storage as Vector{?T} (if it even needs to allocate at all).

Going back over the John/Andy discussions, I'm inclined to say that it's probably a more theoretical problem than practical; especially since I believe most users won't even run into this issue in their own code. They'll more than likely be calling higher-level libraries like DataStreams/Query which can handle the typing under the hood. Even if they're extracting DataTable/Frame columns to do their own operations, they're already set since the columns themselves will already be correctly typed with Vector{?Int}

quinnj avatar May 17 '17 23:05 quinnj

As a prelude: Like all great works, I feel that reading John's julep was a process of gradually transcending many layers of deeper conceptual understanding :)

That said, after much reflection, I actually think that this application (nullables, nulls, etc) may have been the first to expose a bit of an inconsistency (*) in Julia's type system. For example, "covariant" objects like direct bindings follow the semantics of a dynamic language, while the elements of structs, refs, arrays, etc behave a lot like the types in a static type system. Unfortunately, once you reach any level of complexity, you end up taking covariant objects and throwing them into containers, passing that through a function, which maps these elementwise through another function, etc... and you end up with the problems discussed in this issue.

So, I think that none of

  • Union{T, Null}
  • Union{Some{T}, Null}
  • Union{T, Null{T}}
  • Union{Some{T}, Null{T}}
  • Nullable{T}

are perfectly suitable for all situations - more precisely, I think you can construct objectionable situations for any of the five types above. I think what happened in the Julep is that different people found different objectionable situations for several of the above.

Given that, my suggestion is to choose whatever is easiest for the typical user, and then do your best to patch up the resulting holes that appear - case in point being the OP. At this stage, I'd say that Union{T, Null} seems to take the prize for the "easiest to use" for simple applications. OTOH, I do hope that whatever happens can be made compatible with all the wonderful work done by @davidanthoff (and others), and if a consensus is reached I'm sure we'll all pitch in to help each other.

(*) - inconsistency is definitely too strong of a word (apologies to Jeff B, if you're reading this), but I couldn't think of another one. (Warning, going off-topic). I'm not saying that it's fundamentally broken or anything, but we definitely have to deal with covariant types and their invariant type parameters. In a static language, you would just use Nullable{T}, but hopefully with better syntax (like Swift). In most dynamic languages, you would probably choose use a Null object (like R). If Julia were going to take the dynamic type thing to it's limit, I'm guessing structs with covariant elements would be possible, and in the extreme case even refs and arrays could have semantically covariant eltypes with inference's job to allow them to be unboxed when provably safe... (the advantage of all that being that we wouldn't have to ever invoke inference to figure out the element types of a container (I think this was one of John's main goals with the Julep?), or play the fun game done by untyped comprehensions which I worry might propagate everywhere else in array-land, although that partly has to do with promote not just inference).

andyferris avatar May 18 '17 11:05 andyferris

Yes, declaring the return types in all these cases works. But I think that would prevent us from enabling a large class of very elegant data manipulation solutions. Here is one I just enabled on master in IterableTables.jl, ~~and then I'll try to explain the difficulty in solving this in the Query.jl context~~.

With the latest master version of IterableTables.jl, you can write simple queries that project and filter using standard julia generator syntax:

using DataFrames, IterableTables, NamedTuples

df = DataFrame(a=[1,2,3], b=[4,5,6])

df2 = DataFrame(@NT(c=i.a, b=i.b) for i in getiterator(df) if get(i.a)>2)

This will generate a new DataFrame with columns c and d and only the third row from the original (and of course you could also use any of the other table types supported in IterableTables.jl as the source and sink in this, and mix and match).

This is still somewhat verbose, but I think there is a very good chance that we could get this down to the following in maybe even the julia 1.0 time frame:

df2 = DataFrame({c=i.a, i.b} for i in df if i.a>2)

This would require a few things: First, a named tuple syntax in base. Doesn't have to be {} as I assumed here, but something like that. My sense is that this is likely for julia 1.0 and almost certain in the next update after julia 1.0. Second, the simpler filtering clause just requires that I'll switch IterableTables.jl over to DataValue, which I'm going to do in the next few days. The simpler filter would also work with Union{T,Null}. Third, getting rid of the getiterator call. This probably can easily be achieved for types like TypedTable or IndexedTable that have all their column name and type information encoded in their type, and less easy for something like DataFrame. I have a couple of other ideas for this, but that is probably the most difficult part. In any case, in the worst case, we would just still have the getiterator call there.

In my mind this is syntax that would potentially be used a lot by end users for very simple data manipulation of table data. Now, if we wanted this to work with the Union{T,Null} type, we would have to type annotate the fields of the named tuple that is generated, similar to what @quinnj recommended above, so the same query now would look like this:

df2 = DataFrame({c::?(Int)=i.a, b::?(Int)=i.b} for i in df if i.a>2)

I think that would make this whole idea probably moot, folks don't want to specify every column type in this kind of simple query.

Alright, running out of time, I'll write up why this is not trivially handled in something like Query.jl, or at least why I don't see a way to do this right now. Would obviously be great if one of you could think of a solution to that, once I've described the problem.

Oh, one last thing: I'm not surprised that this all works great in DataStreams.jl. I'm 99% sure I would not run into the slightest problem with Union{T,Null} if I was just looking at IterableTables.jl. But Query.jl is a whole other beast...

davidanthoff avatar May 20 '17 05:05 davidanthoff

Thanks for the details, that helps understanding the issue. So the concrete problem is very similar to this:

julia> using NamedTuples, Nulls
julia> itr = (@NT(c=i.a, b=i.b) for i in (@NT(a=1, b=2), @NT(a=2, b=null)));
julia> collect(itr)
2-element Array{NamedTuples._NT_c_b{Int64,T2} where T2,1}:
 (c = 1, b = 2)   
 (c = 2, b = null)

Or even this:

julia> itr2 = (((i[1], i[2]) for i in ((1, 2), (2, null))));
julia> collect(itr2)
2-element Array{Tuple{Int64,Any},1}:
 (1, 2)   
 (2, null)

The element type is chosen by collect via Base._default_eltype, which in turns relies on inference:

julia> Core.Inference.return_type(first, Tuple{typeof(itr)})
Union{NamedTuples._NT_c_b{Int64,Int64}, NamedTuples._NT_c_b{Int64,Nulls.Null}}

julia> Core.Inference.return_type(first, Tuple{typeof(itr2)})
Tuple{Int64,Any}

Fortunately, even though collect chooses to simplify the element type of the returned array to Array{NamedTuples._NT_c_b{Int64,T2} where T2,1} (via typeintersect), no information is lost about the element type of itr. By adjusting the definition of getiterator to:


function getiterator(source::Base.Generator)
    T = Core.Inference.return_type(first, Tuple{typeof(source)})
    S = typeof(source)
    return IterableTables.GeneratorIterator{T,S}(source)
end

I get this:

julia> eltype(getiterator(itr))
Union{NamedTuples._NT_c_b{Int64,Int64}, NamedTuples._NT_c_b{Int64,Nulls.Null}}

This means you should be able to perform whatever type computation you need inside _DataFrame. When a Union is encountered, you just need to iterate over its parts, extract the NamedTuple types inside, and Union the types corresponding to the same field in each NamedTuple. Not terribly pretty, but it should work.

BTW, I haven't been able to replicate the example from your last post on Julia 0.6, even with IterableTables from git master. The df2 = ... step gives ERROR: type UnionAll has no field types.

nalimilan avatar May 20 '17 12:05 nalimilan

Also, regarding the Pair examples given above, I've noticed this:

julia> [a=>b for (a, b) in zip([1, 2], [2, null])]
2-element Array{Pair{Int64,B} where B,1}:
 1=>2   
 2=>null

Notice that the return type is not Array{Pair{Int, Int}, Pair{Int, Null}}, so we're already applying some transformations to Union types. So I don't see why B where B above couldn't be kept as Union{Int, Null}. This would make a lot of sense if unions are made more efficient.

nalimilan avatar May 20 '17 12:05 nalimilan

I don't see why B where B above couldn't be kept as Union{Int, Null}. This would make a lot of sense if unions are made more efficient.

Perhaps, though it seems to me that the advantage of going as general as possible off the bat for the element type of a mutable object is that the element type doesn't have to be recomputed if a new element is added. I imagine that could be detrimental to performance. (Edit: This makes no sense, disregard)

ararslan avatar May 20 '17 18:05 ararslan

the element type doesn't have to be recomputed if a new element is added. I imagine that could be detrimental to performance.

What do you mean? The type of an array won't change when adding an element, so you can't even add an entry of a different type. This is indeed one of the drawbacks of precise typing, but we accept this for concrete types because it is essential for performance (and safety/predictability BTW).

nalimilan avatar May 20 '17 18:05 nalimilan

What I mean is that if you have Array{Pair{Int64, B} where B, 1}, B <: Any. If you constrain B to be e.g. Union{Int64, Null} then try to push!(x, 1=>1.2), you won't be able to. I (completely naively) assume that Julia defensively makes the non-constant type as wide as possible, since it doesn't know whether you want to add more randomly typed garbage in there. So constraining it to be a Union seems perhaps too narrow. I dunno though. (Also my performance argument above doesn't make any sense.)

ararslan avatar May 20 '17 18:05 ararslan

Yes, but how is that different from e.g. Array{Pair{Int, Int}}? We never complain when our arrays are concretely typed, so why would we care when they are "semi-concretely" (i.e. Union) typed?

nalimilan avatar May 20 '17 18:05 nalimilan

That's a good point. Don't mind me, then...

ararslan avatar May 20 '17 18:05 ararslan

@nalimilan There are really three questions, I believe: 1) can one recover enough type information to construct the right data structure to materialize a query, 2) will that data structure be fast and efficient and 3) can one write code that materializes things that is fast.

In your examples we end up with the following types for elements: a) Tuple{Int64,Any}, b) Union{_NT_c_b{Int64,Int64}, _NT_c_b{Int64,Nulls.Null}} and c) _NT_c_b{Int64,T2} where T2.

I believe b) is something we can only get if there are very few fields in the target struct that can take a missing value. Otherwise you end up with 2^n elements in the union. I think type inference will actually infer things as UnionAll types (as in c)) once the number of fields increases. So, for now I'll just limit the discussion to a) and c).

In the a) case we lost the required type information, i.e. we no longer can tell that the second tuple element can only be an Int or a missing value. So that looks like another example where things go wrong.

c) is interesting. In this case we have to deal with UnionAll types. In your example there is only one type variable, but if there were more fields that can take nulls, this might look like _NT_bla_bla{T1,T2,T3} where T1 where T2 where T3 etc.

Now, I believe you are right that with the type information in c) we could construct say a DataFrame that has the right column types etc, i.e. the type information we need is there and not lost. But if we think about materializing this say into an array, the question becomes whether the changes to julia base will also make arrays of UnionAll types as fast and efficient as arrays of small Union types. As far as I can tell that is not the plan, right? It also seems really almost impossible to me. An array of say type Array{_NT_a_b{T1,T2} where T1 where T2,1} would require a memory layout that can store e.g. _NT_a_b{Int, AnyType} as well as _NT_a_b{Int,Null}. I don't see how this could be achieved in any compact sense without resorting to pointers in the array that point towards the elements stored in the array, which would clearly destroy any performance.

Now, just for the case of materializing something into a DataFrame, the storage problem is not so bad, after all the named tuple would not be stored in any case, but disassembled again into a struct of array storage format. But that still leaves the question whether one can write fast code that materializes things. This is essentially the question what would happen to this loop (from _filldf)

for i in source
    # Store tuple elements into DataFrame columns
end

Type inference would infer the type of i as a UnionAll type. So if there is a plan to make UnionAll types fast in the same way that small Union types are to be made fast, things would work. But again, that is not the plan, right? And again, as long as these UnionAll types are inferred with a generic where T1 clause without any restrictions on T, I don't see how this could be done.

Now, if type inference was able to generally infer these UnionAll types as say _NT_{T1,T2} where T1<:Union{Int,Null} where T2<:Union{Int,Null} and then those kinds of UnionAll types were to get the same performance treatment that is planned for small Union types, things might work. But that would require two (major?) additional changes to julia base, namely for inference to be able to do this and then for code generation to make this case fast.

davidanthoff avatar May 21 '17 05:05 davidanthoff

Note that this is not an issue with the current approach taken in Query.jl where I use the DataValue type from the DataValues.jl package to represent missing values.

What does it do differently?

shashi avatar May 30 '17 21:05 shashi

f(a::Nullable{Int}) = Pair(a,a)

always returns a Pair{Nullable{Int},Nullable{Int}.

f(a::Union{Int,Null} = Pair(a,a)

will return either a Pair{Int,Int} or a Pair{Null,Null}, but not Pair{Union{Int,Null},Union{Int,Null}}.

DataValue behaves like Nullable in this example, and that kind of behavior is what I need in Query.jl.

davidanthoff avatar May 30 '17 22:05 davidanthoff

Unless you define f to be

f(a::Union{Int, Null}) = Pair{Union{Int, Null}, Union{Int, Null}}(a, a)

which has already been pointed out.

quinnj avatar May 30 '17 22:05 quinnj

As I wrote before, that doesn't work with the design for Query. I can't ask users to make these type annotations in every query, and the macro doesn't know anything about types, so it can't automatically rewrite this either. There is no other stage where one could rewrite this, so I don't see how this suggestion can be made to work.

davidanthoff avatar May 30 '17 22:05 davidanthoff

that kind of behavior is what I need in Query.jl.

Why is this? I have some guesses but I can't tell for sure.

shashi avatar Jun 01 '17 16:06 shashi

Will NamedTuple in Base (https://github.com/JuliaLang/julia/pull/22194) help with this issue?

nalimilan avatar Jun 04 '17 12:06 nalimilan

@shashi Query.jl is essentially a Julia implementation of LINQ, so for a general overview of the design any LINQ description should work.

Specifically, in many cases a query will often end up being translated into a chain of iterations, and a bunch of those chained iterators are map-like, i.e.they apply a user provided anonymous function to each element in the iteration (this is true for @select statements, but also for e.g. @group statements). When you collect a query, there is going to be a loop that iterates over these chained iterators. So it is really important that the calls to next of these chained iterators are fast. But if next returns a UnionAll type I don't see how the planned optimizations for small Unions would apply. Given that tables are represented as iterators of named tuples in Query, that would essentially be the most common situation.

Are you coming to juliacon this year? I'll dive into some detail of this in my talk.

davidanthoff avatar Jun 04 '17 17:06 davidanthoff

@nalimilan I don't think that will change anything. I'm really looking forward to NamedTuple in base, it will enable a whole bunch of other cool things in Query.

davidanthoff avatar Jun 04 '17 17:06 davidanthoff

But if next returns a UnionAll type I don't see how the planned optimizations for small Unions would apply.

Do you mean Union{@NT(x){T}, @NT(x){Null}) when you say UnionAll (I'm not sure if that Union is a UnionAll)?

maybe @vtjnash knows if that would work. I think this sort of thing should be taken care of the compiler either way...

I'll dive into some detail of this in my talk.

I'll make sure to attend!

shashi avatar Jun 04 '17 20:06 shashi

always returns a Pair{Nullable{Int},Nullable{Int}

The basic problem here is that this example seems to be too naive: it explicitly asserted that the inputs had to be Nullable{Int}, even though nothing in the program input actually implied (or could supply) that. Other valid expected inputs should include Nullable{Any} and Nullable{Union{}} (non-exhaustive).

Thus to be pedantic and accurate about what the language / compiler can do, the result of this function should really be represented as Union{Pair{Nullable{Int}, Nullable{Int}}, Pair{Nullable{Any}, Nullable{Any}}, Pair{Nullable{Union{}}, Nullable{Union{}}}, ...}. Which after simplifying is the abstract type: Pair{Nullable{T}, Nullable{T}} where T (aka, UnionAll(T, Nullable{T}), which is where the name comes from 🙂)

With the alternate design, the possible outputs are simply Union{Pair{Int, Int}, Pair{Null, Null}} and the set is bounded to these two options. That makes it much easier for the compiler to automatically optimize this code based upon it's actual usage, rather than the library authors type-assertions.

Aside: I'm not sure the latter is necessarily meaningful, so I would expect the intended result is instead Union{Pair{Int, Int}, Null} (e.g. isnull(x) ? x : f(x))

Making the Union representation faster and more ergonomic than the existing Nullable doesn't preclude one from using it in a Nullable-like fashion – this in fact is essentially what I would expect. The goal is just that implementation can become much simpler as the compiler can do the optimization work:

struct Maybe{T} # née Nullable{T}
  value::Union{T, Null}
end

vtjnash avatar Jun 05 '17 02:06 vtjnash

Do you mean Union{@NT(x){T}, @NT(x){Null}) when you say UnionAll (I'm not sure if that Union is a UnionAll)?

@shashi I was a bit sloppy there. If the function creates a named tuple with just one field things get inferred as Union{@NT(a::T),@NT(a::Null)}. With two fields as Union{@NT(a::T,b::T),@NT(a::T,b::Null),@NT(a::Null,b::T),@NT(a::Null,b::Null)}. So these unions grow exponentially in the number of fields. To still deal with this, type inference seems to switch over to presenting them as @NT(a::T1,b::T2,c::T3) where T1,T2,T3, i.e. a UnionAll.

The basic problem here is that this example seems to be too naive: it explicitly asserted that the inputs had to be Nullable{Int}, even though nothing in the program input actually implied (or could supply) that. Other valid expected inputs should include Nullable{Any} and Nullable{Union{}} (non-exhaustive).

@vtjnash Well, for Query.jl I can guarantee that the inputs are always properly typed as say Nullable{Int}. Here is a less contrived example. Say we have a source DataFrame with columns a, b and c. a is a DataVector{Int}, b is a DataVector{String} and c is a DataVector{Float64}. Then assume we have the following query:

@from i in df begin
@select {x=i.a, y=i.b, z=i.c}
@collect DataFrame
end

This will construct a chain of two iterators. The first iterator returns a named tuple for each row in the source DataFrame. The eltype of that iterator is @NT(a::Nullable{Int}, b::Nullable{String}, c::Nullable{Float64}). The second iterator is essentially a map like this map(i->@NT(x=i.a,y=i.b,z=i.c),first_iterator). After the @NT macro did its magic, this looks like this: map(i->_NT_x_y_z(i.a, i.b, i.c), first_iterator). _NT_x_y_z is a struct with three fields x, y and z.

The next function for this second iterator (the one created by map) is guaranteed to always return an element of type _NT_x_y_z{Nullable{Int}, Nullable{String}, Nullable{Float64}}, which is great because the loop that collects things in the end and iterates over these chained iterators is type stable because the call to next of the second iterator is type stable.

So how would this look with Union{T,Null}? The eltype of the first iterator could well be @NT(a::Union{Int,Null}, b::Union{String,Null}, c::Union{Float64,Null}). But the next function of the second iterator, i.e. the map, would be inferred as returning something like _NT_x_y_z{T1, T2, T3} where T1,T2,T3. A collect function that called this next function in a loop would not be fast because UnionAlls won't get the optimizations that you are planning for Union types, right?

With the alternate design, the possible outputs are simply Union{Pair{Int, Int}, Pair{Null, Null}} and the set is bounded to these two options. That makes it much easier for the compiler to automatically optimize this code based upon it's actual usage, rather than the library authors type-assertions.

Aside: I'm not sure the latter is necessarily meaningful, so I would expect the intended result is instead Union{Pair{Int, Int}, Null} (e.g. isnull(x) ? x : f(x))

I don't understand that, but maybe my exposition of the issue was just not very clear. Hopefully the explanation in this post is more comprehensive and closer to the problem I have in Query.jl with Union{T,Null}.

davidanthoff avatar Jun 05 '17 05:06 davidanthoff

I just reread @vtjnash's comment and somehow I had missed/not really noticed the last paragraph:

Making the Union representation faster and more ergonomic than the existing Nullable doesn't preclude one from using it in a Nullable-like fashion – this in fact is essentially what I would expect. The goal is just that implementation can become much simpler as the compiler can do the optimization work:

struct Maybe{T} # née Nullable{T}
    value::Union{T, Null}
end

That kind of design of course would solve the issue for Query.jl, essentially this amounts to sticking with a container based approach, but swapping out the internals of how that container is structured. If there are performance benefits of that over the old Nullable approach it would of course be excellent. From Query.jl's point of view this would probably be a very small change, i.e. I would just have to change the DataValue implementation over to this.

In general, what do folks think about the problem in this issue at this point? I've played with various other attempts to get around this in Query.jl over the last weeks, and I just don't see a solution that works. I did see that various packages are going full steam ahead in switching over to Nulls.jl (DataStreams, CategoricalArrays, DataTables?), but unless we find a solution to this I fear this will just lead to the next fragmentation of the data ecosystem, i.e. Query.jl and friends won't be able to follow suite and will use DataValues.jl, and then the other half of the package world will use Nulls.jl... Is that the plan right now?

davidanthoff avatar Jun 19 '17 05:06 davidanthoff

There was a fair bit of discussion of this topic at juliacon. I don't think everyone was present for all discussions (I'm pretty sure I missed some), but here is what I took away: some folks from the core team thought one could in principle solve this problem (@JeffBezanson), others were a lot more pessimistic that one could find a performant solution to this (@vtjnash). I didn't get a good sense whether folks thought this could all be finished for julia 1.0 (my vague recollection was that when I asked Jeff whether he would be sure that all of this could be done for 1.0, he hesitated, or something like that ;).

I think there was agreement that the following things would have to be done in base for a solution, in addition to the performance work that @quinnj is doing currently:

  • Make named tuples covariant.
  • Make structs that have one type parameter per field covariant.
  • There was some other thing about tuples where I had shown Jeff a problem, and he showed me a fix during some talk, but I don't recall the details. In any case, I believe Jeff had fixed that in about 10 minutes during the talk, so hopefully that code is either already merged or just waiting to be merged.

I don't think this list is necessarily sufficient, but not sure.

davidanthoff avatar Jul 06 '17 13:07 davidanthoff