Tables.jl
Tables.jl copied to clipboard
Tables that are tuples
I have encountered two thorny issues which have arisen because Tables.jl natively implements its interface for certain kinds of tuples. The conflicts arising are not quite the same, which suggests that even if workarounds are found, the underlying problem may reappear elsewhere, so I thought it worth inviting comment here. One issue is in a package I maintain but the other is not. The issues are here and here.
In Tables.jl the following is both a table and a tuple of tables:
X = ((a = [1, 3], b = [2, 3]), (a = [2, 5], b = [4, 7]))
This leads to ambiguity in the issues mentioned.
@darsnack @OkonSamuel
How do you propose to resolve it? In general, under current rules both are valid. Some data structure, conceptually, can be both a table and a collection of tables. What I mean is that you can construct an object that declares itself to be a table, but at the same time it can store tables inside - there is no contradiction.
In general e.g. a vector of NamedTuple
is a table, not only a Tuple
of them:
julia> X = [(a = [1, 3], b = [2, 3]), (a = [2, 5], b = [4, 7])]
2-element Vector{NamedTuple{(:a, :b), Tuple{Vector{Int64}, Vector{Int64}}}}:
(a = [1, 3], b = [2, 3])
(a = [2, 5], b = [4, 7])
julia> Tables.istable(X)
true
In this particular case the root cause of the problem is this definition I think:
function isiterabletable(x::T) where {T}
isiterable(x) || return false
if Base.IteratorEltype(x)==Base.HasEltype()
et = Base.eltype(x)
if et === Union{}
return false
elseif et <: NamedTuple
return true
elseif et===Any
return missing
else
return false
end
else
return missing
end
end
which special cases NamedTuple
eltype.
So, essentially - what you propose is to make Tables.istable
return false
if it is a vector/tuple of NamedTuple
?
One more comment about Tables.istable
:
Note that not all valid tables will return true, since it's possible to satisfy the Tables.jl interface at "run-time", e.g. a Generator of NamedTuples iterates NamedTuples, which satisfies the AbstractRow interface, but there's no static way of knowing that the generator is a table.
That is. Even if Tables.istable
returns false
it does not say that the object is not a table. The contract is one-way only. If true
returned then it is a table; if false
is returned then it does not tell you anything.
@bkamins Thanks indeed for your thoughts.
I didn't quite understand this comment:
In general e.g. a vector of NamedTuple is a table, not only a Tuple of them:
The object you mention is not a tuple. And this example does not cause problems in the cases I mentioned.
I don't have a concrete proposal, unless declaring all unnamed tuples to be non-tables is realistic option.
In the first issue, the case is made that tuples in Julia have a very special meaning, as they are the grouping type for the return value of "multi-valued" functions. There was a feeling that implementing Tables for Base tuple types is therefore a kind of piracy, I guess. They are chaining multiple transformers and want the option of processing multiple data containers at once. Here "multiple" is detected using tuple. An alternative is the use of varargs, but this breaks the nice (and natural) pipeline syntax they are using.
It would be nice to know if there are (unnamed) tuples, that are also tables, in common use.
The object you mention is not a tuple.
That is a point of my comment. Tuple
is not treated in any special way by Tables.jl. It is a collection, just like a vector is a collection. What Tables.jl does is checking eltype
of that collection. If eltype
is a NamedTuple
then the collection is a table.
Examples:
> t1 = ((a=1,b=2),)
((a = 1, b = 2),)
julia> eltype(t1)
NamedTuple{(:a, :b), Tuple{Int64, Int64}}
julia> Tables.istable(t1)
true
julia> t2 = [(a=1,b=2)]
1-element Vector{NamedTuple{(:a, :b), Tuple{Int64, Int64}}}:
(a = 1, b = 2)
julia> eltype(t2)
NamedTuple{(:a, :b), Tuple{Int64, Int64}}
julia> Tables.istable(t2)
true
In summary - type of the container currently does not matter for Tables.jl. Only the type of elements stored in it matter.
Given your comments maybe you want to propose another rule:
Never treat
Tuple
to be a table and makeTables.istable
returnfalse
forTuple
no matter what they store.
This could be acceptable, as it is not very likely that someone stores tables in tuples. It would be mildly breaking, but maybe @quinnj could accept it.
Is introduction of such a rule something you would propose then?
@bkamins Thanks for clarifying the comment!
Given your comments maybe you want to propose another rule:
Never treat Tuple to be a table and make Tables.istable return false for Tuple no matter what they store.
I believe this would indeed resolve both the issues mentioned. Thanks for considering it!
@darsnack Do you agree this is a worthwhile proposal from the point-of-view of https://github.com/JuliaML/MLUtils.jl/issues/61 ?
It seems problematic to decide that tuples should be special-cased and never be considered as tables. What would be the rationale for this in Tables.jl? AFAICT this is a problem at https://github.com/JuliaAI/ScientificTypes.jl/issues/182 because ScientificTypes treats tuples in a special way, but another package could have a similar request about vectors (and indeed it's not uncommon to have ambiguities between tables and other types such as vectors). Why not just add that restriction in ScientificTypes only?
Why not just add that restriction in ScientificTypes only?
@nalimilan Thanks for looking into this.
What would be the rationale for this in Tables.jl?
The rational is I don't see how either issue can be resolved outside of a Tables.jl restriction, as unnatural as that might seem from the point-of-view of the package considered in isolation.
How do you mean "add restriction" in ScientificTypes, exactly? We can add something to documentation, but any user trying to to get ScientficTypes to play with tuple-tables will get unexpected behaviour. In ScientificTypes a basic and very natural requirement is that the scitype of a tuple is the Tuple
of the scitypes. I don't see how we can also require scitype(X) = Table{..}
when isstable(X)==true
if X
might also be a tuple without problems.
What would be your proposal to resolve the issue in MLUtils.jl?
I just mean that MLUtils could disallow using tuples as tables, i.e. throwing an error when getting one or treating them as non-table objects. This is basically the same thing as you propose doing in Tables.jl, but rather than disallowing this everywhere, only do so in MLUtils (where they are problematic).
@nalimilan Thanks. I will suggest, then, that those two packages treat all tuples as non-tabular objects, and document the fact.
@ablaom - I think (I am not sure - just guessing) that what @nalimilan wants to say is that in some contexts Tuple
s as tables can make a lot of sense. For example currently the following tuple is a table:
julia> ((a=1,b=2), (a=3, b=4))
((a = 1, b = 2), (a = 3, b = 4))
Then if you have such a table and want to perform some manipulations on it the output is also Tuple
of NamedTuple
. The point is that such operations are non-allocating. Which means that if user has millions of small tables that need to be processed this is the best way such operation can be expressed as:
- we avoid malloc
- we avoid GC (which in such contexts often kills performance)
Of course this is probably not a realistic scenario for ML applications but it is plausible for some transactional system that needs to response to queries in near real time (and wants to e.g. avoid GC hiccups).
Having said that, maybe there is a work-around for this use case that does not require Tuple
s to get the same effect.
@quinnj - how do you see this issue?