Tables.jl icon indicating copy to clipboard operation
Tables.jl copied to clipboard

Tables that are tuples

Open ablaom opened this issue 2 years ago • 10 comments

I have encountered two thorny issues which have arisen because Tables.jl natively implements its interface for certain kinds of tuples. The conflicts arising are not quite the same, which suggests that even if workarounds are found, the underlying problem may reappear elsewhere, so I thought it worth inviting comment here. One issue is in a package I maintain but the other is not. The issues are here and here.

In Tables.jl the following is both a table and a tuple of tables:

X = ((a = [1, 3], b = [2, 3]), (a = [2, 5], b = [4, 7]))

This leads to ambiguity in the issues mentioned.

@darsnack @OkonSamuel

ablaom avatar Mar 03 '22 23:03 ablaom

How do you propose to resolve it? In general, under current rules both are valid. Some data structure, conceptually, can be both a table and a collection of tables. What I mean is that you can construct an object that declares itself to be a table, but at the same time it can store tables inside - there is no contradiction.

In general e.g. a vector of NamedTuple is a table, not only a Tuple of them:

julia> X = [(a = [1, 3], b = [2, 3]), (a = [2, 5], b = [4, 7])]
2-element Vector{NamedTuple{(:a, :b), Tuple{Vector{Int64}, Vector{Int64}}}}:
 (a = [1, 3], b = [2, 3])
 (a = [2, 5], b = [4, 7])

julia> Tables.istable(X)
true

In this particular case the root cause of the problem is this definition I think:

function isiterabletable(x::T) where {T}
    isiterable(x) || return false

    if Base.IteratorEltype(x)==Base.HasEltype()
        et = Base.eltype(x)
        if et === Union{}
            return false
        elseif et <: NamedTuple
            return true
        elseif et===Any
            return missing
        else
            return false
        end
    else
        return missing
    end
end

which special cases NamedTuple eltype.

So, essentially - what you propose is to make Tables.istable return false if it is a vector/tuple of NamedTuple?

One more comment about Tables.istable:

Note that not all valid tables will return true, since it's possible to satisfy the Tables.jl interface at "run-time", e.g. a Generator of NamedTuples iterates NamedTuples, which satisfies the AbstractRow interface, but there's no static way of knowing that the generator is a table.

That is. Even if Tables.istable returns false it does not say that the object is not a table. The contract is one-way only. If true returned then it is a table; if false is returned then it does not tell you anything.

bkamins avatar Mar 04 '22 08:03 bkamins

@bkamins Thanks indeed for your thoughts.

I didn't quite understand this comment:

In general e.g. a vector of NamedTuple is a table, not only a Tuple of them:

The object you mention is not a tuple. And this example does not cause problems in the cases I mentioned.

I don't have a concrete proposal, unless declaring all unnamed tuples to be non-tables is realistic option.

In the first issue, the case is made that tuples in Julia have a very special meaning, as they are the grouping type for the return value of "multi-valued" functions. There was a feeling that implementing Tables for Base tuple types is therefore a kind of piracy, I guess. They are chaining multiple transformers and want the option of processing multiple data containers at once. Here "multiple" is detected using tuple. An alternative is the use of varargs, but this breaks the nice (and natural) pipeline syntax they are using.

It would be nice to know if there are (unnamed) tuples, that are also tables, in common use.

ablaom avatar Mar 06 '22 22:03 ablaom

The object you mention is not a tuple.

That is a point of my comment. Tuple is not treated in any special way by Tables.jl. It is a collection, just like a vector is a collection. What Tables.jl does is checking eltype of that collection. If eltype is a NamedTuple then the collection is a table.

Examples:

> t1 = ((a=1,b=2),)
((a = 1, b = 2),)

julia> eltype(t1)
NamedTuple{(:a, :b), Tuple{Int64, Int64}}

julia> Tables.istable(t1)
true

julia> t2 = [(a=1,b=2)]
1-element Vector{NamedTuple{(:a, :b), Tuple{Int64, Int64}}}:
 (a = 1, b = 2)

julia> eltype(t2)
NamedTuple{(:a, :b), Tuple{Int64, Int64}}

julia> Tables.istable(t2)
true

In summary - type of the container currently does not matter for Tables.jl. Only the type of elements stored in it matter.


Given your comments maybe you want to propose another rule:

Never treat Tuple to be a table and make Tables.istable return false for Tuple no matter what they store.

This could be acceptable, as it is not very likely that someone stores tables in tuples. It would be mildly breaking, but maybe @quinnj could accept it.

Is introduction of such a rule something you would propose then?

bkamins avatar Mar 07 '22 10:03 bkamins

@bkamins Thanks for clarifying the comment!

Given your comments maybe you want to propose another rule:

Never treat Tuple to be a table and make Tables.istable return false for Tuple no matter what they store.

I believe this would indeed resolve both the issues mentioned. Thanks for considering it!

@darsnack Do you agree this is a worthwhile proposal from the point-of-view of https://github.com/JuliaML/MLUtils.jl/issues/61 ?

ablaom avatar Mar 07 '22 20:03 ablaom

It seems problematic to decide that tuples should be special-cased and never be considered as tables. What would be the rationale for this in Tables.jl? AFAICT this is a problem at https://github.com/JuliaAI/ScientificTypes.jl/issues/182 because ScientificTypes treats tuples in a special way, but another package could have a similar request about vectors (and indeed it's not uncommon to have ambiguities between tables and other types such as vectors). Why not just add that restriction in ScientificTypes only?

nalimilan avatar Mar 12 '22 22:03 nalimilan

Why not just add that restriction in ScientificTypes only?

@nalimilan Thanks for looking into this.

What would be the rationale for this in Tables.jl?

The rational is I don't see how either issue can be resolved outside of a Tables.jl restriction, as unnatural as that might seem from the point-of-view of the package considered in isolation.

How do you mean "add restriction" in ScientificTypes, exactly? We can add something to documentation, but any user trying to to get ScientficTypes to play with tuple-tables will get unexpected behaviour. In ScientificTypes a basic and very natural requirement is that the scitype of a tuple is the Tuple of the scitypes. I don't see how we can also require scitype(X) = Table{..} when isstable(X)==true if X might also be a tuple without problems.

What would be your proposal to resolve the issue in MLUtils.jl?

ablaom avatar Mar 13 '22 20:03 ablaom

I just mean that MLUtils could disallow using tuples as tables, i.e. throwing an error when getting one or treating them as non-table objects. This is basically the same thing as you propose doing in Tables.jl, but rather than disallowing this everywhere, only do so in MLUtils (where they are problematic).

nalimilan avatar Mar 14 '22 07:03 nalimilan

@nalimilan Thanks. I will suggest, then, that those two packages treat all tuples as non-tabular objects, and document the fact.

ablaom avatar Mar 15 '22 21:03 ablaom

@ablaom - I think (I am not sure - just guessing) that what @nalimilan wants to say is that in some contexts Tuples as tables can make a lot of sense. For example currently the following tuple is a table:

julia> ((a=1,b=2), (a=3, b=4))
((a = 1, b = 2), (a = 3, b = 4))

Then if you have such a table and want to perform some manipulations on it the output is also Tuple of NamedTuple. The point is that such operations are non-allocating. Which means that if user has millions of small tables that need to be processed this is the best way such operation can be expressed as:

  1. we avoid malloc
  2. we avoid GC (which in such contexts often kills performance)

Of course this is probably not a realistic scenario for ML applications but it is plausible for some transactional system that needs to response to queries in near real time (and wants to e.g. avoid GC hiccups).

Having said that, maybe there is a work-around for this use case that does not require Tuples to get the same effect.

bkamins avatar Mar 15 '22 22:03 bkamins

@quinnj - how do you see this issue?

bkamins avatar Mar 15 '22 22:03 bkamins