Tables.jl icon indicating copy to clipboard operation
Tables.jl copied to clipboard

Might Tables.jl be a new Home for `StatsPlots.@df` ?

Open BeastyBlacksmith opened this issue 3 years ago • 32 comments

The @df macro allows writing column names in function calls as in

using StatsPlots
import RDatasets
iris = RDatasets.dataset("datasets", "iris")
@df iris scatter(:SepalLength, :SepalWidth, group = :Species)
@df iris sum(:SepalLength)  # not only useful for plotting

It will be removed from StatsPlots.jl probably in 1.0 and if we don't find a better home, it will got to Plots.jl (https://github.com/JuliaPlots/Plots.jl/pull/3351), but actually I think it should be somewhere more general purpose.

If it's going here it should be probably be renamed to something like @table.

BeastyBlacksmith avatar Aug 23 '22 09:08 BeastyBlacksmith

I don't think Tables.jl is the right place for it. Maybe we should just have a TablePlots.jl package similar to how we have TableOperations.jl?

rofinn avatar Aug 23 '22 15:08 rofinn

Maybe we should just have a TablePlots.jl package

That doesn't make sense, since this macro is not tied to plotting at all. It works for any function call/ block of function calls. It's only historic accident that it ended up in StatsPlots.jl, since there was no Table interface back then.

The only thing that is kind of plotting related is the automatic passing of a label keyword if, the function to be called supports this, but that part could also be stripped and put in a Plots-specific extension of this macro.

BeastyBlacksmith avatar Aug 23 '22 16:08 BeastyBlacksmith

This sounds like something which could live in DataFramesMeta (with a different name). Cc: @pdeffebach @bkamins

nalimilan avatar Aug 25 '22 20:08 nalimilan

Can someone try and see how much of the @df macro functionality can be replicated by just using @with from DataFramesMeta.jl?

pdeffebach avatar Aug 25 '22 21:08 pdeffebach

I already checked it. @with is better. The only problem is that @with works with data frames, while @df works with any tables.

bkamins avatar Aug 25 '22 21:08 bkamins

Ah.

Current implementation of @with does not use any DataFrames.jl-specific things actually, so I could make it broader.

pdeffebach avatar Aug 25 '22 21:08 pdeffebach

So if we document that @with contract is Tables.jl (which I think should be easy) then @df can be made an alias for @with if someone wants it.

Other than that @with has all functionality that @df has but it has some more functionality for better handling special cases (like $)

bkamins avatar Aug 25 '22 21:08 bkamins

Thats a good idea. Could we then move @with to Tables.jl?

BeastyBlacksmith avatar Aug 26 '22 07:08 BeastyBlacksmith

Tables.jl is supposed to be lightweight interface package that other packages depend on, not something where convenience functions intended for end users should live. I don't think any package exists currently for that.

nalimilan avatar Aug 26 '22 07:08 nalimilan

Perhaps TableOperations.jl?

DilumAluthge avatar Aug 26 '22 07:08 DilumAluthge

Perhaps TableOperations.jl?

Sounds good to me

BeastyBlacksmith avatar Sep 06 '22 08:09 BeastyBlacksmith

One problem is the current implementation relies on ByRow, which is defined in DataFrames.jl. I wonder if ByRow can go somewhere else too?

pdeffebach avatar Sep 06 '22 12:09 pdeffebach

ByRow is not depending on DataFrames.jl, so:

struct ByRow{T} <: Function
    fun::T
end

# invoke the generic AbstractVector function to ensure function is called
# exactly once for each element
(f::ByRow)(cols::AbstractVector...) =
    invoke(map,
           Tuple{typeof(f.fun), ntuple(i -> AbstractVector, length(cols))...},
           f.fun, cols...)
(f::ByRow)(table::NamedTuple) = [f.fun(nt) for nt in Tables.namedtupleiterator(table)]

could be extracted to Tables.jl if all agree it is useful to have it there (we probably would need to restrict table::NamedTuple in the second signature to ensure we are working with Tables.columntable).

bkamins avatar Sep 06 '22 12:09 bkamins

@quinnj What do you think?

nalimilan avatar Sep 06 '22 13:09 nalimilan

Yeah, I'd be down for that. If someone can make a PR, we're getting close to doing a new minor release here soon once I get the getrows stuff sorted out.

quinnj avatar Sep 06 '22 13:09 quinnj

I will make a PR since we need to sync it with DataFrames.jl 1.4 release.

bkamins avatar Sep 06 '22 13:09 bkamins

@quinnj - why does ColumnTable allow other containers than vectors?

const ColumnTable = NamedTuple{names, T} where {names, T <: NTuple{N, AbstractArray{S, D} where S}} where {N, D}

bkamins avatar Sep 06 '22 15:09 bkamins

Also let us confirm that the definition of ByRow for Tables.jl would be:

struct ByRow{T} <: Function
    fun::T
end

# invoke the generic AbstractVector function to ensure function is called
# exactly once for each element
(f::ByRow)(cols::AbstractVector...) =
    invoke(map,
           Tuple{typeof(f.fun), ntuple(i -> AbstractVector, length(cols))...},
           f.fun, cols...)
(f::ByRow)(col::AbstractVector) =
    invoke(map,
           Tuple{typeof(f.fun), (AbstractVector,)},
           f.fun, col)
(f::ByRow)(table) = [f.fun(nt) for nt in Tables.namedtupleiterator(table)]

The idea is:

  • if we pass vector or vectors (in the case of a single vector we treat it as a vector even if it is a table) - we just apply map to the passed data (making sure to always process each element)
  • if we pass a table (other than a vector) - then we pass named tuples to the function

What is an important corner case is when a table is a vector. Then ByRow will ignore the fact that such vector is a table (for Tables.RowTable this is fortunately the same behavior, but in general e.g. vector of structs, e.g. [1=>2, 3=>4] will be treated as a vector and not as a table).

This is what we need in DataFrames.jl, but I want to confirm that the same expectation is OK for Tables.jl.

bkamins avatar Sep 06 '22 15:09 bkamins

Better always treat the argument as a table, otherwise the behavior is inconsistent. Do we really need the (f::ByRow)(cols::AbstractVector...) method in DataFrames or can it be considered as an implementation detail that could be done using a separate function?

nalimilan avatar Sep 06 '22 15:09 nalimilan

Better always treat the argument as a table, otherwise the behavior is inconsistent.

I know and that is why I am asking. The problem is that whatever you do if you write in DataFrames.jl e.g.:

julia> using DataFrames

julia> df = DataFrame(col=[1=>2, 3=>4])
2×1 DataFrame
 Row │ col
     │ Pair…
─────┼───────
   1 │ 1=>2
   2 │ 3=>4

julia> select(df, :col => ByRow(identity))
2×1 DataFrame
 Row │ col_identity
     │ Pair…
─────┼──────────────
   1 │ 1=>2
   2 │ 3=>4

you get a no-op, while if you made ByRow always treat a table as a table you would get instead:

julia> select(df, :col => x -> collect(Tables.namedtupleiterator(x)))
2×1 DataFrame
 Row │ col_function
     │ NamedTuple…
─────┼─────────────────────────
   1 │ (first = 1, second = 2)
   2 │ (first = 3, second = 4)

so it would be a change in DataFrames.jl (moreover - a change that we do not want).

bkamins avatar Sep 06 '22 16:09 bkamins

Which means that maybe we need to restrict ByRow to just NamedTuple of vectors (which brings me back to the earlier question - why ColumnTable allows arrays of any dimensionality - this is OK, but I would prefer to have a clear situation)

bkamins avatar Sep 06 '22 16:09 bkamins

I have a diff locally that changes it to:

const ColumnTable = NamedTuple{names, T} where {names, T <: NTuple{N, AbstractVector{S} where S}} where {N}

as I also noticed this definition being too general.

quinnj avatar Sep 06 '22 16:09 quinnj

Then this is problematic. Vectors of named tuples are generally considered as completely valid tables, so it would be weird for them not to work with ByRow if it was moved to Tables.jl. I don't know what to do...

nalimilan avatar Sep 06 '22 16:09 nalimilan

If we want to move @with and let it use @byrow we need the behavior from DataFrames.jl, the reason is that e.g.:

julia> using DataFramesMeta

julia> df = DataFrame(col=[1=>2, 3=>4])
2×1 DataFrame
 Row │ col
     │ Pair…
─────┼───────
   1 │ 1=>2
   2 │ 3=>4

julia> @with df @byrow begin
           :col
       end
2-element Vector{Pair{Int64, Int64}}:
 1 => 2
 3 => 4

is a desired behavior (and not transforming it to named tuples as I have shown above)

I don't know what to do...

As I have said - the only thing that we can do is restrict ByRow to its current behavior in DataFrames.jl (i.e. working on vectors as vectors or Tables.ColumnTable as a special case). Then users - if they have other tables - would have to convert them to Tables.ColumnTable - it is a bit problematic but not end of the world.

Note that it is only a problem if ByRow is used outside of a table, which is not its typical use-case. Normally ByRow is designed to work on columns of a table and then there is no problem.

bkamins avatar Sep 06 '22 17:09 bkamins

I have opened https://github.com/JuliaData/Tables.jl/pull/293 as a follow up to ByRow discussion.

bkamins avatar Sep 07 '22 10:09 bkamins

Then users - if they have other tables - would have to convert them to Tables.ColumnTable - it is a bit problematic but not end of the world.

My fear is rather that custom table types will override ByRow to work on them, so that in the end it supports most table types, except AbstractVectors (including Vector{<:NamedTuple}.

Note that it is only a problem if ByRow is used outside of a table, which is not its typical use-case. Normally ByRow is designed to work on columns of a table and then there is no problem.

I think the difficulty is that we designed ByRow to work in DataFrames, where AsTable => ByRow(f) passes a NamedTuple to ByRow so there's no need to support other kinds of tables. But in TableOperations for example it could make sense for AsTable to support other types. Not sure. Let's hope that ColumnTable is the right choice for most table types, including row-oriented ones.

nalimilan avatar Sep 07 '22 12:09 nalimilan

Let's hope that ColumnTable is the right choice for most table types, including row-oriented ones.

RowTable is also supported now without a problem if a source is row-oriented.

bkamins avatar Sep 07 '22 18:09 bkamins

Right. The point is that as soon as people start using ByRow with more than just ColumnTable, it's likely that somebody will try it with Vector{<:NamedTuple}. And that could happen indirectly, due to a user passing such an object to a package which assumed any table would work.

nalimilan avatar Sep 07 '22 20:09 nalimilan

I agree. That is why I tried to highlight in the documentation that the default API is vectors, and ColumnTable is a special case exception.

bkamins avatar Sep 08 '22 06:09 bkamins