DataFrames.jl Discussion regarding custom row-like outputs in `combine` and `transform`

I'm currently working on a small implementation of a Tables.AbstractRow. It's basically a wrapper around an OrderedDict which is a Tables.AbstractRow and defines all the convenience methods of a DataFrameRow, like indexing with Not.

The goal is to make it easier to work with complicated combine calls where to iteratively build up your output. This is actually quite difficult and I think there are a few inconsistencies in the API

transform works fine with ByRow. ByRow returns a vector of Tables.AbstractRow objects, which is itself a Table
With combine using the src => fun => AsTable output, I can collapse to get single rows when I define Tables.columntable for my DataRow type. However this trick doesn't work when using an anonymous function. With the anonymous function, the output really has to be a matrix, named tuple, or data frame.

This might be an inconsistency in the API.

Regardless, I will move forward with the development of my package and call NamedTuple at the very end of my combine call to get the desired behavior.

Dec 09 '20 23:12 pdeffebach

Okay I've been digging into this more. It's possible to implement the behavior I want exactly with wrap(x::Tables.AbstractRow) = x in the splitapplycombine code.

Getting automatic "spreading" to work for this new type in transform and select requires expanding a few type signatures, but I can get it to work.

I think that the only type signatures I ever expanded were those that also allowed DataFrameRow. So I think making DataFrameRow a subtype of Tables.AbstractRow and then changing type signatures in this context to allow for Tables.AbstractRow would work.

The advantages are making it easier for user-defined types to work.

I guess this proposal doesn't really work though since Tables is all about defining an interface and not working with abstract types. We could always add extra checks for if something implements the abstract row interface, but we already do Tables.columntable for type Any...

No clear solutions as of yet.

Dec 10 '20 02:12 pdeffebach

I am marking it both "breaking" and "non-braking" as I am not sure yet if what you propose is breaking or not 😄.

What we currently have is that we have a special treatment of with AsTable as target column names:

AbstractDataFrame, NamedTuple, DataFrameRow, AbstractMatrix (this is legacy that was in DataFrames.jl before I started working with it)
AbstratVector of iterables (that gets expanded)
all else is passed to Tables.columntable (so if you ensure this function returns what you want then all should work)

What would be exactly your proposal? To add "Tables.AbstractRowto the list in the first bullet? (it was not included asTablesAbstractRow` did not exist what that list was created 😄 - that is why I say that I am cross if this would be breaking or not)

Dec 10 '20 07:12 bkamins

Indeed, I do not have a proposal as of yet. Maybe this is more of a gripe about how it's tougher than I thought to extend the interface and how I think the implementations between combine with a Callable and combine with AsTable are perhaps inconsistent.

Consider a type, DataRow <: Tables.AbstractRow, that behavior essentially like a named tuple.

julia> transform(df, :a => ByRow(t -> DataRow(x = 4, y = 5)) => AsTable)
2×4 DataFrame
 Row │ a      b      x      y     
     │ Int64  Int64  Int64  Int64 
─────┼────────────────────────────
   1 │    10      3      4      5
   2 │    20      4      4      5

The above works only because Vector{<:AbstractRow} is a Table. So ByRow(fun). This hits the same path as ByRow(t -> (x = 4, y = 5)).

If I don't implement Tables.column for DataRow, then I get

julia> transform(df, :a => (t -> DataRow(x = 4, y = 5)) => AsTable)
ERROR: ArgumentError: 'DataRow' iterates 'Int64' values, which doesn't satisfy the Tables.jl `AbstractRow` interface

That is, spreading is not automatic and it's not clear how DataRow can "opt-in" to spreading.

Moving on to combine. Without Tables.columntable for DataRow I get

julia> combine(groupby(df, :a), :a => (t -> DataRow(x = 4, y = 5)) => AsTable)
ERROR: ArgumentError: 'DataRow' iterates 'Int64' values, which doesn't satisfy the Tables.jl `AbstractRow` interface

When I define


function Tables.columntable(r::DataRow)
	pnames = propertynames(r)
	vals = values(r)
	N = length(r)
	nms = ntuple(i -> pnames[i], N)
	vals = ntuple(i -> [vals[i]], N)
	NamedTuple{nms}(vals)
end

It works, since there is a Tables.columntable check in the post-processing with AsTable

julia> combine(groupby(df, :a), :a => (t -> DataRow(x = 4, y = 5)) => AsTable)
2×3 DataFrame
 Row │ a      x      y     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │    10      4      5
   2 │    20      4      5

same with ByRow,

julia> combine(groupby(df, :a), :a => ByRow(t -> DataRow(x = 4, y = 5)) => AsTable)
2×3 DataFrame
 Row │ a      x      y     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │    10      4      5
   2 │    20      4      5

Finally, with a Callable we get a Vector of DataRows. The automatic promotion from Vector{<:Tables.AbstractRow} to a table that we saw with transform doesn't happen.

julia> combine(groupby(df, :a)) do _           
            DataRow(x = 1, y = 2)
       end
2×2 DataFrame
 Row │ a      x1                      
     │ Int64  DataRow                 
─────┼────────────────────────────────
   1 │    10  DataRow: (x = 1, y = 2)
   2 │    20  DataRow: (x = 1, y = 2)

With the Tables.columntable definition as above, we get the same:

julia> combine(groupby(df, :a)) do _
           DataRow(x = 1, y = 2)
       end
2×2 DataFrame
 Row │ a      x1                      
     │ Int64  DataRow                 
─────┼────────────────────────────────
   1 │    10  DataRow: (x = 1, y = 2)
   2 │    20  DataRow: (x = 1, y = 2)

This is inconsistent with the output from the AsTable call above.

I guess the actionable item is to make the post-processing of the combine with Callable consistent with post-processing of combine with AsTable. We could do the following

Add a Tables.columntable post-processing step with combine for both AsTable and Callable. Then we special case outputs like NamedTuple, Vector, DataFrame etc.

That would break the second-to-last example above

Formalize the notion of a Row. With ByRow this means

NamedTuple
DataFrameRow
Anything satisfying the Tables.AbstractRow interface. There is not yet a Tables.isrow function though
But this check would have to compete with Tables.columntable above.

Don't do any special post-processing for NamedTuple or DataFrameRow, but then check after a new vector has been created whether or not that vector is itself a table, and expand those columns as needed. This would be consistent with the behavior in ByRow with transform above. This would require Vector{DataFrameRow} to be a table.

Dec 10 '20 18:12 pdeffebach

Thank you for your comments. Here are my answers:

the implementations between combine with a Callable and combine with AsTable are perhaps inconsistent.

Yes - they are inconsistent unfortunately for legacy reasons.

That is, spreading is not automatic and it's not clear how DataRow can "opt-in" to spreading.

It cannot - as it is not a table. It is a row of a table. You would have to write e.g.:

transform(df, :a => (t -> [DataRow(x = 4, y = 5)]) => AsTable)

It works, since there is a Tables.columntable

Yes, because it makes DataRow a table (not only a row of a table but also a table)

Finally, with a Callable we get

Yes Callable is a different API (it is a legacy API that was designed and fixed a long time ago). This API is:

returning a data frame, a matrix, a NamedTuple, or a DataFrameRow will produce multiple columns in the result. Returning any other value produces a single column.

This is inconsistent with the output from the AsTable call above.

Yes, because this is a different API (unfortunately). But I have just double checked and it is documented like this.

I guess the actionable item is to make the post-processing of the combine with Callable consistent with post-processing of combine with AsTable.

We will not do it because it would be massively breaking. E.g.:

julia> combine(DataFrame(a = 1), :a => (x -> [(1,2,3)]) => AsTable)
1×3 DataFrame
 Row │ x1     x2     x3
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      3

julia> combine(DataFrame(a = 1), :a => (x -> [(1,2,3)]))
1×1 DataFrame
 Row │ a_function
     │ Tuple…
─────┼────────────
   1 │ (1, 2, 3)

julia> combine(DataFrame(a = 1), x -> [(1,2,3)])
1×1 DataFrame
 Row │ x1
     │ Tuple…
─────┼───────────
   1 │ (1, 2, 3)

and the legacy behavior (the last one) follows the second one not the first one as you propose. As I have said - this has been in DataFrames.jl for many years and there is a risk that legacy code relies on this behavior and such changes are very tricky to find in large code bases.

Formalize the notion of a Row. With ByRow this means

Notion of Row is orthogonal to ByRow.

Anything satisfying the Tables.AbstractRow interface.

This could be added - it would be breaking, but only mildly, so probably it is acceptable to treat any AbstractRow as NamedTuple. Currently the way to use it is to wrap it in a 1-element vector and use AsTable as output.

Don't do any special post-processing for NamedTuple or DataFrameRow

Changing this would be very breaking.

Dec 10 '20 18:12 bkamins

Thanks for the detailed reply.

It is unfortunate that we can't break that inconsistency in combine. The ad-hoc-ness of only allowing NamedTuple, AbstractDataFrame, DataFrameRow and AbstractMatrix definitely bugs me. It would be nice if there were a way to opt-in to this with something from Tables.

The other thing that bugs me is that the transform with ByRow only works because the output vector happens to be a Table. This is not something that the Tables.AbstractRow interface requires (and it does not want people to subtype).

Can you confirm this is intended? And how you conceptualized this post-processing promotion when you wrote the code? I read through it but it wasn't obvious.

Also note that

julia> transform(df, :a => (t -> [DataRow(x = 4, y = 5)]) => AsTable)
ERROR: ArgumentError: New columns must have the same length as old columns

fails, neither does Ref. So I dont think there is a great way to opt-in to spreading beyond converting to NamedTuple.

Dec 10 '20 18:12 pdeffebach

transform(df, :a => (t -> [DataRow(x = 4, y = 5)]) => AsTable) fails

This is a good point - it is not easy to spread unless you convert to NamedTuple or DataFrameRow.

The ad-hoc-ness of only allowing NamedTuple, AbstractDataFrame, DataFrameRow and AbstractMatrix definitely bugs me.

This is the legacy of the "early days" of DataFrames.jl unfortunately.

This is not something that the Tables.AbstractRow interface requires

could you please explain what you mean here? as I am not fully clear here.

And how you conceptualized this post-processing promotion when you wrote the code?

Also here I am not fully clear what you are asking about. In general ByRow code has nothing to do with post-processing (ByRow is just a way to write a broadcasted operation without using a . which is not supported).

As I have said I think we can make AbstractRow behave the same way as NamedTuple as it is only mildly breaking and it would resolve your issue with spreading.

Dec 10 '20 19:12 bkamins

As I have said I think we can make AbstractRow behave the same way as NamedTuple as it is only mildly breaking and it would resolve your issue with spreading.

That would be a natural appraoch, however something can follow the interface of Tables.AbstractRow without actually subtying. According to the docs, when working with Tables, package writers are discouraged from relying on subtyping.

This is not something that the Tables.AbstractRow interface requires

could you please explain what you mean here? as I am not fully clear here.

That's what I mean in this comment. Currently, the fact that Vector{<:AbstractRow} is a table is just nice to have, but people who write a row-like object wont necessarily have defined Vector{RowType} to be a table. Perhaps this is something that can be added to the interface cc @quinnj .

So there is a tension between, if a type is not NamedTuple, AbstractDataFrame etc. whether we should use Tables.columns or some yet-to-be-defined equivelent method for rows.

Dec 10 '20 19:12 pdeffebach

package writers are discouraged from relying on subtyping.

Then we would need isrow trait for type of element returned to be added to Tables.jl as you proposed earlier. I agree that given Julia does not support multiple inheritance now it is a better solution.

Would adding such isrow trait in Tables.jl and then using this trait as a check in DataFrames.jl solve your issues (I know this is not all you wanted but this would be OK to add as isrow trait does not exist now so we would not break anything :smile: ).

Dec 10 '20 19:12 bkamins

It would have to be isroworcolumntable because both transform and combine might get both as a result.

So maybe there should be a better way to navigate exactly how an object fits in the Tables.jl interface

Dec 10 '20 19:12 pdeffebach

So let us keep this issue open for now, and could you open a discussion in Tables.jl to have @quinnj comment on how it would fit there? Thank you!

Dec 10 '20 19:12 bkamins

@pdeffebach , back in your post here, could you expound on how exactly DataRow is defined/implemented? I just want to note that just doing struct DataRow <: Tables.AbstractRow isn't enough to satisfy the required Tables.jl interface, and may be leading to some of the issues you see. So to clarify, even if you subtype Tables.AbstractRow, you still need to define Tables.getcolumn(dr::DataRow, i::Int), Tables.getcolumn(dr::DataRow, nm::Symbol), and Tables.columnnames(dr::DataRow). You may well be doing that, and if so, you can mostly ignore this comment, but I just wanted to make sure since someone else was recently confused by this exact issue (they thought just subtyping was "enough").

Dec 10 '20 20:12 quinnj

Thanks. Indeed I implemented all the methods. I even re-implemented the ones that are default by Tables.AbstractRow. getproperty, getindex, iterate, haskey, includeing Tables.getcolumn etc. along with coversions to NamedTuple

Dec 10 '20 21:12 pdeffebach

Ok cool; I don't really understand the nuances of the combine/transform/ByRow/AsTable implementations to really understand what the issue is here, but if there's something Tables.jl-specific, happy to chat more: probably just need things boiled down a little more so I can understand what's not working as expected or whatever.

Dec 10 '20 21:12 quinnj

In short:

DataFrames.jl currently recognizes DataFrameRow and NamedTuple of scalars as a "row" type (and in consequence expands them to multiple columns in transformations).
These are the only two types that are of "row kind" that are currently recognized by DataFrames.jl.
@pdeffebach would like to have a custom type that DataFrames.jl would treat in the same way as DataFrameRow.
For this we need either to start recognizing subtypes of AbstractRow as being of "row kind" or we need to add some trait e.g. istablerow that DataFrames.jl could use to decide that some value should be treated in the same way in transformations as DataFrameRow.

Note that we do not want all types to be treated this way, as most often when doing transformations the user wants to keep the result in a single column (so e.g. vector of structs is not expanded to multiple columns in transformations although it is Tables.jl table). We want types to explicitly opt-in to be treated as rows of a table and then get special treatment.

Dec 10 '20 21:12 bkamins

On top of that, there are times when we are given an object and we need to decide if it is a "row" or it is a "table". So it would be nice to have a way of handling those distinctions via Tables

Dec 10 '20 21:12 pdeffebach

Ok, that makes more sense. Yeah, there aren't really great ways of defining istable or isrow very accurately due to Julia's flexibility. One idea to consider is just adding Tables.Row to the "blessed" list of row types. It's a wrapper type that wraps any "row"-interface compatible object and gives that row object all the Tables.AbstractRow behavior, but would be easy to dispatch on.

Dec 10 '20 22:12 quinnj

I wrote a package that solves this problem with meta-programming. It's called AddToField.

using DataFrames, PrettyTables, Chain
julia> df = DataFrame(
           group = repeat(1:2, 50),
           income_reported = rand(100),
           income_imputed = rand(100));

julia> @chain df begin 
           groupby(:group)
           combine(_; keepkeys = false) do d
               @addnt begin 
                   @add "Group" first(d.group)
                   @add "Mean reported income" m_reported = mean(d.income_reported)
                   @add "Mean imputed income" m_imputed = mean(d.income_imputed)
                   @add "Difference" m_reported - m_imputed
               end
           end
           pretty_table(;nosubheader = true)
       end
┌───────┬──────────────────────┬─────────────────────┬────────────┐
│ Group │ Mean reported income │ Mean imputed income │ Difference │
├───────┼──────────────────────┼─────────────────────┼────────────┤
│     1 │             0.523069 │             0.53696 │ -0.0138915 │
│     2 │             0.473178 │             0.41845 │  0.0547277 │
└───────┴──────────────────────┴─────────────────────┴────────────┘

Mar 30 '21 22:03 pdeffebach

Nice. Why do you want it separate from DataFramesMeta.jl?

Mar 30 '21 22:03 bkamins

It's a general tool that has no external dependencies apart from MacroTools. It makes it really easy to create NamedTuples in general so making it in DataFramesMeta would be too restrictive.

Mar 30 '21 22:03 pdeffebach

Does this discussion have anything to do with why the code written by @bkamins here: https://bkamins.github.io/julialang/2020/12/24/minilanguage.html

df = DataFrames.DataFrame(id = 1:6,
                      name = ["Aaron Aardvark", "Belen Barboza",
                              "春 陈", "Даниил Дубов",
                              "Elżbieta Elbląg", "Felipe Fittipaldi"],
                      age = [50, 45, 40, 35, 30, 25],
                      eye = ["blue", "brown", "hazel", "blue", "green", "brown"],
                      grade_1 = [95, 90, 85, 90, 95, 90],
                      grade_2 = [75, 80, 65, 90, 75, 95],
                      grade_3 = [85, 85, 90, 85, 80, 85])

DataFrames.combine(df,
:name => DataFrames.ByRow(x -> (; ([:firsname, :lastname] .=> split(x))...)))

Doesn't work? Error: ArgumentError: return value of type NamedTuple{(:firsname, :lastname),Tuple{SubString{String},SubString{String}}} is currently not allowed with ByRow.

Edit: It seems that I needed to update my version of Julia, my apologies!

Aug 24 '21 06:08 Ethycs

@Ethycs:

The code you have pasted is not what I have posted in the blog. It is missing => AsTable (which expands the column into multiple columns)
The code you have pasted works - it just does not expand columns.
You must be using a very old version of DataFrames.jl, as the change that removed this error was merged ~1 year ago in https://github.com/JuliaData/DataFrames.jl/pull/2461

Aug 24 '21 06:08 bkamins

Sorry to close and re-open. I was going to make a comment but then decided not to. Accidentally closed it after.

However making ByRow outputs require conforming to the row interface for Tables.jl instead of keys still seems like a valid (breaking) change.

Sep 24 '21 13:09 pdeffebach

The question is how pressing this is in practice. While I agree it would improve consistency but the problem is that keys is supported by many standard collections, while Tables.AbstractRow interface is not.

Sep 24 '21 14:09 bkamins

Not pressing. Unless we see reports of users making hundreds or thousands of new columns from a ByRow(f) => AsTable operation.

Sep 24 '21 14:09 pdeffebach

DataFrames.jl DataFrames.jl copied to clipboard

Discussion regarding custom row-like outputs in `combine` and `transform`

DataFrames.jl
DataFrames.jl copied to clipboard