DataFrames.jl
DataFrames.jl copied to clipboard
Discussion regarding custom row-like outputs in `combine` and `transform`
I'm currently working on a small implementation of a Tables.AbstractRow
. It's basically a wrapper around an OrderedDict
which is a Tables.AbstractRow
and defines all the convenience methods of a DataFrameRow
, like indexing with Not
.
The goal is to make it easier to work with complicated combine
calls where to iteratively build up your output. This is actually quite difficult and I think there are a few inconsistencies in the API
-
transform
works fine withByRow
.ByRow
returns a vector ofTables.AbstractRow
objects, which is itself a Table - With
combine
using thesrc => fun => AsTable
output, I can collapse to get single rows when I defineTables.columntable
for myDataRow
type. However this trick doesn't work when using an anonymous function. With the anonymous function, the output really has to be a matrix, named tuple, or data frame.
This might be an inconsistency in the API.
Regardless, I will move forward with the development of my package and call NamedTuple
at the very end of my combine
call to get the desired behavior.
Okay I've been digging into this more. It's possible to implement the behavior I want exactly with wrap(x::Tables.AbstractRow) = x
in the splitapplycombine
code.
Getting automatic "spreading" to work for this new type in transform
and select
requires expanding a few type signatures, but I can get it to work.
I think that the only type signatures I ever expanded were those that also allowed DataFrameRow
. So I think making DataFrameRow
a subtype of Tables.AbstractRow
and then changing type signatures in this context to allow for Tables.AbstractRow
would work.
The advantages are making it easier for user-defined types to work.
I guess this proposal doesn't really work though since Tables
is all about defining an interface and not working with abstract types. We could always add extra checks for if something implements the abstract row interface, but we already do Tables.columntable
for type Any
...
No clear solutions as of yet.
I am marking it both "breaking" and "non-braking" as I am not sure yet if what you propose is breaking or not π.
What we currently have is that we have a special treatment of with AsTable
as target column names:
-
AbstractDataFrame
,NamedTuple
,DataFrameRow
,AbstractMatrix
(this is legacy that was in DataFrames.jl before I started working with it) -
AbstratVector
of iterables (that gets expanded) - all else is passed to
Tables.columntable
(so if you ensure this function returns what you want then all should work)
What would be exactly your proposal? To add "Tables.AbstractRowto the list in the first bullet? (it was not included as
TablesAbstractRow` did not exist what that list was created π - that is why I say that I am cross if this would be breaking or not)
Indeed, I do not have a proposal as of yet. Maybe this is more of a gripe about how it's tougher than I thought to extend the interface and how I think the implementations between combine
with a Callable
and combine
with AsTable
are perhaps inconsistent.
Consider a type, DataRow <: Tables.AbstractRow
, that behavior essentially like a named tuple.
julia> transform(df, :a => ByRow(t -> DataRow(x = 4, y = 5)) => AsTable)
2Γ4 DataFrame
Row β a b x y
β Int64 Int64 Int64 Int64
ββββββΌββββββββββββββββββββββββββββ
1 β 10 3 4 5
2 β 20 4 4 5
The above works only because Vector{<:AbstractRow}
is a Table
. So ByRow(fun)
. This hits the same path as ByRow(t -> (x = 4, y = 5))
.
If I don't implement Tables.column
for DataRow
, then I get
julia> transform(df, :a => (t -> DataRow(x = 4, y = 5)) => AsTable)
ERROR: ArgumentError: 'DataRow' iterates 'Int64' values, which doesn't satisfy the Tables.jl `AbstractRow` interface
That is, spreading is not automatic and it's not clear how DataRow
can "opt-in" to spreading.
Moving on to combine
. Without Tables.columntable
for DataRow
I get
julia> combine(groupby(df, :a), :a => (t -> DataRow(x = 4, y = 5)) => AsTable)
ERROR: ArgumentError: 'DataRow' iterates 'Int64' values, which doesn't satisfy the Tables.jl `AbstractRow` interface
When I define
function Tables.columntable(r::DataRow)
pnames = propertynames(r)
vals = values(r)
N = length(r)
nms = ntuple(i -> pnames[i], N)
vals = ntuple(i -> [vals[i]], N)
NamedTuple{nms}(vals)
end
It works, since there is a Tables.columntable
check in the post-processing with AsTable
julia> combine(groupby(df, :a), :a => (t -> DataRow(x = 4, y = 5)) => AsTable)
2Γ3 DataFrame
Row β a x y
β Int64 Int64 Int64
ββββββΌβββββββββββββββββββββ
1 β 10 4 5
2 β 20 4 5
same with ByRow
,
julia> combine(groupby(df, :a), :a => ByRow(t -> DataRow(x = 4, y = 5)) => AsTable)
2Γ3 DataFrame
Row β a x y
β Int64 Int64 Int64
ββββββΌβββββββββββββββββββββ
1 β 10 4 5
2 β 20 4 5
Finally, with a Callable
we get a Vector
of DataRow
s. The automatic promotion from Vector{<:Tables.AbstractRow}
to a table that we saw with transform
doesn't happen.
julia> combine(groupby(df, :a)) do _
DataRow(x = 1, y = 2)
end
2Γ2 DataFrame
Row β a x1
β Int64 DataRow
ββββββΌββββββββββββββββββββββββββββββββ
1 β 10 DataRow: (x = 1, y = 2)
2 β 20 DataRow: (x = 1, y = 2)
With the Tables.columntable
definition as above, we get the same:
julia> combine(groupby(df, :a)) do _
DataRow(x = 1, y = 2)
end
2Γ2 DataFrame
Row β a x1
β Int64 DataRow
ββββββΌββββββββββββββββββββββββββββββββ
1 β 10 DataRow: (x = 1, y = 2)
2 β 20 DataRow: (x = 1, y = 2)
This is inconsistent with the output from the AsTable
call above.
I guess the actionable item is to make the post-processing of the combine
with Callable
consistent with post-processing of combine
with AsTable
. We could do the following
- Add a
Tables.columntable
post-processing step withcombine
for bothAsTable
andCallable
. Then we special case outputs likeNamedTuple
,Vector
,DataFrame
etc.
That would break the second-to-last example above
- Formalize the notion of a
Row
. WithByRow
this means
-
NamedTuple
-
DataFrameRow
- Anything satisfying the
Tables.AbstractRow
interface. There is not yet aTables.isrow
function though - But this check would have to compete with
Tables.columntable
above.
- Don't do any special post-processing for
NamedTuple
orDataFrameRow
, but then check after a new vector has been created whether or not that vector is itself a table, and expand those columns as needed. This would be consistent with the behavior inByRow
withtransform
above. This would requireVector{DataFrameRow}
to be a table.
Thank you for your comments. Here are my answers:
the implementations between
combine
with aCallable
andcombine
withAsTable
are perhaps inconsistent.
Yes - they are inconsistent unfortunately for legacy reasons.
That is, spreading is not automatic and it's not clear how
DataRow
can "opt-in" to spreading.
It cannot - as it is not a table. It is a row of a table. You would have to write e.g.:
transform(df, :a => (t -> [DataRow(x = 4, y = 5)]) => AsTable)
It works, since there is a
Tables.columntable
Yes, because it makes DataRow
a table (not only a row of a table but also a table)
Finally, with a
Callable
we get
Yes Callable
is a different API (it is a legacy API that was designed and fixed a long time ago). This API is:
returning a data frame, a matrix, a NamedTuple, or a DataFrameRow will produce multiple columns in the result. Returning any other value produces a single column.
This is inconsistent with the output from the
AsTable
call above.
Yes, because this is a different API (unfortunately). But I have just double checked and it is documented like this.
I guess the actionable item is to make the post-processing of the combine with
Callable
consistent with post-processing of combine withAsTable
.
We will not do it because it would be massively breaking. E.g.:
julia> combine(DataFrame(a = 1), :a => (x -> [(1,2,3)]) => AsTable)
1Γ3 DataFrame
Row β x1 x2 x3
β Int64 Int64 Int64
ββββββΌβββββββββββββββββββββ
1 β 1 2 3
julia> combine(DataFrame(a = 1), :a => (x -> [(1,2,3)]))
1Γ1 DataFrame
Row β a_function
β Tupleβ¦
ββββββΌββββββββββββ
1 β (1, 2, 3)
julia> combine(DataFrame(a = 1), x -> [(1,2,3)])
1Γ1 DataFrame
Row β x1
β Tupleβ¦
ββββββΌβββββββββββ
1 β (1, 2, 3)
and the legacy behavior (the last one) follows the second one not the first one as you propose. As I have said - this has been in DataFrames.jl for many years and there is a risk that legacy code relies on this behavior and such changes are very tricky to find in large code bases.
Formalize the notion of a
Row
. WithByRow
this means
Notion of Row
is orthogonal to ByRow
.
Anything satisfying the
Tables.AbstractRow
interface.
This could be added - it would be breaking, but only mildly, so probably it is acceptable to treat any AbstractRow
as NamedTuple
. Currently the way to use it is to wrap it in a 1-element vector and use AsTable
as output.
Don't do any special post-processing for
NamedTuple
orDataFrameRow
Changing this would be very breaking.
Thanks for the detailed reply.
It is unfortunate that we can't break that inconsistency in combine
. The ad-hoc-ness of only allowing NamedTuple
, AbstractDataFrame
, DataFrameRow
and AbstractMatrix
definitely bugs me. It would be nice if there were a way to opt-in to this with something from Tables.
The other thing that bugs me is that the transform
with ByRow
only works because the output vector happens to be a Table
. This is not something that the Tables.AbstractRow
interface requires (and it does not want people to subtype).
Can you confirm this is intended? And how you conceptualized this post-processing promotion when you wrote the code? I read through it but it wasn't obvious.
Also note that
julia> transform(df, :a => (t -> [DataRow(x = 4, y = 5)]) => AsTable)
ERROR: ArgumentError: New columns must have the same length as old columns
fails, neither does Ref
. So I dont think there is a great way to opt-in to spreading beyond converting to NamedTuple
.
transform(df, :a => (t -> [DataRow(x = 4, y = 5)]) => AsTable)
fails
This is a good point - it is not easy to spread unless you convert to NamedTuple
or DataFrameRow
.
The ad-hoc-ness of only allowing
NamedTuple
,AbstractDataFrame
,DataFrameRow
andAbstractMatrix
definitely bugs me.
This is the legacy of the "early days" of DataFrames.jl unfortunately.
This is not something that the
Tables.AbstractRow
interface requires
could you please explain what you mean here? as I am not fully clear here.
And how you conceptualized this post-processing promotion when you wrote the code?
Also here I am not fully clear what you are asking about. In general ByRow
code has nothing to do with post-processing (ByRow
is just a way to write a broadcasted operation without using a .
which is not supported).
As I have said I think we can make AbstractRow
behave the same way as NamedTuple
as it is only mildly breaking and it would resolve your issue with spreading.
As I have said I think we can make
AbstractRow
behave the same way asNamedTuple
as it is only mildly breaking and it would resolve your issue with spreading.
That would be a natural appraoch, however something can follow the interface of Tables.AbstractRow
without actually subtying. According to the docs, when working with Tables, package writers are discouraged from relying on subtyping.
This is not something that the
Tables.AbstractRow
interface requirescould you please explain what you mean here? as I am not fully clear here.
That's what I mean in this comment. Currently, the fact that Vector{<:AbstractRow}
is a table is just nice to have, but people who write a row-like object wont necessarily have defined Vector{RowType}
to be a table. Perhaps this is something that can be added to the interface cc @quinnj .
So there is a tension between, if a type is not NamedTuple
, AbstractDataFrame
etc. whether we should use Tables.columns
or some yet-to-be-defined equivelent method for rows.
package writers are discouraged from relying on subtyping.
Then we would need isrow
trait for type of element returned to be added to Tables.jl as you proposed earlier. I agree that given Julia does not support multiple inheritance now it is a better solution.
Would adding such isrow
trait in Tables.jl and then using this trait as a check in DataFrames.jl solve your issues (I know this is not all you wanted but this would be OK to add as isrow
trait does not exist now so we would not break anything :smile: ).
It would have to be isroworcolumntable
because both transform
and combine
might get both as a result.
So maybe there should be a better way to navigate exactly how an object fits in the Tables.jl interface
So let us keep this issue open for now, and could you open a discussion in Tables.jl to have @quinnj comment on how it would fit there? Thank you!
@pdeffebach , back in your post here, could you expound on how exactly DataRow
is defined/implemented? I just want to note that just doing struct DataRow <: Tables.AbstractRow
isn't enough to satisfy the required Tables.jl interface, and may be leading to some of the issues you see. So to clarify, even if you subtype Tables.AbstractRow
, you still need to define Tables.getcolumn(dr::DataRow, i::Int)
, Tables.getcolumn(dr::DataRow, nm::Symbol)
, and Tables.columnnames(dr::DataRow)
. You may well be doing that, and if so, you can mostly ignore this comment, but I just wanted to make sure since someone else was recently confused by this exact issue (they thought just subtyping was "enough").
Thanks. Indeed I implemented all the methods. I even re-implemented the ones that are default by Tables.AbstractRow
. getproperty
, getindex
, iterate
, haskey
, includeing Tables.getcolumn
etc. along with coversions to NamedTuple
Ok cool; I don't really understand the nuances of the combine
/transform
/ByRow
/AsTable
implementations to really understand what the issue is here, but if there's something Tables.jl-specific, happy to chat more: probably just need things boiled down a little more so I can understand what's not working as expected or whatever.
In short:
- DataFrames.jl currently recognizes
DataFrameRow
andNamedTuple
of scalars as a "row" type (and in consequence expands them to multiple columns in transformations). - These are the only two types that are of "row kind" that are currently recognized by DataFrames.jl.
- @pdeffebach would like to have a custom type that DataFrames.jl would treat in the same way as
DataFrameRow
. - For this we need either to start recognizing subtypes of
AbstractRow
as being of "row kind" or we need to add some trait e.g.istablerow
that DataFrames.jl could use to decide that some value should be treated in the same way in transformations asDataFrameRow
.
Note that we do not want all types to be treated this way, as most often when doing transformations the user wants to keep the result in a single column (so e.g. vector of structs is not expanded to multiple columns in transformations although it is Tables.jl table). We want types to explicitly opt-in to be treated as rows of a table and then get special treatment.
On top of that, there are times when we are given an object and we need to decide if it is a "row" or it is a "table". So it would be nice to have a way of handling those distinctions via Tables
Ok, that makes more sense. Yeah, there aren't really great ways of defining istable
or isrow
very accurately due to Julia's flexibility. One idea to consider is just adding Tables.Row
to the "blessed" list of row types. It's a wrapper type that wraps any "row"-interface compatible object and gives that row object all the Tables.AbstractRow
behavior, but would be easy to dispatch on.
I wrote a package that solves this problem with meta-programming. It's called AddToField.
using DataFrames, PrettyTables, Chain
julia> df = DataFrame(
group = repeat(1:2, 50),
income_reported = rand(100),
income_imputed = rand(100));
julia> @chain df begin
groupby(:group)
combine(_; keepkeys = false) do d
@addnt begin
@add "Group" first(d.group)
@add "Mean reported income" m_reported = mean(d.income_reported)
@add "Mean imputed income" m_imputed = mean(d.income_imputed)
@add "Difference" m_reported - m_imputed
end
end
pretty_table(;nosubheader = true)
end
βββββββββ¬βββββββββββββββββββββββ¬ββββββββββββββββββββββ¬βββββββββββββ
β Group β Mean reported income β Mean imputed income β Difference β
βββββββββΌβββββββββββββββββββββββΌββββββββββββββββββββββΌβββββββββββββ€
β 1 β 0.523069 β 0.53696 β -0.0138915 β
β 2 β 0.473178 β 0.41845 β 0.0547277 β
βββββββββ΄βββββββββββββββββββββββ΄ββββββββββββββββββββββ΄βββββββββββββ
Nice. Why do you want it separate from DataFramesMeta.jl?
It's a general tool that has no external dependencies apart from MacroTools. It makes it really easy to create NamedTuples in general so making it in DataFramesMeta would be too restrictive.
Does this discussion have anything to do with why the code written by @bkamins here: https://bkamins.github.io/julialang/2020/12/24/minilanguage.html
df = DataFrames.DataFrame(id = 1:6,
name = ["Aaron Aardvark", "Belen Barboza",
"ζ₯ ι", "ΠΠ°Π½ΠΈΠΈΠ» ΠΡΠ±ΠΎΠ²",
"ElΕΌbieta ElblΔ
g", "Felipe Fittipaldi"],
age = [50, 45, 40, 35, 30, 25],
eye = ["blue", "brown", "hazel", "blue", "green", "brown"],
grade_1 = [95, 90, 85, 90, 95, 90],
grade_2 = [75, 80, 65, 90, 75, 95],
grade_3 = [85, 85, 90, 85, 80, 85])
DataFrames.combine(df,
:name => DataFrames.ByRow(x -> (; ([:firsname, :lastname] .=> split(x))...)))
Doesn't work? Error: ArgumentError: return value of type NamedTuple{(:firsname, :lastname),Tuple{SubString{String},SubString{String}}} is currently not allowed with ByRow.
Edit: It seems that I needed to update my version of Julia, my apologies!
@Ethycs:
- The code you have pasted is not what I have posted in the blog. It is missing
=> AsTable
(which expands the column into multiple columns) - The code you have pasted works - it just does not expand columns.
- You must be using a very old version of DataFrames.jl, as the change that removed this error was merged ~1 year ago in https://github.com/JuliaData/DataFrames.jl/pull/2461
Sorry to close and re-open. I was going to make a comment but then decided not to. Accidentally closed it after.
However making ByRow
outputs require conforming to the row interface for Tables.jl instead of keys
still seems like a valid (breaking) change.
The question is how pressing this is in practice. While I agree it would improve consistency but the problem is that keys
is supported by many standard collections, while Tables.AbstractRow
interface is not.
Not pressing. Unless we see reports of users making hundreds or thousands of new columns from a ByRow(f) => AsTable
operation.