DataFrames.jl
Automatic re-sizing with ByRow in combine
Perhaps for combine and ByRow we should re-size tables:
julia> df = DataFrame(a = [1, 2]);
julia> combine(groupby(df, :a), :a => (t -> [100, 200]) => :b)
4×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 100
2 │ 1 200
3 │ 2 100
4 │ 2 200
julia> combine(groupby(df, :a), :a => ByRow(t -> [100, 200]) => :b)
2×2 DataFrame
Row │ a b
│ Int64 Array…
─────┼───────────────────
1 │ 1 [100, 200]
2 │ 2 [100, 200]
No big deal if not. I actually think this change might not be that hard.
This would be a breaking change so it's fine to wait until much further down the line.
:a => ByRow(t -> [100, 200]) is equivalent to (t -> [100, 200]).(df.a), and what you report is exactly expected. Otherwise there would be no way to obtain the result you present in your question if it were asked for.
Actually the hard case is when you have groups having more than one row in them, and then you need something like (I am writing it verbosely for clarity):
julia> df = DataFrame(a = [1, 1]);
julia> combine(groupby(df, :a), :a => (x -> reduce(vcat, (t -> [100, 200]).(x))) => :b)
4×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 100
2 │ 1 200
3 │ 1 100
4 │ 1 200
or use flatten for post-processing.
But as I wrote on Slack, could an alternative be passing something akin to => AsTable in this case, which would do the flattening automatically? Flattening is not hard per se, but it does require spelling out all the new column names again, which could get tedious quickly.
Maybe => Flattened or something like that?
@jkrumbiegel - we have more space here to show the examples. Can you please give the exact use case you have? I am hesitant to add a Flattened option as I feel it would not be needed frequently. But maybe I am wrong, so it would be best to work out the best approach on a concrete example.
Sure, it is exactly the situation where one has a function returning multiple rows, which is applied ByRow. I think this is quite common, at least I encounter it frequently when cleaning or preparing dataframes for analysis.
You just need a dataframe in which a column contains some kind of complex object, which actually encodes data that belongs to separate rows.
Let me just repeat the example from Slack here so it doesn't get lost:
using DataFrames, Pipe
df = DataFrame(:participant => [1, 2], :blockdescription => ["a,false,b,true", "a,true,b,true"])
function extract_trials(blockdescription)
    # split "a,false,b,true" into (condition, success) pairs
    temp = map(Iterators.partition(split(blockdescription, ","), 2)) do (condition, success)
        condition, parse(Bool, success)
    end
    (condition = first.(temp), success = last.(temp))
end
@pipe df |>
    groupby(_, :participant) |>
    combine(_, :blockdescription => ByRow(extract_trials) => AsTable)
This gives
2×3 DataFrame
Row │ participant condition success
│ Int64 Array… BitVector
─────┼──────────────────────────────────────────────────────
1 │ 1 SubString{String}["a", "b"] Bool[0, 1]
2 │ 2 SubString{String}["a", "b"] Bool[1, 1]
It is then an extra step of work to repeat the column names and flatten the dataframe via flatten(df, [:condition, :success]) to get:
4×3 DataFrame
Row │ participant condition success
│ Int64 String Bool
─────┼────────────────────────────────────
1 │ 1 a false
2 │ 1 b true
3 │ 2 a true
4 │ 2 b true
So I think it's not uncommon to have a row-wise computation resulting in several output rows, and there should be a standard idiom for that, instead of having to go in manually and repeat the newly created column names in a flatten command.
What I have in mind is a sink descriptor that says "the output of this call consists of multiple rows each time, and all new columns should be flattened".
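As a side note on the post-processing route, here is a minimal sketch of one way to avoid repeating the new column names. It assumes that flatten accepts a general column selector such as Not(...), and it builds the nested data frame by hand (the name nested is only for illustration) rather than through extract_trials:

```julia
using DataFrames

# Hand-built stand-in for the 2×3 result shown above (vector-valued columns).
nested = DataFrame(participant = [1, 2],
                   condition = [["a", "b"], ["a", "b"]],
                   success = [[false, true], [true, true]])

# Flatten every column except the grouping key, so the new
# column names do not have to be spelled out again.
flat = flatten(nested, Not(:participant))
```

This only helps when every non-key column should be flattened; with a mix of scalar and vector columns the names would still have to be listed.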
Just to clarify - I have a problem with your example that each group contains only one row. Is this on purpose or not? Do you want in general to allow for multiple rows per group?
The point is that ByRow is designed to process things row-wise, so in your case a more natural thing for me to write would be:
julia> df = DataFrame(:participant => [1, 2], :blockdescription => ["a,false,b,true", "a,true,b,true"])
2×2 DataFrame
Row │ participant blockdescription
│ Int64 String
─────┼───────────────────────────────
1 │ 1 a,false,b,true
2 │ 2 a,true,b,true
julia> df.id = axes(df, 1)
Base.OneTo(2)
julia> function extract_trials(blockdescription)
           temp = map(Iterators.partition(split(blockdescription, ","), 2)) do (condition, success)
               condition, parse(Bool, success)
           end
           (condition = first.(temp), success = last.(temp))
       end
extract_trials (generic function with 1 method)
julia> @pipe df |>
           groupby(_, :id) |>
           combine(_, :blockdescription => (x -> extract_trials(only(x))) => AsTable)
4×3 DataFrame
Row │ id condition success
│ Int64 SubStrin… Bool
─────┼───────────────────────────
1 │ 1 a false
2 │ 1 b true
3 │ 2 a true
4 │ 2 b true
The point is that in your pipeline you do not take the grouping variable into account in any way (there is no logical link between rows in a single group), so the question is why the group-by step is present there at all.
It should also work for multiple rows, in this example there is just one. You are right that the example is not very good in that way, that the grouping is equivalent to doing a row-wise transformation. I have to think about it more, whether my problem stems from a missing feature in DataFrames, or my own misconception of the problem here.
So far, I still believe that it's useful to have ByRow return multiple rows, no matter if that's happening with or without groupby. Of course one can always write the concatenation logic and pass that function without ByRow, but it saves a lot of work not having to write the vector form, and not having to do the concatenation.
I have to think about it more, whether my problem stems from a missing feature in DataFrames, or my own misconception of the problem here.
To be clear - if we decide this is something that is needed I am OK to add it, just it cannot be added to ByRow, but it must be a separate mechanism (and then the question is if it is common enough to justify its existence).
and not having to do the concatenation.
Indeed this is what combine does by default on GroupedDataFrame by design if it gets a multiple-row output (and I take advantage of this fact in my solution).
However, the question is:
- do we need a special syntax in DataFrames.jl mini-language (this should not be done lightly)
- or we can have a lightweight function that does the concatenation and then you can just compose this in your function call (this is what I would prefer)
So to be clear instead of
:col => ByRow(some_fun) => Flatten(:outcol_name)
I would prefer to have
:col => (x -> generic_flatten_function(some_fun.(x))) => :outcol_name
as this is more flexible. The reason is that Flatten() behavior has to be hardcoded in DataFrames.jl, while generic_flatten_function can live on its own, independent of DataFrames.jl. This is much better as DataFrames.jl is a core package and generic_flatten_function is a function that different users might want to work differently, so it is better not to hard-code it into the package. The rules in https://dataframes.juliadata.org/latest/man/split_apply_combine/#The-Split-Apply-Combine-Strategy are already very complex.
For vectors generic_flatten_function is just reduce(vcat, ...). For NamedTuples there is no such function AFAICT but it can be easily added.
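To make the NamedTuple case concrete, here is a minimal sketch of what such a helper could look like in plain Julia. The name flatten_namedtuples is hypothetical (it is not part of DataFrames.jl or Tables.jl), and the sketch assumes every element has the same field names:

```julia
# Hypothetical helper: concatenate a vector of NamedTuples of vectors
# column by column into a single NamedTuple of vectors.
function flatten_namedtuples(rows)
    cols = keys(first(rows))  # field names, taken from the first element
    return NamedTuple{cols}(map(c -> reduce(vcat, getproperty.(rows, c)), cols))
end

# One NamedTuple per input row, as extract_trials would return.
rows = [(condition = ["a", "b"], success = [false, true]),
        (condition = ["a", "b"], success = [true, true])]

flat = flatten_namedtuples(rows)
```

Such a function could then be composed as :col => (x -> flatten_namedtuples(some_fun.(x))) => AsTable, without touching the mini-language.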
@quinnj - actually I think TableOperations.joinpartitions could be used for this. Could you please comment what would be the preferred way to use it to convert:
[(a=[1], b=[2]), (a=[3], b=[4])]
into a table with columns :a and :b containing [1,2] and [3,4] respectively?
Thank you!
I still think this is a good idea.
- :a => ByRow(t -> [100, 200]) is an equivalent of (t -> [100, 200]).(df.a) and what you report is exactly expected. Otherwise there would be no way to obtain the result you present in your question if it were asked for.
  Without ByRow, you are required to use combine(df, :a => (t -> Ref([100, 200]))) to get a vector of vectors. We can make the same requirement with ByRow, making things more consistent.
- I don't think adding Flatten is a good idea. It's too niche a use to have a special syntax for. However, I think flattening the column is the most consistent default.
- The key here is not that each group has one row. The key is that combine re-sizes the data frame while other functions do not. You could have combine(df, :a => ByRow(t -> [100, 200])) and it would produce a data frame with 2x the number of rows of the original.
- To be clear - if we decide this is something that is needed I am OK to add it, just it cannot be added to ByRow, but it must be a separate mechanism (and then the question is if it is common enough to justify its existence).
  I don't fully understand this logic. In my mind, ByRow only guarantees the input is a NamedTuple, but doesn't have to enforce anything about the output.
Nonetheless, I am fine with the current behavior until after 1.0.
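For readers following along, the Ref requirement mentioned above can be sketched like this. It relies on my understanding that combine unwraps a Ref and treats its content as a single row value, while a plain vector is spread over rows:

```julia
using DataFrames

df = DataFrame(a = [1, 2])
gdf = groupby(df, :a)

# A plain vector returned per group is spread over several rows.
spread = combine(gdf, :a => (t -> [100, 200]) => :b)

# Wrapping the vector in Ref keeps it as one cell per group,
# which matches what ByRow currently produces in the example at the top.
wrapped = combine(gdf, :a => (t -> Ref([100, 200])) => :b)
```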
Just to be clear:
combine(_, :blockdescription => ByRow(extract_trials) => AsTable)
is exactly the same as
combine(_, :blockdescription => (x -> extract_trials.(x)) => AsTable)
and this will not change. (DataFrames.jl is not at the stage of design any more 😄)
So the discussion is how should:
combine(_, :blockdescription => (x -> extract_trials.(x)) => AsTable)
be treated. And again - this is already settled - we have a set of rules how "a vector of something" is processed and this will not change.
What we could do is either:
1. add new "verbs" or "nouns" in our mini-language to change how "a vector of something" is processed
2. add utility functions that process "a vector of something" into "a vector of something else" (without affecting the mini-language)
And I am just saying that I prefer option 2 over option 1, as the mini-language should be as minimal as possible (there are already requests to add AsVector, RowNumber, proprow). The difference between these three things and the thing that is requested here is that "flattening" can be done without using the mini-language, whereas AsVector, RowNumber, proprow have to be built-in, as otherwise the things that they are intended to produce are very difficult to do without them.
The key is that combine re-sizes the data frame while other functions do not.
This is an orthogonal issue. The thing that is different is that select and transform are target shape aware (and enforce this shape) while combine accepts any output shape.
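A small sketch of that difference, using a made-up transformation that returns three values for a two-row data frame. The exact error type and message from select are not asserted here, only that it refuses the mismatched length:

```julia
using DataFrames

df = DataFrame(a = [1, 2])

# combine accepts any output shape: 3 values give a 3-row result.
resized = combine(df, :a => (x -> [10, 20, 30]) => :b)

# select enforces the source shape (2 rows), so the same
# transformation is expected to throw.
select_failed = try
    select(df, :a => (x -> [10, 20, 30]) => :b)
    false
catch
    true
end
```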
In particular this means that ByRow must work the same way both for combine and select, because the ByRow definition is independent of which function calls it. ByRow can even be used without select or combine present at all, e.g.:
julia> ByRow(sin)(1:10)
10-element Array{Float64,1}:
0.8414709848078965
0.9092974268256817
0.1411200080598672
-0.7568024953079282
-0.9589242746631385
-0.27941549819892586
0.6569865987187891
0.9893582466233818
0.4121184852417566
-0.5440211108893698
and it is essentially a broadcasting operation:
julia> @code_warntype ByRow(sin)(1:10)
Variables
f::Core.Compiler.Const(ByRow{typeof(sin)}(sin), false)
cols::Tuple{UnitRange{Int64}}
Body::Array{Float64,1}
1 ─ %1 = Base.getproperty(f, :fun)::Core.Compiler.Const(sin, false)
│ %2 = Core.tuple(%1)::Core.Compiler.Const((sin,), false)
│ %3 = Core._apply_iterate(Base.iterate, Base.broadcasted, %2, cols)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(sin),Tuple{UnitRange{Int64}}}
│ %4 = Base.materialize(%3)::Array{Float64,1}
└── return %4
julia> @code_warntype sin.(1:10)
Variables
#self#::Core.Compiler.Const(var"##dotfunction#253#1"(), false)
x1::UnitRange{Int64}
Body::Array{Float64,1}
1 ─ %1 = Base.broadcasted(Main.sin, x1)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(sin),Tuple{UnitRange{Int64}}}
│ %2 = Base.materialize(%1)::Array{Float64,1}
└── return %2
Is this now clearer what I want to say?
To be clearer, what I want to say is that one can write:
julia> df = DataFrame(a = [1, 2]);
julia> flattenme(x) = reduce(vcat, x)
flattenme (generic function with 1 method)
julia> combine(groupby(df, :a), :a => flattenme∘ByRow(t -> [100, 200]) => :b)
4×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 100
2 │ 1 200
3 │ 2 100
4 │ 2 200
and flattenme is not a part of the mini-language, but gets the job done. Exactly the same way as skipmissing works for skipping missing values. We do not have to have skipmissing as a part of the mini-language to use it.
and this will not change. (DataFrames.jl is not at the stage of design any more)
I think this is the most important part. If I had noticed this a few months ago, I would have pushed harder for this change, as I don't think the "consistent with broadcast" reasoning is more important than the consistency with non-ByRow. But this is a fair rule and not too mentally complicated.
I agree that flattenme would be good to have. I will think about where it should live.
Then flattenme should not live in DataFrames.jl as it is a general function for processing tables (that is why I asked @quinnj whether we already have it in TableOperations.jl).
"consistent with broadcast" reasoning is more important than the consistency with non-ByRow
Yes - a few months ago we could have changed the definition of ByRow. But currently the definition of ByRow is that it is a shorthand for broadcast, and it will stay.
This came up on Discourse today, with the complication that the ByRow function returns several columns:
julia> df = DataFrame(a=[1,2])
2×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 2
julia> f(x) = [[10x,11x], [100x,111x]];
julia> combine(groupby(df, :a), :a => ByRow(f) => [:b, :c])
2×3 DataFrame
Row │ a b c
│ Int64 Array… Array…
─────┼─────────────────────────────
1 │ 1 [10, 11] [100, 111]
2 │ 2 [20, 22] [200, 222]
This requires a flatten post-processing. It would be nice if flattenme could also support this case, but the proposed implementation gives a "wrong" result:
julia> flattenme(x) = reduce(vcat, x);
julia> combine(groupby(df, :a), :a => flattenme∘ByRow(f) => [:b, :c])
4×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 10 11
2 │ 1 100 111
3 │ 2 20 22
4 │ 2 200 222
A possibility is to transform the function's output into a matrix before further processing:
julia> combine(groupby(df, :a), :a => flattenme∘ByRow(Base.splat(hcat)∘f) => [:b, :c])
4×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 10 100
2 │ 1 11 111
3 │ 2 20 200
4 │ 2 22 222
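To see why the two variants differ, it may help to look at what flattenme receives, outside of DataFrames.jl. This is a plain-Julia sketch that uses hcat(v...) in place of Base.splat(hcat) and processes both rows in one call rather than group by group:

```julia
f(x) = [[10x, 11x], [100x, 111x]]

# What ByRow(f) produces for the rows a = 1 and a = 2.
per_row = f.([1, 2])

# Plain concatenation yields a vector of 4 inner vectors; each inner vector
# then ends up as one output row spread across :b and :c (the "wrong" table).
naive = reduce(vcat, per_row)

# Turning each row's output into a matrix first puts the intended
# columns side by side, so vertical concatenation stacks them correctly.
as_matrices = [hcat(v...) for v in per_row]
stacked = reduce(vcat, as_matrices)
```

Here the columns of stacked correspond to :b and :c in the second table above.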
But the whole thing is rather involved... It would be nice if there was a more intuitive/explicit fix for combine(groupby(df, :a), :a => ByRow(f) => [:b, :c]) to produce flattened output?
It would be nice if there was a more intuitive/explicit fix for combine(groupby(df, :a), :a => ByRow(f) => [:b, :c]) to produce flattened output?
My feeling is that using flatten for post-processing is most readable.
If one wants this in a transformation I still feel it is better to define a special function that does the flattening rather than extending the mini-language (which already is quite complex).
In general it seems natural to add such a function to SplitApplyCombine.jl. Maybe @andyferris would agree to add a dims argument to SplitApplyCombine.flatten so that it would work the way it is asked for?
Sorry - I'm a bit behind and am still trying to understand this. I'm happy to try and help.
However, I'm a little lost what is being asked for?
If you want to flatten multi-dimensional arrays, there is combinedims which we could extend with a dims argument to give you more control over which dimensions of the inner arrays are brought out, the order of the dimensions, etc. That's not precisely what you want here, but you are modelling a grouped dataframe as a bit like a 1D collection of 2D collections, right?
Is it more of a multiple nesting thing where you want to convert collection[a][b][c] to be new_collection[a,c][b]? (Perhaps not a,c exactly but at least that kind of ordering). Or do you want to control the order better to correct the fault above, so it's like a collection[a][b,c] -> new_collection[a,c,b] kind of ordering?
(Sorry if that notation is super confusing)
The users ask to transform:
[[[1,2], [3,4]],
[[5,6], [7,8]]]
into
[[1,2,5,6], [3,4,7,8]]
In Julia Base the transformation to do it is:
julia> x = [[[1,2], [3,4]],
[[5,6], [7,8]]]
2-element Vector{Vector{Vector{Int64}}}:
[[1, 2], [3, 4]]
[[5, 6], [7, 8]]
julia> [reduce(vcat, getindex.(x, i)) for i in 1:length(x[1])]
2-element Vector{Vector{Int64}}:
[1, 2, 5, 6]
[3, 4, 7, 8]
but it is a bit awkward.
Generally the idea is:
- one gets a collection of collections of collections a
- each element a[i] represents a row, and they all have the same length
- each a[i][j] is a row entry in column j and is itself a collection
- the result should be a collection of collections b, where b[j] is the flattened (i.e. vertically concatenated) set of values selected by a[i][j] over all i
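The rules above can be wrapped into a small Base-only function. This is only a sketch; the name flatten_columns is hypothetical and it assumes every a[i] has the same length:

```julia
# Hypothetical helper following the rules above:
# a[i][j] is the collection produced by row i for column j.
function flatten_columns(a)
    ncols = length(first(a))  # every row is assumed to have this many entries
    return [reduce(vcat, [row[j] for row in a]) for j in 1:ncols]
end

a = [[[1, 2], [3, 4]],
     [[5, 6], [7, 8]]]

b = flatten_columns(a)  # b[j] concatenates a[i][j] over all i
```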
So you mean this?
julia> using SplitApplyCombine
julia> a = [[[1,2], [3,4]],
[[5,6], [7,8]]]
2-element Vector{Vector{Vector{Int64}}}:
[[1, 2], [3, 4]]
[[5, 6], [7, 8]]
julia> flatten.(invert(a))
2-element Vector{Vector{Int64}}:
[1, 2, 5, 6]
[3, 4, 7, 8]
Currently flatten and invert only work on consecutive layers of nesting. So when you have multiple layers of nesting, you have to manage the "stack" like it's Forth code or something (and you basically end up with APL but with words instead of arcane symbols!). The above also has unnecessary intermediate temporaries which isn't ideal.
So yes, if flatten and invert had something about the layer of nesting that you are referring to, it could potentially be easier for users to read and write and more performant... I'm not sure what interface to suggest, though?
These are two other ways of writing it.
julia> map(flatten, invert(a))
2-element Array{Array{Int64,1},1}:
[1, 2, 5, 6]
[3, 4, 7, 8]
julia> invert(mapmany(invert, a))
2-element Array{Array{Int64,1},1}:
[1, 2, 5, 6]
[3, 4, 7, 8]
Not sure if one of these translates more naturally than the other (ByRow is a bit like invert, right?)
Thank you - I assumed there is some easy composition pattern 👍.
Here is another Discourse post that could maybe benefit from re-sizing with ByRow. But flatten after is fine.