DataFrames.jl Automatic re-sizing with ByRow in combine

Open pdeffebach opened this issue 3 years ago • 22 comments

Perhaps combine should re-size the table when it is used with ByRow:

julia> df = DataFrame(a = [1, 2]);

julia> combine(groupby(df, :a), :a => (t -> [100, 200]) => :b)
4×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1    100
   2 │     1    200
   3 │     2    100
   4 │     2    200

julia> combine(groupby(df, :a), :a => ByRow(t -> [100, 200]) => :b)
2×2 DataFrame
 Row │ a      b          
     │ Int64  Array…     
─────┼───────────────────
   1 │     1  [100, 200]
   2 │     2  [100, 200]

No big deal if not. I actually think this change might not be that hard.

This would be a breaking change so it's fine to wait until much further down the line.

Nov 18 '20 18:11 pdeffebach

:a => ByRow(t -> [100, 200]) is an equivalent of (t -> [100, 200]).(df.a) and what you report is exactly expected. Otherwise there would be no way to obtain the result you present in your question if it were asked for.

Actually the hard case is when you have groups having more than one row in them, and then you need something like (I am writing it verbosely for clarity):

julia> df = DataFrame(a = [1, 1]);

julia> combine(groupby(df, :a), :a => (x -> reduce(vcat, (t -> [100, 200]).(x))) => :b)
4×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1    100
   2 │     1    200
   3 │     1    100
   4 │     1    200

or use flatten for post processing
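The flatten route mentioned above could look roughly like the following. This is a minimal sketch that assumes a DataFrames.jl version providing `flatten(df, cols)`; the variable names `nested` and `flat` are only illustrative.

```julia
using DataFrames

df = DataFrame(a = [1, 1])

# ByRow yields one vector per input row, so :b is a column of vectors here
nested = combine(groupby(df, :a), :a => ByRow(t -> [100, 200]) => :b)

# flatten expands each vector in :b into separate rows, repeating :a
flat = flatten(nested, :b)
```

Here `nested` has two rows and `flat` has four, matching the verbose `reduce(vcat, ...)` version.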

Nov 18 '20 18:11 bkamins

But as I wrote on Slack, could an alternative be passing something akin to => AsTable in this case, which would do the flattening automatically? Flattening is not hard per se, but it does require spelling out all the new column names again, which could get tedious quickly.

Maybe => Flattened or something like that?

Nov 19 '20 07:11 jkrumbiegel

@jkrumbiegel - we have more space here to show the examples. Can you please give the exact use case you have? I am hesitant to add a Flattened option as I feel it would not be needed frequently. But maybe I am wrong - so it would be best to work on a concrete example to work out the best approach.

Nov 19 '20 08:11 bkamins

Sure, it is exactly the situation that one has a function returning multiple rows, which is applied ByRow. I think this is quite common, at least I encounter it frequently when cleaning or preparing dataframes for analysis.

You just need a dataframe in which a column contains some kind of complex object, which actually encodes data that belongs to separate rows.

Let me just repeat the example from Slack here so it doesn't get lost:

df = DataFrame(:participant => [1, 2], :blockdescription => ["a,false,b,true", "a,true,b,true"])

function extract_trials(blockdescription)
    temp = map(Iterators.partition(split(blockdescription, ","), 2)) do (condition, success)
        condition, parse(Bool, success)
    end
    (condition = first.(temp), success = last.(temp))
end

@pipe df |>
    groupby(_, :participant) |>
    combine(_, :blockdescription => ByRow(extract_trials) => AsTable)

This gives

2×3 DataFrame
 Row │ participant   condition                    success    
     │ Int64         Array…                       BitVector  
─────┼──────────────────────────────────────────────────────
   1 │           1   SubString{String}["a", "b"]  Bool[0, 1]
   2 │           2   SubString{String}["a", "b"]  Bool[1, 1]

It then takes an extra step to repeat the column names and flatten the data frame via flatten(df, [:condition, :success]) to get:

4×3 DataFrame
 Row │ participant   condition    success 
     │ Int64         String       Bool    
─────┼────────────────────────────────────
   1 │           1   a            false
   2 │           1   b            true
   3 │           2   a            true
   4 │           2   b            true

So I think it's not uncommon to have a row-wise computation resulting in several output rows, and there should be a standard idiom for that, instead of having to go in manually and repeating the newly created column names in a flatten command.

A sink descriptor that says "the output of this call consists of multiple rows each time, and all new columns should be flattened".
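One possible way to avoid retyping the new column names with the existing API is to select the nested columns by element type. This is only a sketch: it assumes `names(df, T)` (which returns the columns whose element type is a subtype of `T`) is available in the installed DataFrames.jl version, and it uses a hand-built `res` in place of the output of `extract_trials`.

```julia
using DataFrames

# Stand-in for the un-flattened result shown above
res = DataFrame(participant = [1, 2],
                condition = [["a", "b"], ["a", "b"]],
                success = [[false, true], [true, true]])

# Pick every column whose cells are vectors, without naming them by hand
nested_cols = names(res, AbstractVector)

# Flatten all of them at once; the vectors in one row must have equal lengths
flat = flatten(res, nested_cols)
```

This flattens every vector-valued column, so it would not be appropriate if some column is meant to keep vectors as cell values.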

Nov 19 '20 08:11 jkrumbiegel

Just to clarify - my concern with your example is that each group contains only one row. Is this on purpose or not? Do you want in general to allow for multiple rows per group?

The question is that ByRow is designed to process things rowwise, so in your case a more natural thing to write for me would be:

julia> df = DataFrame(:participant => [1, 2], :blockdescription => ["a,false,b,true", "a,true,b,true"])
2×2 DataFrame
 Row │ participant  blockdescription
     │ Int64        String
─────┼───────────────────────────────
   1 │           1  a,false,b,true
   2 │           2  a,true,b,true

julia> df.id = axes(df, 1)
Base.OneTo(2)

julia> function extract_trials(blockdescription)
           temp = map(Iterators.partition(split(blockdescription, ","), 2)) do (condition, success)
               condition, parse(Bool, success)
           end
           (condition = first.(temp), success = last.(temp))
       end
extract_trials (generic function with 1 method)

julia> @pipe df |>
           groupby(_, :id) |>
           combine(_, :blockdescription => (x -> extract_trials(only(x))) => AsTable)
4×3 DataFrame
 Row │ id     condition  success
     │ Int64  SubStrin…  Bool
─────┼───────────────────────────
   1 │     1  a            false
   2 │     1  b             true
   3 │     2  a             true
   4 │     2  b             true

The point is that your pipeline does not take the grouping variable into account in any way (there is no logical link between rows in a single group), so the question is why the group-by step is present there at all.

Nov 19 '20 09:11 bkamins

It should also work for multiple rows, in this example there is just one. You are right that the example is not very good in that way, that the grouping is equivalent to doing a row-wise transformation. I have to think about it more, whether my problem stems from a missing feature in DataFrames, or my own misconception of the problem here.

So far, I still believe that it's useful to have ByRow return multiple rows, no matter if that's happening with or without groupby. Of course one can always write the concatenation logic and pass that function without ByRow, but it saves a lot of work not having to write the vector form, and not having to do the concatenation.

Nov 19 '20 11:11 jkrumbiegel

I have to think about it more, whether my problem stems from a missing feature in DataFrames, or my own misconception of the problem here.

To be clear - if we decide this is something that is needed I am OK to add it, just it cannot be added to ByRow, but it must be a separate mechanism (and then the question is if it is common enough to justify its existence).

and not having to do the concatenation.

Indeed this is what combine does by default on GroupedDataFrame by design if it gets a multiple-row output (and I take advantage of this fact in my solution).

However, the question is:

1. do we need a special syntax in the DataFrames.jl mini-language (this should not be done lightly),
2. or can we have a lightweight function that does the concatenation, which you can then just compose in your function call (this is what I would prefer)?

So to be clear instead of

:col => ByRow(some_fun) => Flatten(:outcol_name)

I would prefer to have

:col => (x -> generic_flatten_function(some_fun.(x))) => :outcol_name

as this is more flexible. The reason is that the behavior of Flatten() has to be hardcoded in DataFrames.jl, while generic_flatten_function can live on its own, independently of DataFrames.jl. This is much better because DataFrames.jl is a core package and generic_flatten_function is a function that different users might want to behave differently, so it is better not to hard-code it into the package. The rules in https://dataframes.juliadata.org/latest/man/split_apply_combine/#The-Split-Apply-Combine-Strategy are already very complex.

For vectors generic_flatten_function is just reduce(vcat, ...). For NamedTuples there is no such function AFAICT but it can be easily added.
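For the NamedTuple case, such a helper might be sketched in Base Julia as follows. The name `flatten_namedtuples` is hypothetical (it is not part of DataFrames.jl or any package mentioned here), and the sketch assumes every NamedTuple has the same field names and that each field holds a vector.

```julia
# Hypothetical helper: vertically concatenate a vector of NamedTuples of vectors
function flatten_namedtuples(rows)
    cols = keys(first(rows))            # field names, e.g. (:a, :b)
    # For each field, collect that field from every row and concatenate
    vals = map(c -> reduce(vcat, getproperty.(rows, c)), cols)
    return NamedTuple{cols}(vals)
end

flatten_namedtuples([(a = [1], b = [2]), (a = [3], b = [4])])
# → (a = [1, 3], b = [2, 4])
```

Under these assumptions it could be composed as `:col => (x -> flatten_namedtuples(some_fun.(x))) => AsTable`, since a NamedTuple of vectors is one of the return shapes that AsTable accepts.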

Nov 19 '20 12:11 bkamins

@quinnj - actually I think TableOperations.joinpartitions could be used for this. Could you please comment what would be the preferred way to use it to convert:

[(a=[1], b=[2]), (a=[3], b=[4])]

into a table with columns :a and :b containing [1,3] and [2,4] respectively?

Thank you!

Nov 19 '20 12:11 bkamins

I still think this is a good idea.

:a => ByRow(t -> [100, 200]) is an equivalent of (t -> [100, 200]).(df.a) and what you report is exactly expected. Otherwise there would be no way to obtain the result you present in your question if it were asked for.

Without ByRow, you are required to use combine(df, :a => (t -> Ref([100, 200]))) to get a vector of vectors. We can make the same requirement with ByRow, making things more consistent.
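To make the Ref point concrete, here is a small sketch of the two non-ByRow cases. The output column name `:b` and the variable names are added here only for illustration.

```julia
using DataFrames

df = DataFrame(a = [1, 2])

# A plain vector return value is spread over rows: 2 rows here
spread = combine(df, :a => (t -> [100, 200]) => :b)

# Wrapping in Ref marks the vector as a single value: 1 row holding [100, 200]
wrapped = combine(df, :a => (t -> Ref([100, 200])) => :b)
```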

I don't think adding Flatten is a good idea. It's too niche a use to have a special syntax for. However, I think flattening the column is the most consistent default.

The key here is not that each group has one row. The key is that combine re-sizes the data frame while other functions do not. You could have combine(df, :a => ByRow(t -> [100, 200])) and it would produce a data frame with 2x the number of rows as the original.

To be clear - if we decide this is something that is needed I am OK to add it, just it cannot be added to ByRow, but it must be a separate mechanism (and then the question is if it is common enough to justify its existence).

I don't fully understand this logic. In my mind, ByRow only guarantees the input is a NamedTuple, but doesn't have to enforce anything about the output.

Nonetheless, I am fine with the current behavior until after 1.0.

Nov 19 '20 14:11 pdeffebach

Just to be clear:

combine(_, :blockdescription => ByRow(extract_trials) => AsTable)

is exactly the same as

combine(_, :blockdescription => (x -> extract_trials.(x)) => AsTable)

and this will not change. (DataFrames.jl is not at the stage of design any more 😄)

So the discussion is how should:

combine(_, :blockdescription => (x -> extract_trials.(x)) => AsTable)

be treated. And again - this is already settled - we have a set of rules how "a vector of something" is processed and this will not change.

What we could do is either:

1. add new "verbs" or "nouns" in our mini-language to change how "a vector of something" is processed
2. add utility functions that process "a vector of something" into "a vector of something else" (without affecting the mini-language)

And I am just saying that I prefer option 2 over option 1, as the mini-language should be as minimal as possible (there are already requests to add AsVector, RowNumber, proprow). The difference between these three things and the thing that is requested here is that "flattening" can be done without using the mini-language, whereas AsVector, RowNumber, proprow have to be built in, as otherwise the things that they are intended to produce are very difficult to do without them.

The key is that combine re-sizes the data frame while other functions do not.

This is an orthogonal issue. The thing that is different is that select and transform are target-shape aware (and enforce this shape) while combine accepts any output shape. In particular this means that ByRow must work the same way both for combine and select, because the definition of ByRow is independent of which function calls it. ByRow can even be used without select or combine present at all, e.g.:

julia> ByRow(sin)(1:10)
10-element Array{Float64,1}:
  0.8414709848078965
  0.9092974268256817
  0.1411200080598672
 -0.7568024953079282
 -0.9589242746631385
 -0.27941549819892586
  0.6569865987187891
  0.9893582466233818
  0.4121184852417566
 -0.5440211108893698

and it is essentially a broadcasting operation:

julia> @code_warntype ByRow(sin)(1:10)
Variables
  f::Core.Compiler.Const(ByRow{typeof(sin)}(sin), false)
  cols::Tuple{UnitRange{Int64}}

Body::Array{Float64,1}
1 ─ %1 = Base.getproperty(f, :fun)::Core.Compiler.Const(sin, false)
│   %2 = Core.tuple(%1)::Core.Compiler.Const((sin,), false)
│   %3 = Core._apply_iterate(Base.iterate, Base.broadcasted, %2, cols)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(sin),Tuple{UnitRange{Int64}}}
│   %4 = Base.materialize(%3)::Array{Float64,1}
└──      return %4

julia> @code_warntype sin.(1:10)
Variables
  #self#::Core.Compiler.Const(var"##dotfunction#253#1"(), false)
  x1::UnitRange{Int64}

Body::Array{Float64,1}
1 ─ %1 = Base.broadcasted(Main.sin, x1)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(sin),Tuple{UnitRange{Int64}}}
│   %2 = Base.materialize(%1)::Array{Float64,1}
└──      return %2

Is it now clearer what I want to say?
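The shape distinction described above can be illustrated with a small sketch (the variable names are only illustrative): select keeps the number of rows of the source, combine takes whatever length the function returns, and ByRow behaves as plain broadcasting in both.

```julia
using DataFrames

df = DataFrame(a = [1, 2])

# select keeps the source shape: 2 rows, each cell of :b holds a vector
kept = select(df, :a => ByRow(t -> [100, 200]) => :b)

# combine accepts whatever length the function returns: 3 rows here
resized = combine(df, :a => (x -> [100, 200, 300]) => :b)

# ByRow itself is just broadcasting, independent of the calling function
same = ByRow(t -> [100, 200])(df.a) == (t -> [100, 200]).(df.a)
```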

Nov 19 '20 15:11 bkamins

Just to be clearer, what I want to say is that one can write:

julia> df = DataFrame(a = [1, 2]);

julia> flattenme(x) = reduce(vcat, x)
flattenme (generic function with 1 method)

julia> combine(groupby(df, :a), :a => flattenme∘ByRow(t -> [100, 200]) => :b)
4×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1    100
   2 │     1    200
   3 │     2    100
   4 │     2    200

and flattenme is not a part of the mini-language, but gets the job done. This is exactly how skipmissing works for skipping missing values: we do not have to have skipmissing as a part of the mini-language to use it.

Nov 19 '20 15:11 bkamins

and this will not change. (DataFrames.jl is not at the stage of design any more 😄)

I think this is the most important part. If I had noticed this a few months ago, I would have pushed harder for this change, as I don't think the "consistent with broadcast" reasoning is more important than the consistency with non-ByRow. But this is a fair rule and not too mentally complicated.

I agree that flattenme would be good to have. I will think about where it should live.

Nov 19 '20 15:11 pdeffebach

Then flattenme should not live in DataFrames.jl, as it is a general function for processing tables (that is why I asked @quinnj whether we already have it in TableOperations.jl).

Nov 19 '20 15:11 bkamins

"consistent with broadcast" reasoning is more important than the consistency with non-ByRow

Yes - a few months ago we could have changed the definition of ByRow. But currently the definition of ByRow is that it is a shorthand for broadcast and it will stay.

Nov 19 '20 15:11 bkamins

This came up on discourse today, with the complication that the ByRow function returns several columns:

julia> df = DataFrame(a=[1,2])
2×1 DataFrame
 Row │ a     
     │ Int64 
─────┼───────
   1 │     1
   2 │     2

julia> f(x) = [[10x,11x], [100x,111x]];

julia> combine(groupby(df, :a), :a => ByRow(f) => [:b, :c])
2×3 DataFrame
 Row │ a      b         c          
     │ Int64  Array…    Array…     
─────┼─────────────────────────────
   1 │     1  [10, 11]  [100, 111]
   2 │     2  [20, 22]  [200, 222]

This requires a flatten post-processing. It would be nice if flattenme could also support this case, but the proposed implementation gives a "wrong" result:

julia> flattenme(x) = reduce(vcat, x);

julia> combine(groupby(df, :a), :a => flattenme∘ByRow(f) => [:b, :c])
4×3 DataFrame
 Row │ a      b      c     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1     10     11
   2 │     1    100    111
   3 │     2     20     22
   4 │     2    200    222

A possibility is to transform the function's output into a matrix before further processing:

julia> combine(groupby(df, :a), :a => flattenme∘ByRow(Base.splat(hcat)∘f) => [:b, :c])
4×3 DataFrame
 Row │ a      b      c     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1     10    100
   2 │     1     11    111
   3 │     2     20    200
   4 │     2     22    222

But the whole thing is rather involved... It would be nice if there was a more intuitive/explicit fix for combine(groupby(df, :a), :a => ByRow(f) => [:b, :c]) to produce flattened output?

Jul 03 '21 13:07 knuesel

It would be nice if there was a more intuitive/explicit fix for combine(groupby(df, :a), :a => ByRow(f) => [:b, :c]) to produce flattened output?

My feeling is that using flatten for post processing is most readable.
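For the two-column example above, the post-processing route could be sketched as follows, assuming `flatten` accepts a vector of column names as in the earlier `flatten(df, [:condition, :success])` call:

```julia
using DataFrames

df = DataFrame(a = [1, 2])
f(x) = [[10x, 11x], [100x, 111x]]

# :b and :c are columns of vectors after this step
nested = combine(groupby(df, :a), :a => ByRow(f) => [:b, :c])

# Flatten both columns together; their vectors must have equal lengths per row
flat = flatten(nested, [:b, :c])
```

This gives the same four rows as the `Base.splat(hcat)` variant, at the cost of naming `:b` and `:c` twice.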

If one wants this in a transformation I still feel it is better to define a special function that does the flattening rather than extending the mini-language (which already is quite complex).

In general such a function seems natural to add to SplitApplyCombine.jl. Maybe @andyferris would agree to add a dims argument to SplitApplyCombine.flatten so that it would work the way that is asked for?

Jul 03 '21 14:07 bkamins

Sorry - I'm a bit behind and am still trying to understand this. I'm happy to try and help.

However, I'm a little lost what is being asked for?

If you want to flatten multi-dimensional arrays, there is combinedims which we could extend with a dims argument to give you more control over which dimensions of the inner arrays are brought out, the order of the dimensions, etc. That's not precisely what you want here, but you are modelling a grouped dataframe as a bit like a 1D collection of 2D collections, right?

Is it more of a multiple nesting thing where you want to convert collection[a][b][c] to be new_collection[a,c][b]? (Perhaps not a,c exactly but at least that kind of ordering). Or you want to control the order better to correct the fault above so it's like collection[a][b,c] -> new_collection[a,c,b] kind of ordering?

(Sorry if that notation is super confusing)

Jul 04 '21 02:07 andyferris

The users ask to transform:

[[[1,2], [3,4]],
[[5,6], [7,8]]]

into

[[1,2,5,6], [3,4,7,8]]

In Base Julia, the transformation that does this is:

julia> x = [[[1,2], [3,4]],
            [[5,6], [7,8]]]
2-element Vector{Vector{Vector{Int64}}}:
 [[1, 2], [3, 4]]
 [[5, 6], [7, 8]]

julia> [reduce(vcat, getindex.(x, i)) for i in 1:length(x[1])]
2-element Vector{Vector{Int64}}:
 [1, 2, 5, 6]
 [3, 4, 7, 8]

but it is a bit awkward.

Generally the idea is:

one gets a collection of collections of collections a
each element a[i] represents a row, and has the same length
each element of a[i][j] is a row entry in column j and is a collection
the result should be a collection of collections b where b[j] is flattened (i.e. vertically concatenated) set of values selected by a[i][j] over all i
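The rules above can be written down as a small Base Julia function. The name `flatten_columns` is hypothetical, and the equal-length check is an assumption added here to make the second rule explicit.

```julia
# Hypothetical helper: a[i][j] is the collection in row i, column j;
# the result b has b[j] equal to the concatenation of a[i][j] over all i
function flatten_columns(a)
    ncols = length(first(a))
    all(row -> length(row) == ncols, a) ||
        throw(ArgumentError("all rows must have the same length"))
    return [reduce(vcat, [row[j] for row in a]) for j in 1:ncols]
end

x = [[[1, 2], [3, 4]],
     [[5, 6], [7, 8]]]

flatten_columns(x)
# → [[1, 2, 5, 6], [3, 4, 7, 8]]
```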

Jul 04 '21 06:07 bkamins

So you mean this?

julia> using SplitApplyCombine

julia> a = [[[1,2], [3,4]],
       [[5,6], [7,8]]]
2-element Vector{Vector{Vector{Int64}}}:
 [[1, 2], [3, 4]]
 [[5, 6], [7, 8]]

julia> flatten.(invert(a))
2-element Vector{Vector{Int64}}:
 [1, 2, 5, 6]
 [3, 4, 7, 8]

Currently flatten and invert only work on consecutive layers of nesting. So when you have multiple layers of nesting, you have to manage the "stack" like it's Forth code or something (and you basically end up with APL but with words instead of arcane symbols!). The above also has unnecessary intermediate temporaries which isn't ideal.

So yes if flatten and invert had something about the layer of nesting that you are referring to it could potentially be easier for users to read and write and more performant... I'm not sure what interface to suggest, though?

Jul 04 '21 12:07 andyferris

These are two other ways of writing it.

julia> map(flatten, invert(a))
2-element Array{Array{Int64,1},1}:
 [1, 2, 5, 6]
 [3, 4, 7, 8]

julia> invert(mapmany(invert, a))
2-element Array{Array{Int64,1},1}:
 [1, 2, 5, 6]
 [3, 4, 7, 8]

Not sure if one of these translates more naturally than the other (ByRow is a bit like invert, right?)

Jul 04 '21 13:07 andyferris

Thank you - I assumed there is some easy composition pattern 👍.

Jul 04 '21 15:07 bkamins

Here is another Discourse post that could maybe benefit from re-sizing with ByRow. But flatten after is fine.

Oct 01 '21 14:10 pdeffebach
