DataFrames.jl icon indicating copy to clipboard operation
DataFrames.jl copied to clipboard

Stacking multiple groups of columns

Open tk3369 opened this issue 4 years ago • 7 comments

I have a data frame that contains multiple groups of columns and I want to stack them up. If there were only one group, then the regular stack function would have worked. But, I have two groups as shown in the following MWE:

df = DataFrame(x=1:3, y1=4:6, y2=7:9, z1=10:12, z2=13:15)

So the data looks like this:

│ Row │ x     │ y1    │ y2    │ z1    │ z2    │
│     │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 4     │ 7     │ 10    │ 13    │
│ 2   │ 2     │ 5     │ 8     │ 11    │ 14    │
│ 3   │ 3     │ 6     │ 9     │ 12    │ 15    │

And, I need to transform it to something like this.

│ Row │ x     │ order │ y     │ z     │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 4     │ 10    │
│ 2   │ 1     │ 2     │ 7     │ 13    │
│ 3   │ 2     │ 1     │ 5     │ 11    │
│ 4   │ 2     │ 2     │ 8     │ 13    │
│ 5   │ 3     │ 1     │ 6     │ 12    │
│ 6   │ 3     │ 2     │ 9     │ 15    │

My current solution is somewhat manual:

df1 = select(df, :x, :x => (_->1) => :order, :y1 => :y, :z1 => :z)
df2 = select(df, :x, :x => (_->2) => :order, :y2 => :y, :z2 => :z)
vcat(df1, df2)

Is it possible to make it more generic since it's really a stacking function?

Just an idea:

stack(df, [[:y1, :z1] => 1, [:y2, :z2] => 2]; variable_name = :order, value_names = [:y, :z])

The values 1 and 2 above are really group names for the columns. In this case, the group is called order.

tk3369 avatar Sep 05 '20 16:09 tk3369

I think the proposal makes sense and thinking of the API what you proposed is roughly the simplest thing that can be proposed.

bkamins avatar Sep 05 '20 19:09 bkamins

Hi! I just want to comment to keep track of this issue. I often encounter instances where this feature would be useful and wonder if there's been any progress. I believe that this relates to a previous discourse question.

MatthewRGonzalez avatar Aug 31 '22 17:08 MatthewRGonzalez

It is tracked. I have now increased the priority to 1.5 release.

Before discussing the API let me first comment what would be my natural way of doing what you want without this change:

julia> @chain df begin
           select(:x, [[:y1, :z1], [:y2, :z2]] .=> ByRow(tuple) .=> ["1", "2"])
           stack(["1", "2"])
           select(:x, :variable, :value => [:y, :z])
       end
6×4 DataFrame
 Row │ x      variable  y      z
     │ Int64  String    Int64  Int64
─────┼───────────────────────────────
   1 │     1  1             4     10
   2 │     2  1             5     11
   3 │     3  1             6     12
   4 │     1  2             7     13
   5 │     2  2             8     14
   6 │     3  2             9     15

(with a different order but the order can be changed with sorting; the originally proposed order is inconsistent with stack design)

Given this let us discuss the request:

  1. Do you and @tk3369 think that the request is needed often enough to justify complicating stack API (given the example above how it can be achieved using current functionalities)
  2. If we were to add it I would choose something like:
stack(df, ([:y1, :z1] => 1, [:y2, :z2] => 2); variable_name = :order, value_name = [:y, :z])

(note that I use Tuple as a second argument - the reason is that it makes dispatch easier and avoids surprise conversions, see example below) Indeed an additional benefit of this API would be that one could write:

stack(df, (:y1 => 1, :y2 => 2); variable_name = :order, value_name = :y)

to conveniently give values in variable column. We would need to think if in this case the view option should be allowed (this is probably doable, but adds another layer of complexity)

Example of surprise conversion showing why vector is problematic:

julia> [:x => true, :y => 1]
2-element Vector{Pair{Symbol, Int64}}:
 :x => 1
 :y => 1

CC @nalimilan

bkamins avatar Aug 31 '22 19:08 bkamins

Thank you for the discussion!

Do you and tk3369 think that the request is needed often enough to justify complicating stack API (given the example above how it can be achieved using current functionalities)

I find these types of transformations to be useful and would certainly appreciate the convenience of this added functionality within the stack function. The approach you provide above is certainly sufficient, though perhaps less intuitive for new Julia and DataFrames.jl users. With this said, I'm unsure about the costs associated with adding complexity to the API, so it's difficult for me to provide a complete justification.

In terms of implementation, I fully agree with your proposal above (including the use of the Tuple as the second argument). It seems intuitive. I do not feel that the view would be particularly necessary for most cases.

MatthewRGonzalez avatar Aug 31 '22 22:08 MatthewRGonzalez

With this said, I'm unsure about the costs associated with adding complexity to the API

I have already thought about it - it is not that hard, so I think we can add it in 1.5.

bkamins avatar Sep 01 '22 05:09 bkamins

Makes sense. I wonder whether the tidyverse's pivot_longer handles this. From a quick that doesn't seem to be the case.

nalimilan avatar Sep 01 '22 09:09 nalimilan

For me a more natural way to think about this is via nested columns (per example I have given) but if users prefer unnested columns we can add it.

bkamins avatar Sep 01 '22 11:09 bkamins

I am closing this in favor of https://github.com/JuliaData/DataFrames.jl/issues/3237 (to have a single place to discuss all related issues)

bkamins avatar Dec 05 '22 11:12 bkamins