DataFrames.jl
DataFrames.jl copied to clipboard
Stacking multiple groups of columns
I have a data frame that contains multiple groups of columns and I want to stack them up. If there were only one group, then the regular stack
function would have worked. But, I have two groups as shown in the following MWE:
df = DataFrame(x=1:3, y1=4:6, y2=7:9, z1=10:12, z2=13:15)
So the data looks like this:
│ Row │ x │ y1 │ y2 │ z1 │ z2 │
│ │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ 4 │ 7 │ 10 │ 13 │
│ 2 │ 2 │ 5 │ 8 │ 11 │ 14 │
│ 3 │ 3 │ 6 │ 9 │ 12 │ 15 │
And, I need to transform it to something like this.
│ Row │ x │ order │ y │ z │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ 1 │ 4 │ 10 │
│ 2 │ 1 │ 2 │ 7 │ 13 │
│ 3 │ 2 │ 1 │ 5 │ 11 │
│ 4 │ 2 │ 2 │ 8 │ 13 │
│ 5 │ 3 │ 1 │ 6 │ 12 │
│ 6 │ 3 │ 2 │ 9 │ 15 │
My current solution is somewhat manual:
df1 = select(df, :x, :x => (_->1) => :order, :y1 => :y, :z1 => :z)
df2 = select(df, :x, :x => (_->2) => :order, :y2 => :y, :z2 => :z)
vcat(df1, df2)
Is it possible to make it more generic since it's really a stacking function?
Just an idea:
stack(df, [[:y1, :z1] => 1, [:y2, :z2] => 2]; variable_name = :order, value_names = [:y, :z])
The values 1
and 2
above are really group names for the columns. In this case, the group is called order
.
I think the proposal makes sense and thinking of the API what you proposed is roughly the simplest thing that can be proposed.
Hi! I just want to comment to keep track of this issue. I often encounter instances where this feature would be useful and wonder if there's been any progress. I believe that this relates to a previous discourse question.
It is tracked. I have now increased the priority to 1.5 release.
Before discussing the API let me first comment what would be my natural way of doing what you want without this change:
julia> @chain df begin
select(:x, [[:y1, :z1], [:y2, :z2]] .=> ByRow(tuple) .=> ["1", "2"])
stack(["1", "2"])
select(:x, :variable, :value => [:y, :z])
end
6×4 DataFrame
Row │ x variable y z
│ Int64 String Int64 Int64
─────┼───────────────────────────────
1 │ 1 1 4 10
2 │ 2 1 5 11
3 │ 3 1 6 12
4 │ 1 2 7 13
5 │ 2 2 8 14
6 │ 3 2 9 15
(with a different order but the order can be changed with sorting; the originally proposed order is inconsistent with stack
design)
Given this let us discuss the request:
- Do you and @tk3369 think that the request is needed often enough to justify complicating
stack
API (given the example above how it can be achieved using current functionalities) - If we were to add it I would choose something like:
stack(df, ([:y1, :z1] => 1, [:y2, :z2] => 2); variable_name = :order, value_name = [:y, :z])
(note that I use Tuple
as a second argument - the reason is that it makes dispatch easier and avoids surprise conversions, see example below)
Indeed an additional benefit of this API would be that one could write:
stack(df, (:y1 => 1, :y2 => 2); variable_name = :order, value_name = :y)
to conveniently give values in variable
column.
We would need to think if in this case the view
option should be allowed (this is probably doable, but adds another layer of complexity)
Example of surprise conversion showing why vector is problematic:
julia> [:x => true, :y => 1]
2-element Vector{Pair{Symbol, Int64}}:
:x => 1
:y => 1
CC @nalimilan
Thank you for the discussion!
Do you and tk3369 think that the request is needed often enough to justify complicating stack API (given the example above how it can be achieved using current functionalities)
I find these types of transformations to be useful and would certainly appreciate the convenience of this added functionality within the stack
function. The approach you provide above is certainly sufficient, though perhaps less intuitive for new Julia and DataFrames.jl users. With this said, I'm unsure about the costs associated with adding complexity to the API, so it's difficult for me to provide a complete justification.
In terms of implementation, I fully agree with your proposal above (including the use of the Tuple as the second argument). It seems intuitive. I do not feel that the view
would be particularly necessary for most cases.
With this said, I'm unsure about the costs associated with adding complexity to the API
I have already thought about it - it is not that hard, so I think we can add it in 1.5.
Makes sense. I wonder whether the tidyverse's pivot_longer
handles this. From a quick that doesn't seem to be the case.
For me a more natural way to think about this is via nested columns (per example I have given) but if users prefer unnested columns we can add it.
I am closing this in favor of https://github.com/JuliaData/DataFrames.jl/issues/3237 (to have a single place to discuss all related issues)