DataFrames.jl
DataFrames.jl copied to clipboard
Feature request: unstack multiple :values columns
in Pandas:
df = pd.DataFrame({
'paddockId': [0, 0, 1, 1, 2, 2],
'color': ['red', 'blue', 'red', 'blue', 'red', 'blue'],
'count': [3, 4, 3, 4, 3, 4],
'weight': [0.2, 0.3, 0.2, 0.3, 0.2, 0.2]
})
paddockId | color | count | weight | |
---|---|---|---|---|
0 | 0 | red | 3 | 0.2 |
1 | 0 | blue | 4 | 0.3 |
2 | 1 | red | 3 | 0.2 |
3 | 1 | blue | 4 | 0.3 |
4 | 2 | red | 3 | 0.2 |
5 | 2 | blue | 4 | 0.2 |
Given this, you can unstack on multiple value columns by just passing e.g. df.set_index(['paddockId', 'color']).unstack('color')
count | weight | |||
---|---|---|---|---|
paddockId | red | blue | red | blue |
0 | 4 | 3 | 0.3 | 0.2 |
1 | 4 | 3 | 0.3 | 0.2 |
2 | 4 | 3 | 0.2 | 0.2 |
The equivalent operation seems pretty hard in DataFrames.jl at the moment (unstack n times then join the results?). API-wise, it would be good if the values
argument could take a vector of values. Result formatting could be difficult, this particular operation is one that makes MultiIndexes shine... I guess renameCols could become a function of (colKey value, valueColumnName)
(so in this instance, (color, valueCol) -> ...
e.g. ('red', :count') -> ....
?
naïve implementation:
function DataFrames.unstack(df::DataFrame, rowKeys::AbstractVector{Symbol}, colKey::Symbol, valueCols::AbstractVector{Symbol}, renameCols)
unstacked = [
unstack(df, rowKeys, colKey, valueCol, renamecols=(c) -> renameCols(c, valueCol))
for valueCol in valueCols
]
return join(unstacked..., on=rowKeys)
end
This is a duplicate of https://github.com/JuliaData/DataFrames.jl/issues/2148, so I am closing this issue. Please comment there if you find something more to add. Thank you for reporting this.
I don't think it's a duplicate - I read that issue as wanting to pass multiple columns as colKey
. Here I want to pass multiple columns to value
, which is different (unless I am missing some deep symmetry to this operation).
OK - you are right.
+1, I've wanted to do this many times
in Pandas:
df = pd.DataFrame({ 'paddockId': [0, 0, 1, 1, 2, 2], 'color': ['red', 'blue', 'red', 'blue', 'red', 'blue'], 'count': [3, 4, 3, 4, 3, 4], 'weight': [0.2, 0.3, 0.2, 0.3, 0.2, 0.2] })
paddockId color count weight 0 0 red 3 0.2 1 0 blue 4 0.3 2 1 red 3 0.2 3 1 blue 4 0.3 4 2 red 3 0.2 5 2 blue 4 0.2 Given this, you can unstack on multiple value columns by just passing e.g.
df.set_index(['paddockId', 'color']).unstack('color')
count weight paddockId red blue red blue 0 4 3 0.3 0.2 1 4 3 0.3 0.2 2 4 3 0.2 0.2 The equivalent operation seems pretty hard in DataFrames.jl at the moment (unstack n times then join the results?). API-wise, it would be good if the
values
argument could take a vector of values. Result formatting could be difficult, this particular operation is one that makes MultiIndexes shine... I guess renameCols could become a function of(colKey value, valueColumnName)
(so in this instance,(color, valueCol) -> ...
e.g.('red', :count') -> ....
?
#2743 is trying to give a solution for some of similar problems.
This would be very useful. Has there been progress on this in the meantime?
No. But let us start with defining what we want exactly on a working example. Is this what you want:
julia> df = DataFrame(row=[1,1,2,2], col=[1,2,1,2], value1=1:4, value2=11:14)
4×4 DataFrame
Row │ row col value1 value2
│ Int64 Int64 Int64 Int64
─────┼──────────────────────────────
1 │ 1 1 1 11
2 │ 1 2 2 12
3 │ 2 1 3 13
4 │ 2 2 4 14
julia> unstack(select(df, :row, :col, AsTable(r"v") => Tables.rowtable => :value), :row, :col, :value)
2×3 DataFrame
Row │ row 1 2
│ Int64 NamedTup…? NamedTup…?
─────┼─────────────────────────────────────────────────────────────
1 │ 1 (value1 = 1, value2 = 11) (value1 = 2, value2 = 12)
2 │ 2 (value1 = 3, value2 = 13) (value1 = 4, value2 = 14)
Almost. Instead of taking the data from a "stacked column" ((:col, :value)
from your inner select
), I would like to use several "unstacked columns" (:count
and :weight
). Using the opening example, I was hoping for unstack(df, :color, Cols(:count, :weight))
to yield df4
as in
julia> df = DataFrame(
paddockId = [0, 0, 1, 1, 2, 2],
color = repeat([:red, :blue], 3),
count = repeat([3, 4], 3),
weight = [0.2, 0.3, 0.2, 0.3, 0.2, 0.2],
)
6×4 DataFrame
Row │ paddockId color count weight
│ Int64 Symbol Int64 Float64
─────┼───────────────────────────────────
1 │ 0 red 3 0.2
2 │ 0 blue 4 0.3
3 │ 1 red 3 0.2
4 │ 1 blue 4 0.3
5 │ 2 red 3 0.2
6 │ 2 blue 4 0.2
julia> df2 = select(
df,
:paddockId,
:color,
AsTable([:count, :weight]) => Tables.rowtable => :value,
)
6×3 DataFrame
Row │ paddockId color value
│ Int64 Symbol NamedTup…
─────┼──────────────────────────────────────────────
1 │ 0 red (count = 3, weight = 0.2)
2 │ 0 blue (count = 4, weight = 0.3)
3 │ 1 red (count = 3, weight = 0.2)
4 │ 1 blue (count = 4, weight = 0.3)
5 │ 2 red (count = 3, weight = 0.2)
6 │ 2 blue (count = 4, weight = 0.2)
julia> df3 = unstack(df2, :color, :value)
3×3 DataFrame
Row │ paddockId red blue
│ Int64 NamedTup…? NamedTup…?
─────┼─────────────────────────────────────────────────────────────────
1 │ 0 (count = 3, weight = 0.2) (count = 4, weight = 0.3)
2 │ 1 (count = 3, weight = 0.2) (count = 4, weight = 0.3)
3 │ 2 (count = 3, weight = 0.2) (count = 4, weight = 0.2)
julia> df4 = select(df3,
Not([:red, :blue]),
:red => [:red_count, :red_weight],
:blue => [:blue_count, :blue_weight],
)
3×5 DataFrame
Row │ paddockId red_count red_weight blue_count blue_weight
│ Int64 Int64 Float64 Int64 Float64
─────┼───────────────────────────────────────────────────────────
1 │ 0 3 0.2 4 0.3
2 │ 1 3 0.2 4 0.3
3 │ 2 3 0.2 4 0.2
modulo the order of the columns, maybe. I'll describe that in the next comment to allow people to give separate feedback.
Ideally, I would say:
- the new columns
[red|blue]_*
should take the position of the now removedcolor
column. -
*_[count|weight]
should be ordered as requested (i.e. as inAsTable([:count, :weight])
) or as they are ordered indf
(if the selector does not impose an order). - It would probably be best to configure the order
red_*
,blue_*
via a keyword argument. Sensible options include "order of first appearance" (default), lexicographically (forString
,Symbol
,Real
,Complex
, andTuple
) and as in a givenVector
containing (a superset of) the unique values ofcolor
.
All but the default for (3.) could be implemented later, though.
Having some sort of "multi-level headers" oder indices would be nice, but I don't know what would be a nice user interface for that. Simply concatenating the column names would suffice for me (I would write the data to CSV and format the table header directly in LaTeX, I guess). Maybe this could be added on top later.
"multi-level headers" are not possible to be supported in any near future (as opposed to pandas). We need to generate column names.
I was not rushing with the implementation of this request because we need to make the following design decisions. In general users want:
- multiple value columns
- multiple column key columns
- multiple row key columns (we already have them and this is not problematic)
This requires us to decide on:
- how do we name columns when unstack is on multiple columns and multiple values
- handling of
renamecols
if multiple key columns are requested - handling of
allowmissing
if multiple key columns are requested - handling of
allowduplicates
if multiple key columns are requested - handling of
fill
kwarg if multiple value columns are requested - handling of
valuestransform
kwarg if multiple value columns are requested
and all these decisions need to be made before making any changes to make sure we will not have to make breaking changes later.
Example - unstack multiple value columns (Please remove if not helpful)
DF = DataFrame( A=[1,2,2,3], B=[10,10,10,20], C=[missing,1,1,missing], D=['W','X','Y','Z'], E=['w','x','y','z'] )
unstack( DF, :A, :B, [:C,:D,:E] ,allowduplicates=true)
should work like
hcat(
unstack( DF, :A, :B, :C ,allowduplicates=true),
select!( unstack( DF, :A, :B, :D ,allowduplicates=true), Not(:A) ),
select!( unstack( DF, :A, :B, :E ,allowduplicates=true), Not(:A) ),
makeunique=true
)
producing
Row │ A 10 20 10_1 20_1 10_2 20_2
─────┼──────────────────────────────
1 │ 1 missing missing W missing w missing
2 │ 2 1 missing Y missing y missing
3 │ 3 missing missing missing Z missing z
Though as mentioned by others the column suffix should be given by the value field giving column names
Row│ A 10_C 20_C 10_D 20_D 10_E 20_E
I think the duplicate handling for the single value field implementation should work equally well for multiple value fields
I am closing this in favor of https://github.com/JuliaData/DataFrames.jl/issues/3237 (to have a single place to discuss all related issues)