DataFrames.jl Feature request: unstack multiple :values columns

in Pandas:

df = pd.DataFrame({
    'paddockId': [0, 0, 1, 1, 2, 2],
    'color': ['red', 'blue', 'red', 'blue', 'red', 'blue'],
    'count': [3, 4, 3, 4, 3, 4],
    'weight': [0.2, 0.3, 0.2, 0.3, 0.2, 0.2]
})

	paddockId	color	count	weight
0	0	red	3	0.2
1	0	blue	4	0.3
2	1	red	3	0.2
3	1	blue	4	0.3
4	2	red	3	0.2
5	2	blue	4	0.2

Given this, you can unstack on multiple value columns by just passing e.g. df.set_index(['paddockId', 'color']).unstack('color')

	count		weight
paddockId	red	blue	red	blue
0	4	3	0.3	0.2
1	4	3	0.3	0.2
2	4	3	0.2	0.2

The equivalent operation seems pretty hard in DataFrames.jl at the moment (unstack n times then join the results?). API-wise, it would be good if the values argument could take a vector of values. Result formatting could be difficult, this particular operation is one that makes MultiIndexes shine... I guess renameCols could become a function of (colKey value, valueColumnName) (so in this instance, (color, valueCol) -> ... e.g. ('red', :count') -> ....?

Apr 28 '20 08:04 akdor1154

naïve implementation:

function DataFrames.unstack(df::DataFrame, rowKeys::AbstractVector{Symbol}, colKey::Symbol, valueCols::AbstractVector{Symbol}, renameCols)
    unstacked = [
        unstack(df, rowKeys, colKey, valueCol, renamecols=(c) -> renameCols(c, valueCol))
        for valueCol in valueCols
    ]
    return join(unstacked..., on=rowKeys)
end

Apr 28 '20 09:04 akdor1154

This is a duplicate of https://github.com/JuliaData/DataFrames.jl/issues/2148, so I am closing this issue. Please comment there if you find something more to add. Thank you for reporting this.

Apr 28 '20 09:04 bkamins

I don't think it's a duplicate - I read that issue as wanting to pass multiple columns as colKey. Here I want to pass multiple columns to value, which is different (unless I am missing some deep symmetry to this operation).

Apr 28 '20 10:04 akdor1154

OK - you are right.

Apr 28 '20 11:04 bkamins

+1, I've wanted to do this many times

Oct 20 '20 17:10 lkapelevich

in Pandas:
df = pd.DataFrame({
    'paddockId': [0, 0, 1, 1, 2, 2],
    'color': ['red', 'blue', 'red', 'blue', 'red', 'blue'],
    'count': [3, 4, 3, 4, 3, 4],
    'weight': [0.2, 0.3, 0.2, 0.3, 0.2, 0.2]
})
paddockId color count weight 0 0 red 3 0.2 1 0 blue 4 0.3 2 1 red 3 0.2 3 1 blue 4 0.3 4 2 red 3 0.2 5 2 blue 4 0.2 Given this, you can unstack on multiple value columns by just passing e.g. df.set_index(['paddockId', 'color']).unstack('color')

count weight paddockId red blue red blue 0 4 3 0.3 0.2 1 4 3 0.3 0.2 2 4 3 0.2 0.2 The equivalent operation seems pretty hard in DataFrames.jl at the moment (unstack n times then join the results?). API-wise, it would be good if the values argument could take a vector of values. Result formatting could be difficult, this particular operation is one that makes MultiIndexes shine... I guess renameCols could become a function of (colKey value, valueColumnName) (so in this instance, (color, valueCol) -> ... e.g. ('red', :count') -> ....?

#2743 is trying to give a solution for some of similar problems.

May 01 '21 23:05 sl-solution

This would be very useful. Has there been progress on this in the meantime?

Feb 28 '22 10:02 jonas-schulze

No. But let us start with defining what we want exactly on a working example. Is this what you want:

julia> df = DataFrame(row=[1,1,2,2], col=[1,2,1,2], value1=1:4, value2=11:14)
4×4 DataFrame
 Row │ row    col    value1  value2
     │ Int64  Int64  Int64   Int64
─────┼──────────────────────────────
   1 │     1      1       1      11
   2 │     1      2       2      12
   3 │     2      1       3      13
   4 │     2      2       4      14

julia> unstack(select(df, :row, :col, AsTable(r"v") => Tables.rowtable => :value), :row, :col, :value)
2×3 DataFrame
 Row │ row    1                          2
     │ Int64  NamedTup…?                 NamedTup…?
─────┼─────────────────────────────────────────────────────────────
   1 │     1  (value1 = 1, value2 = 11)  (value1 = 2, value2 = 12)
   2 │     2  (value1 = 3, value2 = 13)  (value1 = 4, value2 = 14)

Feb 28 '22 11:02 bkamins

Almost. Instead of taking the data from a "stacked column" ((:col, :value) from your inner select), I would like to use several "unstacked columns" (:count and :weight). Using the opening example, I was hoping for unstack(df, :color, Cols(:count, :weight)) to yield df4 as in


julia> df = DataFrame(
           paddockId = [0, 0, 1, 1, 2, 2],
           color = repeat([:red, :blue], 3),
           count = repeat([3, 4], 3),
           weight = [0.2, 0.3, 0.2, 0.3, 0.2, 0.2],
       )
6×4 DataFrame
 Row │ paddockId  color   count  weight  
     │ Int64      Symbol  Int64  Float64 
─────┼───────────────────────────────────
   1 │         0  red         3      0.2
   2 │         0  blue        4      0.3
   3 │         1  red         3      0.2
   4 │         1  blue        4      0.3
   5 │         2  red         3      0.2
   6 │         2  blue        4      0.2

julia> df2 = select(
           df,
           :paddockId,
           :color,
           AsTable([:count, :weight]) => Tables.rowtable => :value,
       )
6×3 DataFrame
 Row │ paddockId  color   value                     
     │ Int64      Symbol  NamedTup…                 
─────┼──────────────────────────────────────────────
   1 │         0  red     (count = 3, weight = 0.2)
   2 │         0  blue    (count = 4, weight = 0.3)
   3 │         1  red     (count = 3, weight = 0.2)
   4 │         1  blue    (count = 4, weight = 0.3)
   5 │         2  red     (count = 3, weight = 0.2)
   6 │         2  blue    (count = 4, weight = 0.2)

julia> df3 = unstack(df2, :color, :value)
3×3 DataFrame
 Row │ paddockId  red                        blue                      
     │ Int64      NamedTup…?                 NamedTup…?                
─────┼─────────────────────────────────────────────────────────────────
   1 │         0  (count = 3, weight = 0.2)  (count = 4, weight = 0.3)
   2 │         1  (count = 3, weight = 0.2)  (count = 4, weight = 0.3)
   3 │         2  (count = 3, weight = 0.2)  (count = 4, weight = 0.2)

julia> df4 = select(df3,
           Not([:red, :blue]),
           :red => [:red_count, :red_weight],
           :blue => [:blue_count, :blue_weight],
       )
3×5 DataFrame
 Row │ paddockId  red_count  red_weight  blue_count  blue_weight 
     │ Int64      Int64      Float64     Int64       Float64     
─────┼───────────────────────────────────────────────────────────
   1 │         0          3         0.2           4          0.3
   2 │         1          3         0.2           4          0.3
   3 │         2          3         0.2           4          0.2

modulo the order of the columns, maybe. I'll describe that in the next comment to allow people to give separate feedback.

Feb 28 '22 14:02 jonas-schulze

Ideally, I would say:

the new columns [red|blue]_* should take the position of the now removed color column.
*_[count|weight] should be ordered as requested (i.e. as in AsTable([:count, :weight])) or as they are ordered in df (if the selector does not impose an order).
It would probably be best to configure the order red_*, blue_* via a keyword argument. Sensible options include "order of first appearance" (default), lexicographically (for String, Symbol, Real, Complex, and Tuple) and as in a given Vector containing (a superset of) the unique values of color.

All but the default for (3.) could be implemented later, though.

Feb 28 '22 14:02 jonas-schulze

Having some sort of "multi-level headers" oder indices would be nice, but I don't know what would be a nice user interface for that. Simply concatenating the column names would suffice for me (I would write the data to CSV and format the table header directly in LaTeX, I guess). Maybe this could be added on top later.

Feb 28 '22 14:02 jonas-schulze

"multi-level headers" are not possible to be supported in any near future (as opposed to pandas). We need to generate column names.

I was not rushing with the implementation of this request because we need to make the following design decisions. In general users want:

multiple value columns
multiple column key columns
multiple row key columns (we already have them and this is not problematic)

This requires us to decide on:

how do we name columns when unstack is on multiple columns and multiple values
handling of renamecols if multiple key columns are requested
handling of allowmissing if multiple key columns are requested
handling of allowduplicates if multiple key columns are requested
handling of fill kwarg if multiple value columns are requested
handling of valuestransform kwarg if multiple value columns are requested

and all these decisions need to be made before making any changes to make sure we will not have to make breaking changes later.

Feb 28 '22 15:02 bkamins

Example - unstack multiple value columns (Please remove if not helpful)

DF = DataFrame( A=[1,2,2,3], B=[10,10,10,20], C=[missing,1,1,missing], D=['W','X','Y','Z'], E=['w','x','y','z'] )

unstack( DF, :A, :B, [:C,:D,:E] ,allowduplicates=true)

should work like

hcat( 
                unstack( DF, :A, :B, :C  ,allowduplicates=true),
    select!( unstack( DF, :A, :B, :D  ,allowduplicates=true), Not(:A) ),       
    select!( unstack( DF, :A, :B, :E  ,allowduplicates=true), Not(:A) ),
    makeunique=true
)

producing

Row │ A      10       20       10_1     20_1     10_2     20_2
─────┼──────────────────────────────
   1 │     1  missing  missing  W        missing  w        missing
   2 │     2        1  missing  Y        missing  y        missing
   3 │     3  missing  missing  missing  Z        missing  z

Though as mentioned by others the column suffix should be given by the value field giving column names

Row│ A 10_C 20_C 10_D 20_D 10_E 20_E

I think the duplicate handling for the single value field implementation should work equally well for multiple value fields

Apr 14 '22 01:04 Lincoln-Hannah

I am closing this in favor of https://github.com/JuliaData/DataFrames.jl/issues/3237 (to have a single place to discuss all related issues)

Dec 05 '22 11:12 bkamins

DataFrames.jl DataFrames.jl copied to clipboard

Feature request: unstack multiple :values columns

DataFrames.jl
DataFrames.jl copied to clipboard