DataFrames.jl icon indicating copy to clipboard operation
DataFrames.jl copied to clipboard

Stack/Melt over multiple sets of variables

Open pmarg opened this issue 5 years ago • 2 comments

It would be a useful feature if stack and melt could be implemented over multiple sets of variables. For example from this:

2×5 DataFrame
│ Row │ ID    │ varA2018 │ varA2019 │ varB2018 │ varB2019 │
│     │ Int64 │ Int64    │ Int64    │ Int64    │ Int64    │
├─────┼───────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 1     │ 1        │ 1        │ 1        │ 1        │
│ 2   │ 2     │ 2        │ 2        │ 2        │ 2        │

To this:

4×4 DataFrame
│ Row │ ID    │ Year  │ varA  │ varB  │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2018  │ 1     │ 1     │
│ 2   │ 1     │ 2019  │ 1     │ 1     │
│ 3   │ 2     │ 2018  │ 2     │ 2     │
│ 4   │ 2     │ 2019  │ 2     │ 2     │

This is implemented in R and Stata via pivot_longer and reshape long respectively.

@bkamins suggested the following for a simpler name:

df = DataFrame(ID=1:2, A2018=1:2, A2019=1:2, B2018=1:2, B2019=1:2)
df2 = stack(df, 2:5)
df2.colkey = first.(String.(df2.variable), 1)
df2.Year = parse.(Int, chop.(String.(df2.variable), head=1, tail=0))
unstack(df2, [:ID, :Year], :colkey, :value)

For a more general application, users need to rename the columns with a specific pattern for the function to split the name and the value such as varA_2018.

One idea is to make use of Regex e.g. :

m=match(r"(?<Name>\w+)_(?<Value>\d+)",colname)

The function would melt the DataFrame as many times as the specified set of columns over the specified ID (in the above example 2), rename the column of the value to m[1] (in this case :varA and :varB), replace the name of the variable to :Year and the values of the column :Year to m[2] and finally merge the intermediate DataFrames over :ID and :Year.

df_longer = melt(df, :ID, [:varA2018, :varA2019], [:varB2018, :varB2019], variable_name =:Year)

The second method is to specify the name of the new variable as an argument in melt :

df_longer = metl(df, :ID, [:VarA, :VarB],variable_name =:Year)

Using a regular expression the function will melt the DataFrame using the columns that start with "varA" and "varB" respectively, rename and replace the value as above and then merge the DataFrames over :ID and :Year.

I am not very good at writing functions for packages, especially regular expressions, but if you need any help to implement this I would be glad to help the best that I can.

pmarg avatar Jun 08 '19 09:06 pmarg

I have written for myself a function that melts over multiple variables but I am not very good at optimizing code.

function pivot_longer(df::AbstractDataFrame,bycol::Symbol,cols::Vector{Symbol};period = :Period)
  temp = nothing
  df_longer = nothing
  m = nothing
  @inbounds for col ∈ cols
    temp = select(df,Regex("[$(col)]"))
    @inbounds for colname ∈ names(temp)
      m = match(r"(?<Name>\w+)_(?<Value>\d+)",String(colname))
      rename!(temp,colname=>Symbol(m[2]))
    end
    temp[!,bycol] = df[!,bycol]
    temp = melt(temp,bycol)
    rename!(temp,:variable => period, :value => Symbol(m[1]))
    if df_longer == nothing
      df_longer = temp
    else
      df_longer = join(df_longer, temp, on = [bycol,period], kind = :outer)
    end
  end
  return df_longer
end

Example:

df = DataFrame(A_2018=1:4, A_2019=5:8, B_2017=9:12, B_2018=9:12, B_2019 = [missing,13,14,15], ID = [1,2,3,4])

4×6 DataFrame
│ Row │ A_2018 │ A_2019 │ B_2017 │ B_2018 │ B_2019  │ ID    │
│     │ Int64  │ Int64  │ Int64  │ Int64  │ Int64⍰  │ Int64 │
├─────┼────────┼────────┼────────┼────────┼─────────┼───────┤
│ 1   │ 1      │ 5      │ 9      │ 9      │ missing │ 1     │
│ 2   │ 2      │ 6      │ 10     │ 10     │ 13      │ 2     │
│ 3   │ 3      │ 7      │ 11     │ 11     │ 14      │ 3     │
│ 4   │ 4      │ 8      │ 12     │ 12     │ 15      │ 4     │


df_longer = pivot_longer(df,:ID,[:A,:B],period = :Year)

12×4 DataFrame
│ Row │ Year   │ A       │ ID    │ B       │
│     │ Symbol │ Int64⍰  │ Int64 │ Int64⍰  │
├─────┼────────┼─────────┼───────┼─────────┤
│ 1   │ 2018   │ 1       │ 1     │ 9       │
│ 2   │ 2018   │ 2       │ 2     │ 10      │
│ 3   │ 2018   │ 3       │ 3     │ 11      │
│ 4   │ 2018   │ 4       │ 4     │ 12      │
│ 5   │ 2019   │ 5       │ 1     │ missing │
│ 6   │ 2019   │ 6       │ 2     │ 13      │
│ 7   │ 2019   │ 7       │ 3     │ 14      │
│ 8   │ 2019   │ 8       │ 4     │ 15      │
│ 9   │ 2017   │ missing │ 1     │ 9       │
│ 10  │ 2017   │ missing │ 2     │ 10      │
│ 11  │ 2017   │ missing │ 3     │ 11      │
│ 12  │ 2017   │ missing │ 4     │ 12      │

The functionality is similar to Stata's reshape long. The columns of the original DataFrame need to be renamed to a specific pattern (in this case Variable_Period) and the function uses Regex to select the columns that match the column names by group (e.g. A or B). Then it melts over the ID variable for each group and renames the columns. Finally the melted DataFrames are merged over ID and Period.

I can try to write some test, documentation and improve the readability of the function before I submit a PR but feel free to use it as a clarification of my feature request in case you are already working on a similar function.

pmarg avatar Jul 28 '19 15:07 pmarg

@nalimilan recently mentioned an intention to work on stackdf/meltdf so maybe he has done some work in this area, so let us wait for his reaction.

Regarding your code @inbounds is not needed I think (this part of code is not performance critical anyway) and I would rather use user specified matcher function taking a column name and splitting it to two parts with some sensible default.

bkamins avatar Jul 28 '19 17:07 bkamins

I am closing this in favor of https://github.com/JuliaData/DataFrames.jl/issues/3237 (to have a single place to discuss all related issues)

bkamins avatar Dec 05 '22 11:12 bkamins