DataFrames.jl
DataFrames.jl copied to clipboard
Stack/Melt over multiple sets of variables
It would be a useful feature if stack
and melt
could be implemented over multiple sets of variables. For example from this:
2×5 DataFrame
│ Row │ ID │ varA2018 │ varA2019 │ varB2018 │ varB2019 │
│ │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼──────────┼──────────┼──────────┼──────────┤
│ 1 │ 1 │ 1 │ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │ 2 │ 2 │ 2 │
To this:
4×4 DataFrame
│ Row │ ID │ Year │ varA │ varB │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2018 │ 1 │ 1 │
│ 2 │ 1 │ 2019 │ 1 │ 1 │
│ 3 │ 2 │ 2018 │ 2 │ 2 │
│ 4 │ 2 │ 2019 │ 2 │ 2 │
This is implemented in R and Stata via pivot_longer
and reshape long
respectively.
@bkamins suggested the following for a simpler name:
df = DataFrame(ID=1:2, A2018=1:2, A2019=1:2, B2018=1:2, B2019=1:2)
df2 = stack(df, 2:5)
df2.colkey = first.(String.(df2.variable), 1)
df2.Year = parse.(Int, chop.(String.(df2.variable), head=1, tail=0))
unstack(df2, [:ID, :Year], :colkey, :value)
For a more general application, users need to rename the columns with a specific pattern for the function to split the name and the value such as varA_2018
.
One idea is to make use of Regex e.g. :
m=match(r"(?<Name>\w+)_(?<Value>\d+)",colname)
The function would melt
the DataFrame as many times as the specified set of columns over the specified ID (in the above example 2), rename the column of the value to m[1]
(in this case :varA
and :varB
), replace the name of the variable to :Year
and the values of the column :Year
to m[2]
and finally merge the intermediate DataFrames over :ID
and :Year
.
df_longer = melt(df, :ID, [:varA2018, :varA2019], [:varB2018, :varB2019], variable_name =:Year)
The second method is to specify the name of the new variable as an argument in melt
:
df_longer = metl(df, :ID, [:VarA, :VarB],variable_name =:Year)
Using a regular expression the function will melt
the DataFrame using the columns that start with "varA" and "varB" respectively, rename and replace the value as above and then merge the DataFrames over :ID
and :Year
.
I am not very good at writing functions for packages, especially regular expressions, but if you need any help to implement this I would be glad to help the best that I can.
I have written for myself a function that melts over multiple variables but I am not very good at optimizing code.
function pivot_longer(df::AbstractDataFrame,bycol::Symbol,cols::Vector{Symbol};period = :Period)
temp = nothing
df_longer = nothing
m = nothing
@inbounds for col ∈ cols
temp = select(df,Regex("[$(col)]"))
@inbounds for colname ∈ names(temp)
m = match(r"(?<Name>\w+)_(?<Value>\d+)",String(colname))
rename!(temp,colname=>Symbol(m[2]))
end
temp[!,bycol] = df[!,bycol]
temp = melt(temp,bycol)
rename!(temp,:variable => period, :value => Symbol(m[1]))
if df_longer == nothing
df_longer = temp
else
df_longer = join(df_longer, temp, on = [bycol,period], kind = :outer)
end
end
return df_longer
end
Example:
df = DataFrame(A_2018=1:4, A_2019=5:8, B_2017=9:12, B_2018=9:12, B_2019 = [missing,13,14,15], ID = [1,2,3,4])
4×6 DataFrame
│ Row │ A_2018 │ A_2019 │ B_2017 │ B_2018 │ B_2019 │ ID │
│ │ Int64 │ Int64 │ Int64 │ Int64 │ Int64⍰ │ Int64 │
├─────┼────────┼────────┼────────┼────────┼─────────┼───────┤
│ 1 │ 1 │ 5 │ 9 │ 9 │ missing │ 1 │
│ 2 │ 2 │ 6 │ 10 │ 10 │ 13 │ 2 │
│ 3 │ 3 │ 7 │ 11 │ 11 │ 14 │ 3 │
│ 4 │ 4 │ 8 │ 12 │ 12 │ 15 │ 4 │
df_longer = pivot_longer(df,:ID,[:A,:B],period = :Year)
12×4 DataFrame
│ Row │ Year │ A │ ID │ B │
│ │ Symbol │ Int64⍰ │ Int64 │ Int64⍰ │
├─────┼────────┼─────────┼───────┼─────────┤
│ 1 │ 2018 │ 1 │ 1 │ 9 │
│ 2 │ 2018 │ 2 │ 2 │ 10 │
│ 3 │ 2018 │ 3 │ 3 │ 11 │
│ 4 │ 2018 │ 4 │ 4 │ 12 │
│ 5 │ 2019 │ 5 │ 1 │ missing │
│ 6 │ 2019 │ 6 │ 2 │ 13 │
│ 7 │ 2019 │ 7 │ 3 │ 14 │
│ 8 │ 2019 │ 8 │ 4 │ 15 │
│ 9 │ 2017 │ missing │ 1 │ 9 │
│ 10 │ 2017 │ missing │ 2 │ 10 │
│ 11 │ 2017 │ missing │ 3 │ 11 │
│ 12 │ 2017 │ missing │ 4 │ 12 │
The functionality is similar to Stata
's reshape long
. The columns of the original DataFrame
need to be renamed to a specific pattern (in this case Variable_Period
) and the function uses Regex
to select the columns that match the column names by group (e.g. A
or B
). Then it melts
over the ID variable for each group and renames the columns. Finally the melt
ed DataFrames
are merged over ID and Period.
I can try to write some test, documentation and improve the readability of the function before I submit a PR but feel free to use it as a clarification of my feature request in case you are already working on a similar function.
@nalimilan recently mentioned an intention to work on stackdf
/meltdf
so maybe he has done some work in this area, so let us wait for his reaction.
Regarding your code @inbounds
is not needed I think (this part of code is not performance critical anyway) and I would rather use user specified matcher function taking a column name and splitting it to two parts with some sensible default.
I am closing this in favor of https://github.com/JuliaData/DataFrames.jl/issues/3237 (to have a single place to discuss all related issues)