DataFrames.jl
Add more keyword arguments to `stack` and `unstack`
On Slack, I wrote that I like the new `tidyr::pivot_` functions (https://tidyr.tidyverse.org/articles/pivot.html) because the names make it really obvious what they do, and they have good arguments for indicating what goes where (`names_from`, `values_from`, `names_to`, `values_to`). In R you can always name the arguments when you call the function, so I find it easy to read a pivot command later and tell what it does.
pivot_longer(relig_income, !religion, names_to = "income", values_to = "count")
== make the relig_income table longer by taking all columns but 'religion' and stacking them, putting the column names in a new column 'income' and the values in a new column 'count'.
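For comparison, a rough Julia counterpart of that call can already be written with the existing `stack` keywords. This is only a sketch: the small `relig_income` table below is a made-up stand-in for the tidyr dataset (column names and numbers are invented for illustration).

```julia
using DataFrames

# Toy stand-in for tidyr's relig_income table (values are made up).
relig_income = DataFrame(religion = ["Agnostic", "Atheist"],
                         low = [27, 12],
                         high = [34, 27])

# Stack every column except :religion. The former column names end up in
# :income and the corresponding values in :count.
long = stack(relig_income, Not(:religion);
             variable_name = :income, value_name = :count)
```

Here `variable_name` and `value_name` play the roles of `names_to` and `values_to`, while the columns to stack are still passed positionally.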
@pdeffebach suggested that I open an issue here to add more keyword arguments to `stack` and `unstack` to make DataFrames more user-friendly in the same way.
For reference, the current docstring for `stack` is here:
stack(df::AbstractDataFrame, [measure_vars], [id_vars];
variable_name=:variable, value_name=:value,
view::Bool=false, variable_eltype::Type=CategoricalValue{String})
Stack a data frame df, i.e. convert it from wide to long format.
Return the long-format DataFrame with: columns for each of the id_vars, column value_name
(:value by default) holding the values of the stacked columns (measure_vars), and column
variable_name (:variable by default) a vector holding the name of the corresponding
measure_vars variable.
If view=true then return a stacked view of a data frame (long format). The result is a view
because the columns are special AbstractVectors that return views into the original data
frame.
Arguments
≡≡≡≡≡≡≡≡≡≡≡
• df : the AbstractDataFrame to be stacked
• measure_vars : the columns to be stacked (the measurement variables), as a column
selector (Symbol, string or integer; :, All, Between, Not, a regular expression,
or a vector of Symbols, strings or integers). If neither measure_vars nor id_vars
are given, measure_vars defaults to all floating point columns.
• id_vars : the identifier columns that are repeated during stacking, as a column
selector (Symbol, string or integer; :, All, Between, Not, a regular expression,
or a vector of Symbols, strings or integers). Defaults to all variables that are
not measure_vars
• variable_name : the name (Symbol or string) of the new stacked column that shall
hold the names of each of measure_vars
• value_name : the name (Symbol or string) of the new stacked column containing the
values from each of measure_vars
• view : whether the stacked data frame should be a view rather than contain freshly
allocated vectors.
• variable_eltype : determines the element type of column variable_name. By default
a categorical vector of strings is created. If variable_eltype=Symbol it is a
vector of Symbol, and if variable_eltype=String a vector of String is produced.
Examples
≡≡≡≡≡≡≡≡≡≡
d1 = DataFrame(a = repeat([1:3;], inner = [4]),
b = repeat([1:4;], inner = [3]),
c = randn(12),
d = randn(12),
e = map(string, 'a':'l'))
d1s = stack(d1, [:c, :d])
d1s2 = stack(d1, [:c, :d], [:a])
d1m = stack(d1, Not([:a, :b, :e]))
d1s_name = stack(d1, Not([:a, :b, :e]), variable_name=:somemeasure)
The only difference is that in R `measure_vars` and `id_vars` are keyword arguments. Or, more specifically, in R all positional arguments can be referred to as keyword arguments. This is one of those scenarios where the order of the positional arguments is intuitive. Plus the fact that both `measure_vars` and `id_vars` are optional confuses me. Are they both optional? You can't specify `id_vars` without `measure_vars`, for example.
Perhaps we can make `measure_vars` and `id_vars` keyword arguments. This would be breaking, of course.
Thank you for raising this issue!
Could you please specify exactly what API you propose? The process would then be to weigh whether the benefits of the change outweigh the cost of it being breaking (in general we do not want to be breaking even if something is not ideal; this is hard, but if we do not want to lose users this is what we have to do).
However, maybe you can propose a non-breaking alternative (e.g. kwargs only in some special case of positional args - and this would be non breaking then - just allowing a dual form, which is not ideal but a second best solution). But first I need to understand exactly what you propose (again - unfortunately it is crucial to be super precise here as details matter).
The proposed API is to simply make `id_vars` and `measure_vars` keyword arguments.
I would also like to do something similar with `unstack`. Here is the documentation:
unstack(df::AbstractDataFrame, rowkeys, colkey, value; renamecols::Function=identity)
unstack(df::AbstractDataFrame, colkey, value; renamecols::Function=identity)
unstack(df::AbstractDataFrame; renamecols::Function=identity)
Unstack data frame df, i.e. convert it from long to wide format.
If colkey contains missing values then they will be skipped and a warning
will be printed.
If combination of rowkeys and colkey contains duplicate entries then last
value will be retained and a warning will be printed.
Arguments
≡≡≡≡≡≡≡≡≡≡≡
• df : the AbstractDataFrame to be unstacked
• rowkeys : the columns with a unique key for each row, if not
given, find a key by grouping on anything not a colkey or value.
Can be any column selector (Symbol, string or integer; :, All,
Between, Not, a regular expression, or a vector of Symbols,
strings or integers).
• colkey : the column (Symbol, string or integer) holding the column
names in wide format, defaults to :variable
• value : the value column (Symbol, string or integer), defaults to
:value
• renamecols : a function called on each unique value in colkey
which must return the name of the column to be created (typically
as a string or a Symbol). Duplicate names are not allowed.
Examples
≡≡≡≡≡≡≡≡≡≡
wide = DataFrame(id = 1:12,
a = repeat([1:3;], inner = [4]),
b = repeat([1:4;], inner = [3]),
c = randn(12),
d = randn(12))
long = stack(wide)
wide0 = unstack(long)
wide1 = unstack(long, :variable, :value)
wide2 = unstack(long, :id, :variable, :value)
wide3 = unstack(long, [:id, :a], :variable, :value)
wide4 = unstack(long, :id, :variable, :value, renamecols=x->Symbol(:_, x))
Note that there are some differences between the widened results above.
In general, the terminology doesn't match between `stack` and `unstack`. What is the mapping from `id_var` to `rowkey`?
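Going by the two docstrings above, the column passed as `id_vars` to `stack` seems to be the one that is passed as `rowkeys` to `unstack` on the way back. A minimal round-trip sketch (the toy data is invented for illustration):

```julia
using DataFrames

wide = DataFrame(id = 1:3, c = [1.0, 2.0, 3.0], d = [4.0, 5.0, 6.0])

# stack: :id is the id_vars column, repeated once per stacked column.
long = stack(wide, [:c, :d], [:id])

# unstack: the same :id column is now passed as rowkeys,
# :variable as colkey and :value as value.
back = unstack(long, :id, :variable, :value)
```

So in this small case `id_vars` and `rowkeys` name the same column, and `variable_name`/`value_name` correspond to `colkey`/`value`.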
There is no need to copy current docstrings, I think: they are easy enough to check and easily get outdated (e.g. your docstring for `stack` does not match what we currently have on master).
I understand your proposal for stack is to have the following method signatures:
stack(df::AbstractDataFrame;
measure_vars = findall(col -> eltype(col) <: Union{AbstractFloat, Missing}, eachcol(df)),
id_vars = Not(measure_vars),
variable_name::SymbolOrString=:variable,
value_name::SymbolOrString=:value, view::Bool=false,
variable_eltype::Type=String)
stack(df::AbstractDataFrame,
measure_vars, id_vars = Not(measure_vars);
variable_name::SymbolOrString=:variable,
value_name::SymbolOrString=:value, view::Bool=false,
variable_eltype::Type=String)
Is this correct?
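To illustrate how such a dual form could stay non-breaking, here is a sketch of a keyword-only front end that forwards to the existing positional method. `stack_kw` is a hypothetical name used only for this example and is not part of DataFrames.jl. Unlike the signature above, it makes `measure_vars` a required keyword rather than defaulting to the floating point columns.

```julia
using DataFrames

# Hypothetical wrapper, not part of DataFrames.jl: keyword-only front end
# that forwards to the existing positional method of `stack`.
function stack_kw(df::AbstractDataFrame;
                  measure_vars,                 # required keyword in this sketch
                  id_vars = Not(measure_vars),  # same default as the positional form
                  kwargs...)
    return stack(df, measure_vars, id_vars; kwargs...)
end

df = DataFrame(a = [1, 2], c = [0.1, 0.2], d = [0.3, 0.4])

# Keyword form; the existing positional calls keep working unchanged.
long = stack_kw(df; measure_vars = [:c, :d], value_name = :measurement)
```

Because the positional methods are untouched, existing code is unaffected; the cost is having two ways to spell the same call.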
Also, what would be your proposal for `unstack`? Note in particular that we can freely change argument names in `unstack`, as they are positional, so we can name them as we like (the same with `stack`), and changing them is non-breaking.
The current signatures for `unstack` are:
unstack(df::AbstractDataFrame, rowkey::ColumnIndex, colkey::ColumnIndex,
value::ColumnIndex; renamecols::Function=identity)
unstack(df::AbstractDataFrame, rowkeys, colkey::ColumnIndex,
value::ColumnIndex; renamecols::Function=identity)
unstack(df::AbstractDataFrame, colkey::ColumnIndex, value::ColumnIndex;
renamecols::Function=identity)
unstack(df::AbstractDataFrame; renamecols::Function=identity)
and the question is what signatures you would like to have.
In general it seems that we can resolve this issue in a non breaking way.
We could also consider adding a kwarg to specify the sentinel for missing interactions (now it has to be `missing`).
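As a sketch of the behaviour being discussed: when a row key/column key combination is absent from the long table, `unstack` fills that cell with `missing`, and a different sentinel has to be applied afterwards, for example with `coalesce`. The toy data and the choice of `0.0` as the sentinel are assumptions made for illustration.

```julia
using DataFrames

# id = 2 has no "b" entry, so that cell has no value to unstack.
long = DataFrame(id = [1, 1, 2],
                 variable = ["a", "b", "a"],
                 value = [1.0, 2.0, 3.0])

wide = unstack(long, :id, :variable, :value)

# Post-hoc workaround: replace the missing cell with a chosen sentinel.
filled = copy(wide)
filled.b = coalesce.(filled.b, 0.0)
```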
#2743 is trying to address this issue, by simplifying the function signature.
I am closing this in favor of https://github.com/JuliaData/DataFrames.jl/issues/3237 (to have a single place to discuss all related issues)