DataFrames.jl

Add more keyword arguments to `stack` and `unstack`

Open swt30 opened this issue 3 years ago • 6 comments

On Slack, I wrote that I like the new tidyr::pivot_ functions (https://tidyr.tidyverse.org/articles/pivot.html) because the names make it really obvious what they do, and they have good arguments for indicating what goes where (names_from, values_from, names_to, values_to). In R you can always name the arguments when you call the function, so I find it easy to read a pivot command later and tell what it does.

pivot_longer(relig_income, !religion, names_to = "income", values_to = "count")

== make the relig_income table longer by taking all columns but 'religion' and stacking them, putting the column names in a new column 'income' and the values in a new column 'count'.
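For comparison, the same reshaping can already be written with the existing `stack` keywords. This is only a sketch: the tiny `relig_income` table below is a made-up stand-in for the tidyr dataset, with invented values, and it assumes a DataFrames.jl version where `stack` accepts `variable_name` and `value_name` as in the docstring quoted in this thread.

```julia
using DataFrames

# Hypothetical miniature of tidyr's relig_income table (values invented for illustration).
relig_income = DataFrame(religion = ["Agnostic", "Atheist"],
                         low = [27, 12],
                         high = [34, 27])

# Stack every column except :religion; the old column names go to :income
# and the cell values go to :count.
long = stack(relig_income, Not(:religion);
             variable_name = :income, value_name = :count)
```

What tidyr calls `names_to` and `values_to` corresponds here to `variable_name` and `value_name`; the column selection itself stays positional, which is the part this issue is about.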

@pdeffebach suggested that I open an issue here to add more keyword arguments to stack and unstack to make DataFrames more user-friendly in the same way.

swt30 avatar Sep 09 '20 16:09 swt30

For reference, the current docstring for stack is here:

stack(df::AbstractDataFrame, [measure_vars], [id_vars];
        variable_name=:variable, value_name=:value,
        view::Bool=false, variable_eltype::Type=CategoricalValue{String})


  Stack a data frame df, i.e. convert it from wide to long format.

  Return the long-format DataFrame with: columns for each of the id_vars, column value_name
  (:value by default) holding the values of the stacked columns (measure_vars), and column
  variable_name (:variable by default) a vector holding the name of the corresponding
  measure_vars variable.

  If view=true then return a stacked view of a data frame (long format). The result is a view
  because the columns are special AbstractVectors that return views into the original data
  frame.

  Arguments
  ≡≡≡≡≡≡≡≡≡≡≡

    •    df : the AbstractDataFrame to be stacked

    •    measure_vars : the columns to be stacked (the measurement variables), as a column
        selector (Symbol, string or integer; :, All, Between, Not, a regular expression,
        or a vector of Symbols, strings or integers). If neither measure_vars nor id_vars
        are given, measure_vars defaults to all floating point columns.

    •    id_vars : the identifier columns that are repeated during stacking, as a column
        selector (Symbol, string or integer; :, All, Between, Not, a regular expression,
        or a vector of Symbols, strings or integers). Defaults to all variables that are
        not measure_vars

    •    variable_name : the name (Symbol or string) of the new stacked column that shall
        hold the names of each of measure_vars

    •    value_name : the name (Symbol or string) of the new stacked column containing the
        values from each of measure_vars

    •    view : whether the stacked data frame should be a view rather than contain freshly
        allocated vectors.

    •    variable_eltype : determines the element type of column variable_name. By default
        a categorical vector of strings is created. If variable_eltype=Symbol it is a
        vector of Symbol, and if variable_eltype=String a vector of String is produced.

  Examples
  ≡≡≡≡≡≡≡≡≡≡

  d1 = DataFrame(a = repeat([1:3;], inner = [4]),
                 b = repeat([1:4;], inner = [3]),
                 c = randn(12),
                 d = randn(12),
                 e = map(string, 'a':'l'))
  
  d1s = stack(d1, [:c, :d])
  d1s2 = stack(d1, [:c, :d], [:a])
  d1m = stack(d1, Not([:a, :b, :e]))
  d1s_name = stack(d1, Not([:a, :b, :e]), variable_name=:somemeasure)

The only difference is that in R measure_vars and id_vars are keyword arguments. Or, more specifically, in R all positional arguments can be referred to as keyword arguments. This is one of those scenarios where the order of the positional arguments is not intuitive. Plus, the fact that both measure_vars and id_vars are optional confuses me. Are they both optional? You can't specify id_vars without measure_vars, for example.

Perhaps we can make measure_vars and id_vars keyword arguments. This would be breaking, of course.

pdeffebach avatar Sep 09 '20 16:09 pdeffebach

Thank you for raising this issue!

Could you please specify exactly what API you propose? We would then weigh whether the benefits of the change outweigh the cost of it being breaking (in general we do not want to be breaking even if something is not ideal; this is hard, but if we do not want to lose users this is what we have to do).

However, maybe you can propose a non-breaking alternative (e.g. kwargs only in some special case of positional args - and this would be non breaking then - just allowing a dual form, which is not ideal but a second best solution). But first I need to understand exactly what you propose (again - unfortunately it is crucial to be super precise here as details matter).

bkamins avatar Sep 09 '20 18:09 bkamins

The proposed API is to simply make id_vars and measure_vars keyword arguments.
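As a rough illustration of what a keyword-based call could look like without changing `stack` itself, here is a minimal sketch. The helper name `stack_kw` is hypothetical and not part of DataFrames.jl; it only forwards keyword selectors to the existing positional method.

```julia
using DataFrames

# Hypothetical helper, not part of DataFrames.jl: it takes the column selectors
# as keywords and forwards them to the existing positional method of stack.
# id_vars defaults to every column that is not a measure variable.
stack_kw(df::AbstractDataFrame; measure_vars, id_vars = Not(measure_vars), kwargs...) =
    stack(df, measure_vars, id_vars; kwargs...)

df = DataFrame(id = 1:2, x = [1.0, 2.0], y = [3.0, 4.0])

# Every argument after the data frame is named at the call site.
long = stack_kw(df; measure_vars = [:x, :y], id_vars = :id, value_name = :measurement)
```

A wrapper like this is non-breaking because the positional methods stay untouched; whether the keywords belong on `stack` itself is the open question here.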

I would also like to do something similar with unstack. Here is the documentation:

  unstack(df::AbstractDataFrame, rowkeys, colkey, value; renamecols::Function=identity)
  unstack(df::AbstractDataFrame, colkey, value; renamecols::Function=identity)
  unstack(df::AbstractDataFrame; renamecols::Function=identity)


  Unstack data frame df, i.e. convert it from long to wide format.

  If colkey contains missing values then they will be skipped and a warning
  will be printed.

  If combination of rowkeys and colkey contains duplicate entries then last
  value will be retained and a warning will be printed.

  Arguments
  ≡≡≡≡≡≡≡≡≡≡≡

    •    df : the AbstractDataFrame to be unstacked

    •    rowkeys : the columns with a unique key for each row, if not
        given, find a key by grouping on anything not a colkey or value.
        Can be any column selector (Symbol, string or integer; :, All,
        Between, Not, a regular expression, or a vector of Symbols,
        strings or integers).

    •    colkey : the column (Symbol, string or integer) holding the column
        names in wide format, defaults to :variable

    •    value : the value column (Symbol, string or integer), defaults to
        :value

    •    renamecols : a function called on each unique value in colkey
        which must return the name of the column to be created (typically
        as a string or a Symbol). Duplicate names are not allowed.

  Examples
  ≡≡≡≡≡≡≡≡≡≡

  wide = DataFrame(id = 1:12,
                   a  = repeat([1:3;], inner = [4]),
                   b  = repeat([1:4;], inner = [3]),
                   c  = randn(12),
                   d  = randn(12))
  
  long = stack(wide)
  wide0 = unstack(long)
  wide1 = unstack(long, :variable, :value)
  wide2 = unstack(long, :id, :variable, :value)
  wide3 = unstack(long, [:id, :a], :variable, :value)
  wide4 = unstack(long, :id, :variable, :value, renamecols=x->Symbol(:_, x))


  Note that there are some differences between the widened results above.

In general, the terminology doesn't match between stack and unstack. What is the mapping from id_vars to rowkeys?
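A small round trip may make the correspondence concrete. This sketch uses only the positional methods quoted above; the table and its values are made up for illustration.

```julia
using DataFrames

wide = DataFrame(id = 1:3, c = [0.1, 0.2, 0.3], d = [1.0, 2.0, 3.0])

# In stack, :id is passed as id_vars: it is repeated for every stacked column.
long = stack(wide, [:c, :d], [:id])

# In unstack, the same column is passed as rowkeys, while the columns created by
# stack (:variable and :value by default) are passed as colkey and value.
wide_again = unstack(long, :id, :variable, :value)
```

In this example the column passed as id_vars when stacking is the one passed as rowkeys when unstacking, and variable_name / value_name line up with colkey / value.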

pdeffebach avatar Sep 09 '20 19:09 pdeffebach

There is no need to copy current docstrings I think - they are easy enough to check and easily get outdated (e.g. your docstring for stack does not match what we currently have on master).

I understand your proposal for stack is to have the following method signatures:

stack(df::AbstractDataFrame;
      measure_vars = findall(col -> eltype(col) <: Union{AbstractFloat, Missing}, eachcol(df)),
      id_vars = Not(measure_vars),
      variable_name::SymbolOrString=:variable,
      value_name::SymbolOrString=:value, view::Bool=false,
      variable_eltype::Type=String)
stack(df::AbstractDataFrame,
      measure_vars, id_vars = Not(measure_vars);
      variable_name::SymbolOrString=:variable,
      value_name::SymbolOrString=:value, view::Bool=false,
      variable_eltype::Type=String)

Is this correct?

Also then what would be your proposal for unstack? Note in particular that we can freely change argument names in unstack as they are positional so we can name them as we like (the same with stack) and changing it is non breaking.

The current signatures for unstack are:

unstack(df::AbstractDataFrame, rowkey::ColumnIndex, colkey::ColumnIndex,
    value::ColumnIndex; renamecols::Function=identity)
unstack(df::AbstractDataFrame, rowkeys, colkey::ColumnIndex,
    value::ColumnIndex; renamecols::Function=identity)
unstack(df::AbstractDataFrame, colkey::ColumnIndex, value::ColumnIndex;
    renamecols::Function=identity)
unstack(df::AbstractDataFrame; renamecols::Function=identity)

and the question is what signatures you would like to have.

In general it seems that we can resolve this issue in a non breaking way.

bkamins avatar Sep 09 '20 20:09 bkamins

We could also consider adding a kwarg to specify the sentinel for missing interactions, i.e. combinations of rowkeys and colkey that do not occur in the data (currently it has to be missing).
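To show the behaviour in question, here is a minimal sketch with an invented table. Replacing the sentinel afterwards with `coalesce` is an assumption about how one might work around it, not something proposed in this thread.

```julia
using DataFrames

# Long table in which the combination (id = 2, variable = "b") is absent.
long = DataFrame(id = [1, 1, 2],
                 variable = ["a", "b", "a"],
                 value = [10, 20, 30])

wide = unstack(long, :id, :variable, :value)

# The absent combination shows up as missing in column :b.
# One workaround is to replace the sentinel after the fact.
b_filled = coalesce.(wide.b, 0)
```

A dedicated keyword would let the caller choose the fill value at the point of unstacking instead of post-processing each column.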

bkamins avatar Jan 26 '21 08:01 bkamins

#2743 is trying to address this issue, by simplifying the function signature.

sl-solution avatar May 01 '21 23:05 sl-solution

I am closing this in favor of https://github.com/JuliaData/DataFrames.jl/issues/3237 (to have a single place to discuss all related issues)

bkamins avatar Dec 05 '22 11:12 bkamins