dplyr icon indicating copy to clipboard operation
dplyr copied to clipboard

Dedicated function for selecting from current data

Open mikkmart opened this issue 2 years ago • 5 comments

Would it be useful to have a dedicated function (say, pick()) to select columns from the current data? Currently, across() with only a .cols argument serves this role.

I would see a dedicated function having at least three advantages:

  1. Nicer syntax for union selections: pick(1, last_col()) vs. across(c(1, last_col())).
  2. Better semantics. across() makes sense when there’s functions to apply, but less so when it’s used just for selecting columns. pick() seems intuitive for only selecting columns.
  3. Reuse existing patterns: across(c(1:2, 4), mean) vs. map_df(pick(1:2, 4), mean). The first requires you to know that across() can select columns and apply a function, latter can re-use existing function application methods.

The last point is particularly important if/when ... is deprecated in across() (#6073), as funtionality would not be identical anymore. For example:

# With no ..., need to use an anonymous function for na.rm
across(c(1, 3:4), ~ mean(., na.rm = TRUE))

# Could be avoided with `pick()`
map_df(pick(1, 3:4), mean, na.rm = TRUE)

I would see the primary uses for this as:

  1. Replace across() in e.g. group_by() selections group_by(across(c(1, 3:5))) vs. group_by(pick(1, 3:5)). Big semantic and syntactic win, IMO.
  2. Passing arguments to functions that take data frame or matrix arguments. For example common questions about taking means or sums over rows in data frames. In my experience people don’t think to apply(across(1:5), 1, f), but apply(pick(1:5), 1, f) might be more intuitive.

I could think of two ways to implement this as a wrapper:

pick <- function(...) {
  across(.cols = c(...))
}

Or:

pick <- function(...) {
  select(cur_data(), ...)
}

Although, particularly with the across() route, it would seem nicer to reverse the dependency and extract the relevant parts from across() intopick() instead.

I appreciate your consideration for this feature request.

mikkmart avatar Mar 03 '22 12:03 mikkmart

We've also discussed letting cur_data() have a cols argument, or ..., for this purpose.

And possibly a groups = TRUE/FALSE argument to replace cur_data_all()

DavisVaughan avatar Mar 03 '22 17:03 DavisVaughan

This function feels conceptually more like across() than cur_*(), so I like pick() as a name. Maybe it could include a .group_vars argument? i.e. something like:

pick <- function(..., .group_vars = TRUE) {
  data <- if (isTRUE(.group_vars)) cur_data_all() else cur_data()
  select(data, ...)
}

I'm not sure if it's worth attempting to optimise this further.

hadley avatar Aug 01 '22 13:08 hadley

We should probably work towards deprecating cur_data() and cur_data_all() in favor of pick() too

DavisVaughan avatar Aug 22 '22 15:08 DavisVaughan

When we implement this I assume we'll also change:

across(.cols = everything(), .fns = NULL, ..., .names = NULL, .unpack = FALSE)
if_any(.cols = everything(), .fns = NULL, ..., .names = NULL)
if_all(.cols = everything(), .fns = NULL, ..., .names = NULL)

To

across(.cols, .fns, ..., .names = NULL, .unpack = FALSE)
if_any(.cols, .fns, ..., .names = NULL)
if_all(.cols, .fns, ..., .names = NULL)

Possible with deprecation, something like:

if (missing(.cols)) {
  lifecycle::deprecate_warn("1.1.0", I("across() without `.cols`"), I("`everything()` to select all columns"))
}

hadley avatar Sep 14 '22 14:09 hadley

Yea and .cols would continue to work, like:

if (missing(.cols)) {
  lifecycle::deprecate_warn("1.1.0", I("across() without `.cols`"), I("`everything()` to select all columns"))
  .cols <- quote(everything())
}

DavisVaughan avatar Sep 14 '22 15:09 DavisVaughan

Also contrary to my earlier example, group_vars absolutely needs to default to FALSE.

hadley avatar Sep 22 '22 14:09 hadley