dplyr
Dedicated function for selecting from current data
Would it be useful to have a dedicated function (say, `pick()`) to select columns from the current data? Currently, `across()` with only a `.cols` argument serves this role.
I would see a dedicated function having at least three advantages:
- Nicer syntax for union selections: `pick(1, last_col())` vs. `across(c(1, last_col()))`.
- Better semantics. `across()` makes sense when there are functions to apply, but less so when it's used just for selecting columns. `pick()` seems intuitive for only selecting columns.
- Reuse existing patterns: `across(c(1:2, 4), mean)` vs. `map_df(pick(1:2, 4), mean)`. The first requires you to know that `across()` can both select columns and apply a function; the latter can reuse existing function application methods.
The last point is particularly important if/when `...` is deprecated in `across()` (#6073), as the functionality would no longer be identical. For example:
# With no ..., need to use an anonymous function for na.rm
across(c(1, 3:4), ~ mean(., na.rm = TRUE))
# Could be avoided with `pick()`
map_df(pick(1, 3:4), mean, na.rm = TRUE)
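To make the comparison above concrete, here is a minimal sketch run inside `summarise()`. It assumes dplyr 1.1.0 or later (which ships a `pick()` along the lines discussed here) and purrr; the data frame `df` and its column names are made up for illustration.

```r
library(dplyr)
library(purrr)

# Hypothetical example data: columns 1, 3 and 4 are numeric, column 2 is not.
df <- tibble(
  a = c(1, NA, 3),
  b = c("u", "v", "w"),
  c = c(2, 4, NA),
  d = c(10, 20, 30)
)

# across(): na.rm has to be passed through an anonymous function.
with_across <- df %>%
  summarise(across(c(1, 3:4), ~ mean(.x, na.rm = TRUE)))

# pick(): select the columns first, then reuse an ordinary mapper and
# pass na.rm straight through its `...`.
with_pick <- df %>%
  summarise(map_df(pick(1, 3:4), mean, na.rm = TRUE))
```

Both calls should produce the same one-row result; the second only needs `pick()` to return the selected columns as a data frame.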
I would see the primary uses for this as:
- Replace `across()` in e.g. `group_by()` selections: `group_by(across(c(1, 3:5)))` vs. `group_by(pick(1, 3:5))`. Big semantic and syntactic win, IMO.
- Passing arguments to functions that take data frame or matrix arguments. For example, common questions about taking means or sums over rows in data frames. In my experience people don't think to `apply(across(1:5), 1, f)`, but `apply(pick(1:5), 1, f)` might be more intuitive.
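The two uses listed above could look roughly like this. This is a sketch assuming dplyr 1.1.0 or later, where `pick()` is available; `df`, `by_g` and `with_sums` are illustrative names, not part of the proposal.

```r
library(dplyr)

# Hypothetical example data.
df <- tibble(
  g = c("a", "a", "b"),
  x = c(1, 2, 3),
  y = c(4, 5, 6)
)

# Use 1: a tidyselect expression inside group_by().
by_g <- df %>%
  group_by(pick(g)) %>%
  summarise(n = n())

# Use 2: hand the selected columns to a function that expects a
# data frame or matrix, here a row-wise sum via apply().
with_sums <- df %>%
  mutate(row_sum = apply(pick(x, y), 1, sum))
```

In the second call `pick(x, y)` returns a data frame of the selected columns, which `apply()` coerces to a matrix before iterating over rows.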
I could think of two ways to implement this as a wrapper:
pick <- function(...) {
across(.cols = c(...))
}
Or:
pick <- function(...) {
select(cur_data(), ...)
}
Although, particularly with the `across()` route, it would seem nicer to reverse the dependency and extract the relevant parts from `across()` into `pick()` instead.
I appreciate your consideration for this feature request.
We've also discussed letting `cur_data()` have a `cols` argument, or `...`, for this purpose.
And possibly a `groups = TRUE/FALSE` argument to replace `cur_data_all()`.
This function feels conceptually more like `across()` than `cur_*()`, so I like `pick()` as a name. Maybe it could include a `.group_vars` argument? i.e. something like:
pick <- function(..., .group_vars = TRUE) {
data <- if (isTRUE(.group_vars)) cur_data_all() else cur_data()
select(data, ...)
}
I'm not sure if it's worth attempting to optimise this further.
We should probably work towards deprecating `cur_data()` and `cur_data_all()` in favor of `pick()` too.
When we implement this I assume we'll also change:
across(.cols = everything(), .fns = NULL, ..., .names = NULL, .unpack = FALSE)
if_any(.cols = everything(), .fns = NULL, ..., .names = NULL)
if_all(.cols = everything(), .fns = NULL, ..., .names = NULL)
To
across(.cols, .fns, ..., .names = NULL, .unpack = FALSE)
if_any(.cols, .fns, ..., .names = NULL)
if_all(.cols, .fns, ..., .names = NULL)
Possibly with a deprecation, something like:
if (missing(.cols)) {
lifecycle::deprecate_warn("1.1.0", I("across() without `.cols`"), I("`everything()` to select all columns"))
}
Yeah, and a missing `.cols` would continue to work, like:
if (missing(.cols)) {
lifecycle::deprecate_warn("1.1.0", I("across() without `.cols`"), I("`everything()` to select all columns"))
.cols <- quote(everything())
}
Also, contrary to my earlier example, `group_vars` absolutely needs to default to `FALSE`.