purrr
Use of tidyselect syntax to select many columns when pmap'ing
Hi,
I don't think this is currently possible, but I was wondering whether tidyselect syntax could somehow be supported when using pmap or its variants within a mutate()/summarise() call (i.e. when using a data mask). I run into this often enough, and I can never quite figure out the cleanest way to solve it.
For example, if I have a large number of columns, I don't want to specify each column name individually. Taking the i-th value from each of many columns and doing something with those values is tedious to achieve.
rowwise() and c_across() get close, but c_across() combines the row-wise values into a single vector, so they must all share a common type, and you must then pass one vector as a single argument instead of multiple values as their own arguments. Maybe there needs to be a list_across() function similar to c_across()...
library(tidyverse)
dd <-
  matrix(rnorm(10*10), ncol = 10) %>%
  as_tibble()
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
dd
#> # A tibble: 10 x 10
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1.11 1.78 -0.280 0.0915 2.30 1.21 0.203 0.704 -1.19 2.03
#> 2 -1.48 1.73 0.405 2.08 0.0974 -1.41 0.498 -1.19 -1.67 -0.149
#> 3 0.390 -0.333 -1.21 1.19 -0.676 -0.224 -1.40 1.50 -2.28 -1.15
#> 4 0.196 -0.983 -2.28 0.207 -0.599 1.99 1.88 0.935 -1.30 0.110
#> 5 0.503 -0.205 0.613 1.26 -0.764 -1.23 -0.573 -0.624 0.579 -1.65
#> 6 -2.33 0.988 0.956 -0.0186 1.41 -0.920 -0.379 -0.867 -1.25 -1.04
#> 7 1.05 -0.0904 0.208 0.0388 0.242 -1.47 0.869 -0.434 1.96 -0.388
#> 8 0.243 -0.692 0.490 -1.17 -0.203 0.187 -0.825 -0.553 2.22 -1.06
#> 9 0.722 -0.0617 0.286 -0.0665 -0.0451 -2.56 -0.717 1.00 -1.16 -0.554
#> 10 1.37 0.968 1.71 0.0363 0.620 0.367 -0.915 0.123 0.767 0.620
dd %>%
  mutate(foo = pmap(list(V1, V2, V3, V4, V5), sum)) # too manual, doesn't scale
#> # A tibble: 10 x 11
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1.11 1.78 -0.280 0.0915 2.30 1.21 0.203 0.704 -1.19 2.03
#> 2 -1.48 1.73 0.405 2.08 0.0974 -1.41 0.498 -1.19 -1.67 -0.149
#> 3 0.390 -0.333 -1.21 1.19 -0.676 -0.224 -1.40 1.50 -2.28 -1.15
#> 4 0.196 -0.983 -2.28 0.207 -0.599 1.99 1.88 0.935 -1.30 0.110
#> 5 0.503 -0.205 0.613 1.26 -0.764 -1.23 -0.573 -0.624 0.579 -1.65
#> 6 -2.33 0.988 0.956 -0.0186 1.41 -0.920 -0.379 -0.867 -1.25 -1.04
#> 7 1.05 -0.0904 0.208 0.0388 0.242 -1.47 0.869 -0.434 1.96 -0.388
#> 8 0.243 -0.692 0.490 -1.17 -0.203 0.187 -0.825 -0.553 2.22 -1.06
#> 9 0.722 -0.0617 0.286 -0.0665 -0.0451 -2.56 -0.717 1.00 -1.16 -0.554
#> 10 1.37 0.968 1.71 0.0363 0.620 0.367 -0.915 0.123 0.767 0.620
#> # … with 1 more variable: foo <list>
dd %>%
  rowwise() %>%
  mutate(foo = sum(c_across(V1:V5))) # too strict; combines to vector
#> # A tibble: 10 x 11
#> # Rowwise:
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1.11 1.78 -0.280 0.0915 2.30 1.21 0.203 0.704 -1.19 2.03
#> 2 -1.48 1.73 0.405 2.08 0.0974 -1.41 0.498 -1.19 -1.67 -0.149
#> 3 0.390 -0.333 -1.21 1.19 -0.676 -0.224 -1.40 1.50 -2.28 -1.15
#> 4 0.196 -0.983 -2.28 0.207 -0.599 1.99 1.88 0.935 -1.30 0.110
#> 5 0.503 -0.205 0.613 1.26 -0.764 -1.23 -0.573 -0.624 0.579 -1.65
#> 6 -2.33 0.988 0.956 -0.0186 1.41 -0.920 -0.379 -0.867 -1.25 -1.04
#> 7 1.05 -0.0904 0.208 0.0388 0.242 -1.47 0.869 -0.434 1.96 -0.388
#> 8 0.243 -0.692 0.490 -1.17 -0.203 0.187 -0.825 -0.553 2.22 -1.06
#> 9 0.722 -0.0617 0.286 -0.0665 -0.0451 -2.56 -0.717 1.00 -1.16 -0.554
#> 10 1.37 0.968 1.71 0.0363 0.620 0.367 -0.915 0.123 0.767 0.620
#> # … with 1 more variable: foo <dbl>
dd %>%
  mutate(
    foo = pmap(list_across(V1:V5), sum) # desired behaviour
  )
Created on 2020-11-25 by the reprex package (v0.3.0)
Here is a quick-and-dirty version of list_across() that has the desired behaviour... happy to clean this up into a PR, if desired.
library(tidyverse)
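list_across <- function(cols = everything()) {
  # Capture the tidyselect expression without evaluating it
  cols <- rlang::enquo(cols)
  # peek_mask() is an unexported dplyr internal and may change between releases
  data <- dplyr:::peek_mask()$full_data()
  # Resolve the selection to column positions, then keep only those columns
  vars <- tidyselect::eval_select(cols, data)
  data <- dplyr::select(data, tidyselect::all_of(vars))
  # Return the selected columns as a list, ready for pmap()
  as.list(data)
}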
mtcars %>%
  as_tibble() %>%
  transmute(
    foo = pmap_chr(list_across(1:2), paste, sep = "_"),
    bar = pmap_dbl(list_across(where(is.numeric)), sum)
  )
#> # A tibble: 32 x 2
#> foo bar
#> <chr> <dbl>
#> 1 21_6 329.
#> 2 21_6 330.
#> 3 22.8_4 260.
#> 4 21.4_6 426.
#> 5 18.7_8 590.
#> 6 18.1_6 386.
#> 7 14.3_8 657.
#> 8 24.4_4 271.
#> 9 22.8_4 300.
#> 10 19.2_6 350.
#> # … with 22 more rows
@mattwarkentin I'm not sure that the examples you provided make a strong argument for the need for a list_across function; those can be done effectively using rowwise and c_across.
I wasn't really trying to make a strong argument with these examples; they are just toy examples to accompany the issue as I see it. As I said in the original post, rowwise() + c_across() have limitations in that they are too strict in some situations (i.e. c_across() tries to concatenate all the values into a single vector, so they must all be the same type).
Where c_across() combines everything into a single vector, something like list_across() would support tidyselect semantics to make it easy to form lists that can be used with the suite of purrr functionals.
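To make the type restriction concrete, here is a minimal sketch on a made-up two-column tibble (the data and the column names n and label are assumptions for illustration, not taken from the thread). c_across() has to combine a row's values into one vector, and an integer and a character column have no common type, whereas pmap over a named list hands each column to the function as its own argument with its type intact.

```r
library(dplyr, warn.conflicts = FALSE)
library(purrr)

# Hypothetical data: one integer column and one character column
df <- tibble(n = c(2L, 3L), label = c("a", "b"))

# c_across() must combine n and label into a single vector; integer and
# character have no common type, so the mutate() call fails
res <- tryCatch(
  df %>% rowwise() %>% mutate(x = list(c_across(c(n, label)))),
  error = function(e) e
)
c_across_failed <- inherits(res, "error")

# pmap_chr() passes n and label as separate, typed arguments
out <- df %>%
  mutate(rep_label = pmap_chr(list(n = n, label = label),
                              function(n, label) strrep(label, n)))
out$rep_label
```

The list here is still written out by hand; the request in this issue is for a tidyselect-aware way to build that list.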
One cannot be sure about the order of the columns selected using tidyselect. If the order of the columns changes, the underlying function won't execute properly, because the data types it receives will differ from what it expects. I think that's the reason we have to type in all the input column names manually as a list in pmap. That's why I thought your examples don't properly convey the need for list_across: the function sum takes only numeric types and paste can take anything.
I see this is still open, so I'd like to offer a +1 for this idea! It would be very useful, e.g. for assembling API requests in a tidy format. Very often a request will combine different data types, and the order of elements in the list does not matter, so @mattwarkentin's function would be very handy.
I can't tell exactly what you want, but I think across() is enough?
library(dplyr, warn.conflicts = FALSE)
df <- rnorm(10*10) |>
  matrix(ncol = 10, dimnames = list(NULL, paste0("V", 1:10))) %>%
  as_tibble()
df %>%
  mutate(foo = purrr::pmap_dbl(across(V1:V5), sum))
#> # A tibble: 10 × 11
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -1.26 -0.0704 -0.247 -0.177 0.583 -1.64 2.28 -0.466 0.359 -1.88
#> 2 -0.523 1.25 -1.53 -0.378 0.618 -0.709 0.319 0.409 -0.647 1.42
#> 3 1.15 -1.67 -0.133 0.768 -0.652 0.632 0.482 0.939 0.911 -0.713
#> 4 1.80 -0.0164 -0.703 0.504 2.30 -1.06 -0.207 0.444 -0.439 -0.943
#> 5 -2.24 1.22 -1.19 -0.193 0.814 -0.989 -0.193 0.107 0.0478 0.901
#> 6 -0.0350 -1.85 -1.28 0.324 0.725 0.334 1.30 -1.94 0.403 -2.44
#> 7 -0.884 -0.512 -2.07 -0.150 -1.07 1.20 -0.705 -0.850 0.183 -0.158
#> 8 0.378 -0.825 -0.674 0.283 -0.0989 1.05 -0.444 0.850 1.29 -0.189
#> 9 0.243 0.188 0.0643 -0.888 0.870 1.57 -0.780 0.409 0.778 0.173
#> 10 -2.02 -0.552 0.496 -0.899 -0.126 0.849 -1.62 1.11 -1.26 0.620
#> # … with 1 more variable: foo <dbl>
Created on 2022-08-24 by the reprex package (v2.0.1)
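For readers on dplyr 1.1.0 or later: calling across() without functions was deprecated in that release in favour of pick(), which selects columns with tidyselect syntax and returns them as a data frame. A sketch of the same computation, assuming dplyr >= 1.1.0 and reusing the column names from the reply above (the seed is added only to make the sketch reproducible):

```r
library(dplyr, warn.conflicts = FALSE)
library(purrr)

set.seed(1)
df <- rnorm(10 * 10) %>%
  matrix(ncol = 10, dimnames = list(NULL, paste0("V", 1:10))) %>%
  as_tibble()

# pick() returns the selected columns as a data frame; pmap_dbl() then
# calls sum() once per row with each column value as its own argument
out <- df %>%
  mutate(foo = pmap_dbl(pick(V1:V5), sum))
```

Because pick() returns a data frame, columns of different types are kept as they are, which is the behaviour asked for with list_across().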
In general, we're becoming increasingly confident that NSE (like tidyselect) doesn't belong in purrr.