Use of tidyselect syntax to select many columns when pmap'ing

Open mattwarkentin opened this issue 4 years ago • 5 comments

Hi,

I don't think this is currently possible, but I was wondering whether tidyselect syntax could be supported when using pmap() or its variants within a mutate()/summarise() call (i.e. when using a data mask). I run into this often enough, and I can never quite figure out the cleanest way to solve it.

For example, if I have a large number of columns, I don't want to specify each column name individually. Taking the ith value from each of many columns and doing something with those values is tedious when every column has to be named by hand.

rowwise() and c_across() get close, but c_across() combines the row-wise values into a single vector. The values therefore have to be the same type, and you must pass one vector as a single argument instead of passing multiple values as separate arguments.

Maybe there needs to be a list_across() function similar to c_across()...

library(tidyverse)

dd <-
  matrix(rnorm(10*10), ncol = 10) %>% 
  as_tibble()
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
dd
#> # A tibble: 10 x 10
#>        V1      V2     V3      V4      V5     V6     V7     V8     V9    V10
#>     <dbl>   <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#>  1  1.11   1.78   -0.280  0.0915  2.30    1.21   0.203  0.704 -1.19   2.03 
#>  2 -1.48   1.73    0.405  2.08    0.0974 -1.41   0.498 -1.19  -1.67  -0.149
#>  3  0.390 -0.333  -1.21   1.19   -0.676  -0.224 -1.40   1.50  -2.28  -1.15 
#>  4  0.196 -0.983  -2.28   0.207  -0.599   1.99   1.88   0.935 -1.30   0.110
#>  5  0.503 -0.205   0.613  1.26   -0.764  -1.23  -0.573 -0.624  0.579 -1.65 
#>  6 -2.33   0.988   0.956 -0.0186  1.41   -0.920 -0.379 -0.867 -1.25  -1.04 
#>  7  1.05  -0.0904  0.208  0.0388  0.242  -1.47   0.869 -0.434  1.96  -0.388
#>  8  0.243 -0.692   0.490 -1.17   -0.203   0.187 -0.825 -0.553  2.22  -1.06 
#>  9  0.722 -0.0617  0.286 -0.0665 -0.0451 -2.56  -0.717  1.00  -1.16  -0.554
#> 10  1.37   0.968   1.71   0.0363  0.620   0.367 -0.915  0.123  0.767  0.620

dd %>% 
  mutate(foo = pmap(list(V1, V2, V3, V4, V5), sum)) # too manual, doesn't scale
#> # A tibble: 10 x 11
#>        V1      V2     V3      V4      V5     V6     V7     V8     V9    V10
#>     <dbl>   <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#>  1  1.11   1.78   -0.280  0.0915  2.30    1.21   0.203  0.704 -1.19   2.03 
#>  2 -1.48   1.73    0.405  2.08    0.0974 -1.41   0.498 -1.19  -1.67  -0.149
#>  3  0.390 -0.333  -1.21   1.19   -0.676  -0.224 -1.40   1.50  -2.28  -1.15 
#>  4  0.196 -0.983  -2.28   0.207  -0.599   1.99   1.88   0.935 -1.30   0.110
#>  5  0.503 -0.205   0.613  1.26   -0.764  -1.23  -0.573 -0.624  0.579 -1.65 
#>  6 -2.33   0.988   0.956 -0.0186  1.41   -0.920 -0.379 -0.867 -1.25  -1.04 
#>  7  1.05  -0.0904  0.208  0.0388  0.242  -1.47   0.869 -0.434  1.96  -0.388
#>  8  0.243 -0.692   0.490 -1.17   -0.203   0.187 -0.825 -0.553  2.22  -1.06 
#>  9  0.722 -0.0617  0.286 -0.0665 -0.0451 -2.56  -0.717  1.00  -1.16  -0.554
#> 10  1.37   0.968   1.71   0.0363  0.620   0.367 -0.915  0.123  0.767  0.620
#> # … with 1 more variable: foo <list>

dd %>% 
  rowwise() %>% 
  mutate(foo = sum(c_across(V1:V5))) # too strict; combines to vector
#> # A tibble: 10 x 11
#> # Rowwise: 
#>        V1      V2     V3      V4      V5     V6     V7     V8     V9    V10
#>     <dbl>   <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#>  1  1.11   1.78   -0.280  0.0915  2.30    1.21   0.203  0.704 -1.19   2.03 
#>  2 -1.48   1.73    0.405  2.08    0.0974 -1.41   0.498 -1.19  -1.67  -0.149
#>  3  0.390 -0.333  -1.21   1.19   -0.676  -0.224 -1.40   1.50  -2.28  -1.15 
#>  4  0.196 -0.983  -2.28   0.207  -0.599   1.99   1.88   0.935 -1.30   0.110
#>  5  0.503 -0.205   0.613  1.26   -0.764  -1.23  -0.573 -0.624  0.579 -1.65 
#>  6 -2.33   0.988   0.956 -0.0186  1.41   -0.920 -0.379 -0.867 -1.25  -1.04 
#>  7  1.05  -0.0904  0.208  0.0388  0.242  -1.47   0.869 -0.434  1.96  -0.388
#>  8  0.243 -0.692   0.490 -1.17   -0.203   0.187 -0.825 -0.553  2.22  -1.06 
#>  9  0.722 -0.0617  0.286 -0.0665 -0.0451 -2.56  -0.717  1.00  -1.16  -0.554
#> 10  1.37   0.968   1.71   0.0363  0.620   0.367 -0.915  0.123  0.767  0.620
#> # … with 1 more variable: foo <dbl>

dd %>% 
  mutate(
    foo = pmap(list_across(V1:V5), sum) # desired behaviour
  )

Created on 2020-11-25 by the reprex package (v0.3.0)

mattwarkentin avatar Nov 25 '20 19:11 mattwarkentin
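
The difference described in the post above can be sketched on a small two-column table. This is only an illustration: the tibble `mixed` and its columns `n` and `x` are made up, and the `pmap()` call lists its columns by hand, which is exactly the step the issue asks to avoid.

```r
library(dplyr, warn.conflicts = FALSE)
library(purrr)

# Hypothetical toy data: one integer column and one double column.
mixed <- tibble(n = 1:2, x = c(0.5, 1.5))

# c_across() hands the function ONE vector per row, so the integer
# value is promoted to double when the two values are combined.
combined <- mixed %>%
  rowwise() %>%
  mutate(v = list(c_across(c(n, x)))) %>%
  ungroup()

# pmap() hands the function one argument per column, and each value
# keeps the type of the column it came from.
separate <- pmap(list(mixed$n, mixed$x), function(n, x) list(n = n, x = x))

typeof(combined$v[[1]])   # type of the combined per-row vector
typeof(separate[[1]]$n)   # type of the first argument seen by the function
```

With columns that have no common type (for example character and double), the combining step is expected to fail rather than coerce, which is the "too strict" case the post refers to.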

Here is a quick-and-dirty version of list_across() that has the desired behaviour...happy to clean this up into a PR, if desired.

library(tidyverse)

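list_across <- function(cols = everything()) {
  # Capture the tidyselect expression without evaluating it
  cols <- rlang::enquo(cols)
  # NOTE: peek_mask() is an unexported dplyr internal and may change without notice
  data <- dplyr:::peek_mask()$full_data()
  # Resolve the selection to a named vector of column positions
  vars <- tidyselect::eval_select(cols, data)
  # Keep only the selected columns
  data <- dplyr::select(data, tidyselect::all_of(vars))
  # Return the columns as a named list, ready for pmap()
  as.list(data)
}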

mtcars %>% 
  as_tibble() %>% 
  transmute(
    foo = pmap_chr(list_across(1:2), paste, sep = "_"),
    bar = pmap_dbl(list_across(where(is.numeric)), sum)
  )
#> # A tibble: 32 x 2
#>    foo      bar
#>    <chr>  <dbl>
#>  1 21_6    329.
#>  2 21_6    330.
#>  3 22.8_4  260.
#>  4 21.4_6  426.
#>  5 18.7_8  590.
#>  6 18.1_6  386.
#>  7 14.3_8  657.
#>  8 24.4_4  271.
#>  9 22.8_4  300.
#> 10 19.2_6  350.
#> # … with 22 more rows

mattwarkentin avatar Nov 25 '20 20:11 mattwarkentin

@mattwarkentin I'm not sure the examples you provided make a strong argument for a list_across function; they can be done effectively using rowwise and c_across.

msunij avatar Jul 27 '21 16:07 msunij

I wasn't really trying to make a strong argument with these examples - they are just toy examples to accompany the issue as I see it. As I said in the original post, rowwise() + c_across() have limitations in that they are too strict in some situations (i.e. c_across() tries to concatenate all the values into a single vector, so the values must all be the same type).

Where c_across() combines everything into a single vector, something like list_across() would support tidyselect semantics to make it easy to form lists that can be used with the suite of purrr functionals.

mattwarkentin avatar Jul 27 '21 16:07 mattwarkentin

One cannot be sure about the order of the columns selected using tidyselect. If the column order changes, the underlying function won't be executed properly, because the data types it receives will differ from what it expects. I think that's why in pmap we have to manually type in all the input column names as a list. That's also why I thought your examples don't properly convey the need for list_across: sum takes only numeric types, and paste can take anything.

msunij avatar Jul 28 '21 08:07 msunij
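
On the ordering concern, it may help that `pmap()` supplies arguments by name when the list it receives is named, and the columns of a data frame always are. A minimal sketch, assuming a hypothetical helper `describe()` and made-up data `people`; it uses `across()` without functions to select columns, for which newer dplyr versions offer `pick()` instead.

```r
library(dplyr, warn.conflicts = FALSE)
library(purrr)

# Hypothetical toy data and helper, for illustration only.
people <- tibble(id = 1:2, name = c("ann", "bob"))
describe <- function(id, name) paste0(name, " has id ", id)

# Columns arrive in the order id, name.
a <- people %>%
  mutate(label = pmap_chr(across(everything()), describe))

# Columns arrive in the order name, id; arguments are still matched by name.
b <- people %>%
  select(name, id) %>%
  mutate(label = pmap_chr(across(everything()), describe))

identical(a$label, b$label)
```

This only holds when the function's argument names match the column names; a function that takes its inputs positionally (or through `...`) would still be sensitive to column order.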

I see this is still open, so I'd like to offer a +1 for this idea! It would be very useful, e.g. for assembling API requests in a tidy format. Very often a request will combine different data types, and the order of elements in the list does not matter, so @mattwarkentin's function would be very handy.

aammd avatar Feb 28 '22 16:02 aammd

I can't tell exactly what you want, but I think across() is enough?

library(dplyr, warn.conflicts = FALSE)

df <- rnorm(10*10) |> 
  matrix(ncol = 10, dimnames = list(NULL, paste0("V", 1:10))) %>% 
  as_tibble()

df %>% 
  mutate(foo = purrr::pmap_dbl(across(V1:V5), sum))
#> # A tibble: 10 × 11
#>         V1      V2      V3     V4      V5     V6     V7     V8      V9    V10
#>      <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>
#>  1 -1.26   -0.0704 -0.247  -0.177  0.583  -1.64   2.28  -0.466  0.359  -1.88 
#>  2 -0.523   1.25   -1.53   -0.378  0.618  -0.709  0.319  0.409 -0.647   1.42 
#>  3  1.15   -1.67   -0.133   0.768 -0.652   0.632  0.482  0.939  0.911  -0.713
#>  4  1.80   -0.0164 -0.703   0.504  2.30   -1.06  -0.207  0.444 -0.439  -0.943
#>  5 -2.24    1.22   -1.19   -0.193  0.814  -0.989 -0.193  0.107  0.0478  0.901
#>  6 -0.0350 -1.85   -1.28    0.324  0.725   0.334  1.30  -1.94   0.403  -2.44 
#>  7 -0.884  -0.512  -2.07   -0.150 -1.07    1.20  -0.705 -0.850  0.183  -0.158
#>  8  0.378  -0.825  -0.674   0.283 -0.0989  1.05  -0.444  0.850  1.29   -0.189
#>  9  0.243   0.188   0.0643 -0.888  0.870   1.57  -0.780  0.409  0.778   0.173
#> 10 -2.02   -0.552   0.496  -0.899 -0.126   0.849 -1.62   1.11  -1.26    0.620
#> # … with 1 more variable: foo <dbl>

Created on 2022-08-24 by the reprex package (v2.0.1)

In general, we're becoming increasingly confident that NSE (like tidyselect) doesn't belong in purrr.

hadley avatar Aug 24 '22 08:08 hadley
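
For the mixed-type use case raised earlier in the thread, the across() approach shown above seems to carry over, since across() returns a data frame and pmap() iterates over its columns in parallel. A rough sketch under stated assumptions: the `requests` tibble and its columns are invented for illustration, and `list` stands in for whatever function would actually build a request.

```r
library(dplyr, warn.conflicts = FALSE)
library(purrr)

# Hypothetical request table mixing character, integer and logical columns.
requests <- tibble(
  endpoint = c("users", "orders"),
  page     = c(1L, 2L),
  verbose  = c(TRUE, FALSE)
)

# Select columns with tidyselect syntax, then build one named list per row.
# Each value keeps the type of its source column.
out <- requests %>%
  mutate(params = pmap(across(c(page, verbose)), list))

str(out$params[[1]])
```

On dplyr 1.1.0 or later, `pick(page, verbose)` could replace the `across()` call here; that substitution is an assumption about the reader's installed version, not something stated in the thread.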