dplyr
dplyr copied to clipboard
cur_column() requires extra parentheses to work inside if_any() and if_all()
Say I have a list filter functions that I want to run against a data frame. A natural way to do this would be to use filter(if_all(...))
. This gives an error when you invoke cur_column()
inside if_all(...)
.
library(dplyr)
criteria <- list(
Species = \(.x) paste(.x) < "v",
Petal.Width = \(.x) .x < .3
)
# Error: cur_column() demands to be called inside across()
iris |>
filter(
if_all(
names(criteria),
~ criteria[[cur_column()]](.x)
)
)
#> Error in `filter()`:
#> ! Problem while computing `..1 = criteria[[cur_column()]](Species) &
#> ...`.
#> Caused by error in `cur_column()`:
#> ! Must be used inside `across()`.
However, if I simply add extra parentheses around the if_all(...)
, everything works:
# OK: extra parens around (if_all(...cur_column()...))
iris |>
filter(
(if_all(
names(criteria),
~ criteria[[cur_column()]](.x)
))
) |>
with(table(Species)) # less verbose output
#> Species
#> setosa versicolor virginica
#> 34 0 0
If I use across(...)
instead of if_all(...)
, it works without extra parentheses but this is deprecated. Fixing the warning was how I discovered this issue.
Full reprex.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
criteria <- list(
Species = \(.x) paste(.x) < "v",
Petal.Width = \(.x) .x < .3
)
# Error: cur_column() demands to be called inside across()
iris |>
filter(
if_all(
names(criteria),
~ criteria[[cur_column()]](.x)
)
)
#> Error in `filter()`:
#> ! Problem while computing `..1 = criteria[[cur_column()]](Species) &
#> ...`.
#> Caused by error in `cur_column()`:
#> ! Must be used inside `across()`.
# OK: extra parens around (if_all(...cur_column()...))
iris |>
filter(
(if_all(
names(criteria),
~ criteria[[cur_column()]](.x)
))
) |>
with(table(Species)) # less verbose output
#> Species
#> setosa versicolor virginica
#> 34 0 0
# OK but deprecated
iris |>
filter(
across(
names(criteria),
~ criteria[[cur_column()]](.x)
)
) |>
with(table(Species)) # less verbose output
#> Warning: Using `across()` in `filter()` is deprecated, use `if_any()` or
#> `if_all()`.
#> Species
#> setosa versicolor virginica
#> 34 0 0
sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19043)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252
#> [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
#> [5] LC_TIME=English_Australia.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] dplyr_1.0.8
#>
#> loaded via a namespace (and not attached):
#> [1] pillar_1.7.0 compiler_4.1.2 highr_0.9 R.methodsS3_1.8.1
#> [5] R.utils_2.11.0 tools_4.1.2 digest_0.6.29 evaluate_0.15
#> [9] lifecycle_1.0.1 tibble_3.1.6 R.cache_0.15.0 pkgconfig_2.0.3
#> [13] rlang_1.0.1 reprex_2.0.1 DBI_1.1.2 cli_3.1.1
#> [17] rstudioapi_0.13 yaml_2.2.2 xfun_0.29 fastmap_1.1.0
#> [21] withr_2.4.3 styler_1.6.2 stringr_1.4.0 knitr_1.37
#> [25] generics_0.1.2 fs_1.5.2 vctrs_0.3.8 tidyselect_1.1.1
#> [29] glue_1.6.1 R6_2.5.1 fansi_1.0.2 rmarkdown_2.11
#> [33] purrr_0.3.4 magrittr_2.0.2 backports_1.4.1 ellipsis_0.3.2
#> [37] htmltools_0.5.2 assertthat_0.2.1 utf8_1.2.2 stringi_1.7.6
#> [41] crayon_1.5.0 R.oo_1.24.0
Created on 2022-02-25 by the reprex package (v2.0.1)
Somewhat simpler reprex:
library(dplyr, warn.conflicts = FALSE)
df <- tibble(
x = c("a", "b", "x"),
y = c("y", "z", "a")
)
df |> filter(if_any(everything(), ~ .x == cur_column()))
#> Error in `filter()`:
#> ! Problem while computing `..1 = x == cur_column() | y == cur_column()`.
#> Caused by error in `cur_column()`:
#> ! Must be used inside `across()`.
Created on 2022-04-15 by the reprex package (v2.0.1)
I think the problem is that:
df |> filter(if_any(everything(), ~ .x == cur_column()))
tries to expand the if_any()
call to:
df |> filter(x == cur_column() | y == cur_column())
But that is wrong for two reasons:
-
filter_eval()
doesn't set a context for evaluating thecur_*()
functions in, i.e. likecontext_poke("column", var)
- Even if it did, there is no way to set up a "correct" context for this expanded expression, because
x == cur_column()
needs to usecontext_poke("column", "x")
andy == cur_column()
needscontext_poke("column", "y")
but you can't have both active at the same time in the same expression
It works correctly with the extra parenthesis wrapping because that prevents the expansion. The unexpanded if_any()
call eventually passes off to across()
, and across()
evaluates x == cur_column()
and y == cur_column()
separately, setting the appropriate contexts before it evaluates them, and then merges the results with |
.
I'm thinking it might not ever be possible to expand filter expressions correctly.
But I wonder if that is an issue, considering that filter()
is typically not used on grouped data frames (so it suffers from less performance issues)?
We might need some advice from @lionel- here
I guess it is a little unfortunate that cur_column()
is the only helper where this issue would arise, as its the only one that is across specific
One possible option would be to substitute cur_column()
during expansion time, i.e. replace it with "x"
, "y"
. Probably best reserved if we can't find another solution since you still might use it inside a function where we wouldn't see to substitute it.
But I wonder if that is an issue, considering that filter() is typically not used on grouped data frames (so it suffers from less performance issues)?
The expansion is also useful to get cross-backend support for free and should be helpful with hybrid eval 2. Expansion has nice properties but it's a contagious design, all features need to be expandable. If expansion is important (I think it is), then I don't see any other solution than substitution (i.e. expansion).
Substitution is always going to have annoying and probably unforeseen edge cases though (should we substitute past tilde/function boundaries, should we care about recursive cases where users somehow call dplyr again within the expression, ...). To limit failing edge cases to an acceptable level it might be enough add a bail out mechanism that stops the expansion and falls back to a pure evaluation backend if anything unusual like a lambda is detected.
Edit: In this case it would be nice if the expansion failure could be signalled to backends so they can take appropriate measures like failing with an informative error message.
It's worth noting that the original code is equivalent to:
iris |>
filter(as.character(Species) < "v" | Petal.Width < .3)
And that if_any()
/if_all()
was design with the goal of applying the same expression to each column. If you want to supply a different expression to each column, you use filter()
directly.
So I think we've convinced ourselves that the current behaviour is correct.