dplyr icon indicating copy to clipboard operation
dplyr copied to clipboard

cur_column() requires extra parentheses to work inside if_any() and if_all()

Open duyjpr opened this issue 2 years ago • 1 comments

Say I have a list filter functions that I want to run against a data frame. A natural way to do this would be to use filter(if_all(...)). This gives an error when you invoke cur_column() inside if_all(...).

library(dplyr)
criteria <- list(
    Species = \(.x) paste(.x) < "v",
    Petal.Width = \(.x) .x < .3
)

# Error: cur_column() demands to be called inside across()
iris |>
    filter(
        if_all(
            names(criteria),
            ~ criteria[[cur_column()]](.x)
        )
    )
#> Error in `filter()`:
#> ! Problem while computing `..1 = criteria[[cur_column()]](Species) &
#>   ...`.
#> Caused by error in `cur_column()`:
#> ! Must be used inside `across()`.

However, if I simply add extra parentheses around the if_all(...), everything works:

# OK: extra parens around (if_all(...cur_column()...))
iris |>
    filter(
        (if_all(
            names(criteria),
            ~ criteria[[cur_column()]](.x)
        ))
    ) |>
    with(table(Species)) # less verbose output
#> Species
#>     setosa versicolor  virginica 
#>         34          0          0

If I use across(...) instead of if_all(...), it works without extra parentheses but this is deprecated. Fixing the warning was how I discovered this issue.


Full reprex.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

criteria <- list(
    Species = \(.x) paste(.x) < "v",
    Petal.Width = \(.x) .x < .3
)

# Error: cur_column() demands to be called inside across()
iris |>
    filter(
        if_all(
            names(criteria),
            ~ criteria[[cur_column()]](.x)
        )
    )
#> Error in `filter()`:
#> ! Problem while computing `..1 = criteria[[cur_column()]](Species) &
#>   ...`.
#> Caused by error in `cur_column()`:
#> ! Must be used inside `across()`.

# OK: extra parens around (if_all(...cur_column()...))
iris |>
    filter(
        (if_all(
            names(criteria),
            ~ criteria[[cur_column()]](.x)
        ))
    ) |>
    with(table(Species)) # less verbose output
#> Species
#>     setosa versicolor  virginica 
#>         34          0          0

# OK but deprecated
iris |>
    filter(
        across(
            names(criteria),
            ~ criteria[[cur_column()]](.x)
        )
    ) |>
    with(table(Species)) # less verbose output
#> Warning: Using `across()` in `filter()` is deprecated, use `if_any()` or
#> `if_all()`.
#> Species
#>     setosa versicolor  virginica 
#>         34          0          0

sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19043)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
#> [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
#> [5] LC_TIME=English_Australia.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.0.8
#> 
#> loaded via a namespace (and not attached):
#>  [1] pillar_1.7.0      compiler_4.1.2    highr_0.9         R.methodsS3_1.8.1
#>  [5] R.utils_2.11.0    tools_4.1.2       digest_0.6.29     evaluate_0.15    
#>  [9] lifecycle_1.0.1   tibble_3.1.6      R.cache_0.15.0    pkgconfig_2.0.3  
#> [13] rlang_1.0.1       reprex_2.0.1      DBI_1.1.2         cli_3.1.1        
#> [17] rstudioapi_0.13   yaml_2.2.2        xfun_0.29         fastmap_1.1.0    
#> [21] withr_2.4.3       styler_1.6.2      stringr_1.4.0     knitr_1.37       
#> [25] generics_0.1.2    fs_1.5.2          vctrs_0.3.8       tidyselect_1.1.1 
#> [29] glue_1.6.1        R6_2.5.1          fansi_1.0.2       rmarkdown_2.11   
#> [33] purrr_0.3.4       magrittr_2.0.2    backports_1.4.1   ellipsis_0.3.2   
#> [37] htmltools_0.5.2   assertthat_0.2.1  utf8_1.2.2        stringi_1.7.6    
#> [41] crayon_1.5.0      R.oo_1.24.0

Created on 2022-02-25 by the reprex package (v2.0.1)

duyjpr avatar Feb 24 '22 13:02 duyjpr

Somewhat simpler reprex:

library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  x = c("a", "b", "x"),
  y = c("y", "z", "a")
)

df |> filter(if_any(everything(), ~ .x == cur_column()))
#> Error in `filter()`:
#> ! Problem while computing `..1 = x == cur_column() | y == cur_column()`.
#> Caused by error in `cur_column()`:
#> ! Must be used inside `across()`.

Created on 2022-04-15 by the reprex package (v2.0.1)

hadley avatar Apr 15 '22 23:04 hadley

I think the problem is that:

df |> filter(if_any(everything(), ~ .x == cur_column()))

tries to expand the if_any() call to:

df |> filter(x == cur_column() | y == cur_column())

But that is wrong for two reasons:

  • filter_eval()doesn't set a context for evaluating the cur_*() functions in, i.e. like context_poke("column", var)
  • Even if it did, there is no way to set up a "correct" context for this expanded expression, because x == cur_column() needs to use context_poke("column", "x") and y == cur_column() needs context_poke("column", "y") but you can't have both active at the same time in the same expression

It works correctly with the extra parenthesis wrapping because that prevents the expansion. The unexpanded if_any() call eventually passes off to across(), and across() evaluates x == cur_column() and y == cur_column() separately, setting the appropriate contexts before it evaluates them, and then merges the results with |.

I'm thinking it might not ever be possible to expand filter expressions correctly.

But I wonder if that is an issue, considering that filter() is typically not used on grouped data frames (so it suffers from less performance issues)?

DavisVaughan avatar Aug 19 '22 17:08 DavisVaughan

We might need some advice from @lionel- here

I guess it is a little unfortunate that cur_column() is the only helper where this issue would arise, as its the only one that is across specific

DavisVaughan avatar Aug 19 '22 17:08 DavisVaughan

One possible option would be to substitute cur_column() during expansion time, i.e. replace it with "x", "y". Probably best reserved if we can't find another solution since you still might use it inside a function where we wouldn't see to substitute it.

hadley avatar Aug 20 '22 12:08 hadley

But I wonder if that is an issue, considering that filter() is typically not used on grouped data frames (so it suffers from less performance issues)?

The expansion is also useful to get cross-backend support for free and should be helpful with hybrid eval 2. Expansion has nice properties but it's a contagious design, all features need to be expandable. If expansion is important (I think it is), then I don't see any other solution than substitution (i.e. expansion).

Substitution is always going to have annoying and probably unforeseen edge cases though (should we substitute past tilde/function boundaries, should we care about recursive cases where users somehow call dplyr again within the expression, ...). To limit failing edge cases to an acceptable level it might be enough add a bail out mechanism that stops the expansion and falls back to a pure evaluation backend if anything unusual like a lambda is detected.

Edit: In this case it would be nice if the expansion failure could be signalled to backends so they can take appropriate measures like failing with an informative error message.

lionel- avatar Aug 22 '22 05:08 lionel-

It's worth noting that the original code is equivalent to:

iris |>
    filter(as.character(Species) < "v" | Petal.Width < .3) 

And that if_any()/if_all() was design with the goal of applying the same expression to each column. If you want to supply a different expression to each column, you use filter() directly.

So I think we've convinced ourselves that the current behaviour is correct.

hadley avatar Sep 07 '22 14:09 hadley