dplyr icon indicating copy to clipboard operation
dplyr copied to clipboard

The case for `filter(.missing = NULL, .how = c("keep", "drop"))`

Open DavisVaughan opened this issue 2 years ago • 9 comments

There have been quite a few requests in the past for an "anti filter", i.e. I want to specify a set of conditions that determine which rows to drop. Additionally, it has traditionally been somewhat difficult to explain that filter() is about specifying rows to keep; that isn't really explained clearly in the verb name. Also, we've also seen in the past that it is mildly confusing that select() is about columns and filter() is about rows, again, there isn't anything in the verb names to describe the difference.


One thing we could consider doing is to add two new very explicit verbs:

  • keep_rows(data, ..., by = , missing = )
  • drop_rows(data, ..., by = , missing = )

Where keep_rows() is equivalent to filter(), and drop_rows() is the opposite.

To be very clear, filter() would never disappear. We would, however, consider superseding it in favor of these if they prove to be successful, which really only means we'd start using them in docs and workshops instead of filter(). We'd even consider not even superseding filter(), which many people find scary. Instead we'd just be aliasing keep_rows() as filter().

The biggest annoyance when writing a "drop" style expression with filter() is that you first have to write a "keep" expression and then painfully invert it. i.e.:

"drop rows from df where a and b and c are TRUE"

library(dplyr)

df <- tibble(
  id = c(1, 2, 3),
  a = c(TRUE, NA, TRUE),
  b = c(FALSE, TRUE, TRUE),
  c = c(TRUE, TRUE, TRUE)
)
df
#> # A tibble: 3 × 4
#>      id a     b     c    
#>   <dbl> <lgl> <lgl> <lgl>
#> 1     1 TRUE  FALSE TRUE 
#> 2     2 NA    TRUE  TRUE 
#> 3     3 TRUE  TRUE  TRUE

# "keep rows where a and b and c are TRUE"
# this nicely drops NAs because they don't match our specified criteria
filter(df, a, b, c)
#> # A tibble: 1 × 4
#>      id a     b     c    
#>   <dbl> <lgl> <lgl> <lgl>
#> 1     3 TRUE  TRUE  TRUE

# "drop rows where a and b and c are TRUE"
# this AWFULLY drops NAs because of how filter() works!
# NA doesn't match our criteria so shouldnt be seen as something to drop
filter(df, !(a & b & c))
#> # A tibble: 1 × 4
#>      id a     b     c    
#>   <dbl> <lgl> <lgl> <lgl>
#> 1     1 TRUE  FALSE TRUE

Note that even the seemingly correct "drop" expression is actually wrong when it comes to handling missing values. It is fairly hard to get this right.

The drop_rows() version would be:

df %>% drop_rows(a, b, c)
#> # A tibble: 2 × 4
#>      id a     b     c    
#>   <dbl> <lgl> <lgl> <lgl>
#> 1     1 TRUE  FALSE TRUE 
#> 2     2 NA    TRUE  TRUE 

Where NA isn't considered something you "drop" by default, but would be if missing was tweaked to whatever we decide means "treat a missing value like TRUE".


A few other notes:

  • missing is from https://github.com/tidyverse/dplyr/issues/6560 and controls how missing values are treated. By default, both functions would treat an NA as FALSE (i.e. missing values are never kept or dropped), but could be made to treat them as TRUE or an error. Though I don't think missing = c("keep", "drop", "error") works uniformly for both verbs so we'd need to think of another parameterization.
  • Both functions would support if_all() and if_any(), which I think form nice natural sentences. "drop rows if any are NA" sounds pretty good for drop_rows(df, if_any(c(a, b), is.na)). That is like tidyr::drop_na().
  • Neither would support across(), which we have been deprecating from filter() for a little while now.
  • Both functions would combine multiple conditions using &, as that is typically the natural way to combine multiple conditions and you can always get | behavior with either an explicit | or by using multiple calls to the function. i.e. df %>% drop_rows(x > 5 | y > 6) is the same as df %>% drop_rows(x > 5) %>% drop_rows(y > 6) (and you can't do that split trick with &). if_any() can also work for | when you need to apply the same function to multiple columns.
  • Both functions would support by

Some issues and questions related to this:

  • https://github.com/tidyverse/dplyr/issues/6888
  • https://github.com/tidyverse/dplyr/issues/1527
  • https://github.com/tidyverse/dplyr/issues/741
  • https://github.com/tidyverse/dplyr/issues/1797
  • https://stackoverflow.com/questions/45661377/delete-rows-based-on-multiple-conditions-with-dplyr

DavisVaughan avatar Jul 24 '23 13:07 DavisVaughan

We can also lean into the ambiguity of filter() and think about adding an argument like filter(.for = c(TRUE, FALSE, NA)) (which may also partly (completely?) handle #6560)

Or maybe

df %>% filter(.for = c(TRUE, NA))

Although that means the equivalent of df %>% drop_rows(a, b, c) is df %>% filter(a, b, c, .for = c(FALSE, NA)) which I think is still too hard to remember

The .for argument means you can technically do all 8 possible combinations of things to "keep", but I prove in the table in the comment below that that isn't actually useful.

DavisVaughan avatar Jul 24 '23 17:07 DavisVaughan

A key idea while thinking about this is realizing that both keep_rows(df, a, b) and drop_rows(df, a, b) would apply the exact same internal algorithm to combine a, b into a single logical result vector.

The common behavior between them is that both of them treat NA as FALSE by default. But:

  • For keep_rows() that means NA is dropped
  • For drop_rows() that means NA is kept

The only algorithmic difference between them is whether or not we use a ! at the very end before slicing.

a b result
T T T
F F F
NA NA F
T F F
F T F
T NA F
NA T F
F NA F
NA F F

Then:

keep_rows(df, a, b) == df[result,]
drop_rows(df, a, b) == df[!result,]

This gives a nice theoretical result where the following rbind recreates df (ignoring row ordering) even when NAs are in the mix:

rbind(
  keep_rows(df, a, b),
  drop_rows(df, a, b)
)

The theoretical missing argument would just mean that if missing = "keep" then NA is treated as TRUE instead.


The full grid of possible things to "keep" are:

name .for TRUE FALSE NA Useful?
keep_rows() filter(.for = TRUE)
keep_rows() + keep NA filter(.for = c(TRUE, NA))
drop_rows() filter(.for = c(FALSE, NA))
drop_rows() + drop NA filter(.for = FALSE)
Keep NA only filter(.for = NA) ❌, keep_rows(df, if_all(cols, is.na))
Drop NA only filter(.for = c(TRUE, FALSE)) ❌, drop_rows(df, if_all(cols, is.na)), tidyr::drop_na()
Keep everything filter(.for = c(TRUE, FALSE, NA))
Drop everything filter(.for = logical())

So only 4 of these seem needed - the 4 that go with keep/drop_rows() already.

"Keep NA only" and "Drop NA only" aren't actually as useful as they might appear, because they require that the columns are already logical vectors to work. i.e. drop_na(df, x) works if x is a double column, but filter(df, x, .for = c(TRUE, FALSE)) doesn't. Since you can accomplish what you actually want with if_all(cols, is.na) or just drop_na(), we don't need to worry about these.

I think this also reveals that the signatures for missing could be:

keep_rows(data, ..., missing = c("drop", "keep", "error"))
drop_rows(data, ..., missing = c("keep", "drop", "error"))

which do what you'd expect them to do by default

DavisVaughan avatar Jul 24 '23 18:07 DavisVaughan

What about a .how = c("keep", "drop") arg ? The signature of the function would signal what the name doesn't, and no need for more aliases and deprecation, the .missing arg takes care of the other ambiguity. I find .for confusing.

Another confusion btw sometimes comes from how ... are combined, filter() does an intersection, while select() does a union, here also a new arg would both make the function more flexible and make the signature self explanatory regarding the default behavior.

moodymudskipper avatar Jul 30 '23 10:07 moodymudskipper

filter(
  .data,
  ...,
  .missing = NULL,
  .how = c("keep", "drop")
)

This signature may work, where .missing can still take on one of c("keep", "drop", "error"), but would default to:

  • .missing = "drop" for .how = "keep" (current behavior of filter())
  • .missing = "keep" for .how = "drop"

This way when you are trying to drop rows, it still only drops rows where the condition is TRUE.

I like that that value of .missing makes filter(df, cond, .how = "keep") and filter(df, cond, .how = "drop") completely symmetric, even regarding missing values (i.e. you can rbind the results and get df back).

I also like that we use the same keep / drop terms in both arguments for consistency

DavisVaughan avatar Jul 31 '23 16:07 DavisVaughan