+ na_if()

Open romainfrancois opened this issue 4 years ago • 16 comments

closes #43

The example from https://github.com/tidyverse/dplyr/issues/5711 then can be:

library(dplyr, warn.conflicts = FALSE)
library(funs)
na_if <- funs::na_if

test <- tibble(Staff.Confirmed = c(0, 1, -999), Residents.Confirmed = c(12, -192, 0))
print(test)
#> # A tibble: 3 x 2
#>   Staff.Confirmed Residents.Confirmed
#>             <dbl>               <dbl>
#> 1               0                  12
#> 2               1                -192
#> 3            -999                   0

out <-  test %>% 
  mutate(staff_conf_naif = na_if(Staff.Confirmed, ~Staff.Confirmed < 0),
         staff_conf_ifelse = ifelse(Staff.Confirmed < 0, NA, Staff.Confirmed),
         
         res_conf_naif = na_if(Residents.Confirmed, ~ Residents.Confirmed < 0),
         res_conf_ifelse = ifelse(Residents.Confirmed < 0, NA, Residents.Confirmed)) %>% 
  select(Staff.Confirmed, staff_conf_naif, staff_conf_ifelse,
         Residents.Confirmed, res_conf_naif, res_conf_ifelse)
print(out)
#> # A tibble: 3 x 6
#>   Staff.Confirmed staff_conf_naif staff_conf_ifelse Residents.Confirmed
#>             <dbl>           <dbl>             <dbl>               <dbl>
#> 1               0               0                 0                  12
#> 2               1               1                 1                -192
#> 3            -999              NA                NA                   0
#> # … with 2 more variables: res_conf_naif <dbl>, res_conf_ifelse <dbl>

Created on 2021-05-06 by the reprex package (v2.0.0)

romainfrancois avatar May 06 '21 09:05 romainfrancois

I can simplify to vec_slice(x, vec_in(x, y)) <- NA, but I believe it's worth allowing y to be a predicate. Or maybe the predicate should be a separate argument whose default is derived from y?

na_if <- function(x, y, .fn = function(x) vec_in(x, y)){ ... }
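
A minimal sketch of how that second signature could be filled in, assuming only vctrs. The name na_if_sketch and the body are illustrative guesses, not the funs implementation:

```r
library(vctrs)

# Hypothetical sketch: `.fn` defaults to "is this element one of the values
# in y", but callers may pass any predicate returning a logical vector.
na_if_sketch <- function(x, y, .fn = function(x) vec_in(x, y)) {
  selected <- .fn(x)
  # The predicate must return exactly one TRUE/FALSE per element of x.
  vec_assert(selected, ptype = logical(), size = vec_size(x))
  vec_slice(x, selected) <- NA
  x
}

# Value-based: replace elements equal to -999.
na_if_sketch(c(0, 1, -999), -999)

# Predicate-based: y is never evaluated because `.fn` is supplied.
na_if_sketch(c(0, 1, -999), .fn = function(x) x < 0)
```

Because R evaluates default arguments lazily, y can be left out entirely when a custom .fn is given.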

romainfrancois avatar May 06 '21 09:05 romainfrancois

Probably needs some type checking, i.e.

funs::na_if(1:10, TRUE)
#>  [1] NA  2  3  4  5  6  7  8  9 10

Created on 2021-05-06 by the reprex package (v2.0.0)

romainfrancois avatar May 06 '21 12:05 romainfrancois

Is this expected? cc @lionel- @DavisVaughan

vctrs::vec_in(1:4, TRUE)
#> [1]  TRUE FALSE FALSE FALSE

Created on 2021-05-06 by the reprex package (v2.0.0)

romainfrancois avatar May 06 '21 12:05 romainfrancois

Is this expected?

I think so, because we use the common type which is integer in that case. Maybe we should directionally coerce instead though? This would make this case an error.
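
The common-type step can be checked directly with vctrs. This only illustrates the current behaviour described above; it does not show what a directional coercion would look like:

```r
library(vctrs)

# The common type of an integer vector and a logical is integer ...
vec_ptype2(1:4, TRUE)

# ... so TRUE is compared as 1L, which matches the first element of 1:4.
vec_cast(TRUE, integer())
vec_in(1:4, TRUE)
vec_in(1:4, 1L)
```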

lionel- avatar May 06 '21 12:05 lionel-

Isn't it a missed opportunity that this only sets NA? Perhaps we can have:

#' @export
patch_if <- function(x, y, replacement) {
  if (is_formula(y)) {
    y <- as_function(y)
  }

  if (is_function(y)) {
    selected <- vec_assert(y(x), ptype = logical(), size = vec_size(x))
  } else {
    selected <- vec_in(x, y, needles_arg = "x", haystack_arg = "y")
  }
  vec_slice(x, selected) <- replacement

  x
}

romainfrancois avatar May 06 '21 14:05 romainfrancois

Hmmm, interesting idea.

hadley avatar May 06 '21 16:05 hadley

I could see this being two functions:

  • replace_values(x, what, with = NA)
  • replace_at(x, i, with = NA)

Where:

  • what is a vector of values with the same type as x
  • i is a valid subscript into x, or a predicate function generating a subscript into x. So it could be:
    • A logical vector
    • An integer vector of locations
    • A predicate function taking x and generating one of the above
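
One way the two functions described above could look, as a rough sketch built on vctrs. The bodies are assumptions for illustration; in particular, casting what to the type of x is only a stand-in for "same type as x":

```r
library(vctrs)

# Hypothetical: replace every element of x equal to one of the values in `what`.
replace_values <- function(x, what, with = NA) {
  # Errors if `what` cannot be converted to the type of x.
  what <- vec_cast(what, to = x, x_arg = "what", to_arg = "x")
  vec_slice(x, vec_in(x, what)) <- with
  x
}

# Hypothetical: `i` is a logical vector, integer locations, or a predicate
# function that produces one of those when called on x.
replace_at <- function(x, i, with = NA) {
  if (is.function(i)) {
    i <- i(x)
  }
  vec_slice(x, i) <- with
  x
}

replace_values(c(1, 2, -999), -999)
replace_at(1:5, c(TRUE, FALSE, FALSE, FALSE, TRUE))
replace_at(1:5, 2:3, 0L)
replace_at(c(1, NA, 3), is.na, 0)
```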

I have always found the y argument of na_if() a bit confusing. It is hard to explain why, but it has something to do with the pairing of "if" in the function name with the fact that you supply values to replace with NA. To me, "if" implies that some kind of logical predicate should be involved.

DavisVaughan avatar May 06 '21 17:05 DavisVaughan

Yeah, agreed that na_if is confusing. It was meant to be a direct translation of nullif() from SQL, but hardly anyone knows what that is so it doesn't help.

hadley avatar May 06 '21 17:05 hadley

replace_at() defined in this way would behave differently than the dplyr and purrr functions with the same suffix. If we use the existing naming scheme, replace_at() would take names or locations, and replace_if() would take a vectorised predicate or a logical vector.

These functions could also be named replace(), set_at(), set_if().
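
Under that naming scheme, the split might look roughly like this. set_at() and set_if() here are hypothetical sketches on top of vctrs, not existing functions:

```r
library(vctrs)

# Hypothetical: `at` is restricted to names or locations.
set_at <- function(x, at, value) {
  stopifnot(is.character(at) || is.numeric(at))
  vec_slice(x, at) <- value
  x
}

# Hypothetical: `p` is a logical vector or a vectorised predicate.
set_if <- function(x, p, value) {
  if (is.function(p)) {
    p <- p(x)
  }
  vec_assert(p, ptype = logical(), size = vec_size(x))
  vec_slice(x, p) <- value
  x
}

set_at(c(a = 1, b = 2, c = 3), "b", NA)
set_at(1:3, c(1, 3), 0L)
set_if(c(1, NA, 3), is.na, 0)
```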

lionel- avatar May 06 '21 17:05 lionel-

Another variant to consider:

x |> set_across(starts_with("foo"), NA)

lionel- avatar May 06 '21 18:05 lionel-

In case it is useful, this is how I've named the functions in naniar for replacing values with NA

http://naniar.njtierney.com/articles/replace-with-na.html#notes-on-alternative-ways-to-handle-replacing-with-nas

njtierney avatar May 07 '21 05:05 njtierney

replace_at() defined in this way would behave differently than the dplyr and purrr functions with the same suffix.

To be clear, I think it might be a good idea to gather all these index semantics in a single function. This would be consistent with the move to across() in dplyr. In general the overloading of [ is an important part of the vector interface in R and I no longer think it's important to make explicit the kind of selection used at the call site (which is often clear from the code anyway).

I'm just worried about reusing _at in a different way than in purrr and the superseded dplyr functions. Maybe we don't need a suffix, e.g.

replace <- function(x, set, value) { ... }  # set: Set of values
set <- function(x, where, value) { ... }    # where: Locations, names, logicals, predicate

set(x, is.na, "foo")
set(x, x == "foo", NA)

lionel- avatar May 07 '21 07:05 lionel-

Further playing with the idea of "replacing many things" here https://github.com/tidyverse/funs/pull/66

library(magrittr)
library(funs, warn.conflicts = FALSE)

alphabet <- c(letters[1:10], NA)
alphabet %>% 
  patch(
    when(c("a", "e", "i", "o", "u"), "vowel"),
    when(NA                        , "missing"), 
    when(default                   , "consonant")
  )
#>  [1] "vowel" "b"     "c"     "d"     "vowel" "f"     "g"     "h"     "vowel"
#> [10] "j"     NA

x <- 1:10
x %>% 
  patch(
    when(~.x < 3   , 3), 
    when(~. > 7    , 7), 
    when(c(4, 5, 6), NA)
  )
#>  [1]  3  3  3 NA NA NA  7  7  7  7

x %>% 
  patch(
    when(x < 3     , 3), 
    when(x > 7     , 7), 
    when(c(4, 5, 6), NA)
  )
#>  [1]  3  3  3 NA NA NA  7  7  7  7

Created on 2021-05-07 by the reprex package (v2.0.0)
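
For readers without the PR branch installed, here is a rough sketch of how such an interface could be built from vctrs and rlang. It is not the code from #66: when_sketch() and patch_sketch() are hypothetical names, every clause is assumed to be tested against the original x with later clauses overwriting earlier ones, and the default clause is not handled.

```r
library(vctrs)
library(rlang)

# Hypothetical: a clause is just a (what, with) pair.
when_sketch <- function(what, with) {
  list(what = what, with = with)
}

# Hypothetical: apply each clause in order.
patch_sketch <- function(x, ...) {
  out <- x
  for (clause in list(...)) {
    what <- clause$what
    if (is_formula(what)) {
      what <- as_function(what)       # ~ .x < 3 becomes a predicate
    }
    if (is_function(what)) {
      selected <- what(x)             # predicate applied to the original x
    } else if (is.logical(what) && vec_size(what) == vec_size(x)) {
      selected <- what                # precomputed logical vector
    } else {
      selected <- vec_in(x, what)     # set of values to match
    }
    vec_slice(out, selected) <- clause$with
  }
  out
}

x <- 1:10
patch_sketch(
  x,
  when_sketch(~ .x < 3, 3),
  when_sketch(x > 7, 7),
  when_sketch(c(4, 5, 6), NA)
)
```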

romainfrancois avatar May 07 '21 08:05 romainfrancois

na_if() then is:

library(funs, warn.conflicts = FALSE)

x <- 1:10
na_if  <- function(x, what) {
  patch(x, when(what, NA))
}
na_if(x, x == 2)
#>  [1]  1 NA  3  4  5  6  7  8  9 10

Created on 2021-05-07 by the reprex package (v2.0.0)

romainfrancois avatar May 07 '21 08:05 romainfrancois

This feels highly related to https://twitter.com/antoine_fabri/status/1392127389195452416, which I have wanted a better solution to for a while now. The key here is that with is allowed to be vectorized with the same length as x, not the same length as which(where), which is why base::replace() wouldn't work.

library(dplyr)
library(vctrs)

replace_if <- function(x, where, with) {
  x_size <- vec_size(x)
  
  vec_assert(where, ptype = logical(), size = x_size, arg = "where")
  
  with <- vec_recycle(with, x_size, x_arg = "with")
  with <- vec_cast(with, x, x_arg = "with", to_arg = "x")
  
  with <- vec_slice(with, where)
  
  vec_assign(x, where, with, x_arg = "x", value_arg = "with")
}

band_instruments %>%
  mutate(
    name = replace_if(name, plays == "guitar", paste0(name, "!")),
    plays2 = replace_if(plays, plays == "bass", NA)
  )
#> # A tibble: 3 x 3
#>   name   plays  plays2
#>   <chr>  <chr>  <chr> 
#> 1 John!  guitar guitar
#> 2 Paul   bass   <NA>  
#> 3 Keith! guitar guitar
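
For contrast, this base R snippet shows the base::replace() misalignment mentioned above, using plain vectors with the same values as band_instruments:

```r
name  <- c("John", "Paul", "Keith")
plays <- c("guitar", "bass", "guitar")
where <- plays == "guitar"
with  <- paste0(name, "!")

# base::replace() pairs the i-th selected slot with the i-th replacement
# value, so a full-length `with` is misaligned (R also warns about the
# length mismatch, suppressed here).
misaligned <- suppressWarnings(replace(name, where, with))
misaligned
#> [1] "John!" "Paul"  "Paul!"

# Slicing `with` by `where` first restores the alignment, which is what
# replace_if() does internally with vec_slice().
aligned <- replace(name, where, with[where])
aligned
#> [1] "John!"  "Paul"   "Keith!"
```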

DavisVaughan avatar May 11 '21 16:05 DavisVaughan

Still on #66 and its proposed patch(when()) syntax, which allows replacing multiple things:

library(dplyr, warn.conflicts = FALSE)
library(funs, warn.conflicts = FALSE)

band_instruments %>%
  mutate(
    name = patch(name, 
      when(plays == "guitar", paste0(name, "!")), 
      when(plays == "bass", paste0(name, "@"))
    )
  )
#> # A tibble: 3 x 2
#>   name   plays 
#>   <chr>  <chr> 
#> 1 John!  guitar
#> 2 Paul@  bass  
#> 3 Keith! guitar

Created on 2021-05-17 by the reprex package (v2.0.0)

Using when() (or something similar) inside patch(...) lets us replace multiple things at once. And because when(what=, with=) takes ordinary arguments rather than a formula as in case_when(), a formula can be used for what= and (maybe, but not yet) for with=.
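
For comparison, the same replacement written with the existing dplyr::case_when(). Each clause is a two-sided formula, so the left-hand side has to be a logical expression, and the untouched values need an explicit fallback:

```r
library(dplyr, warn.conflicts = FALSE)

out <- band_instruments %>%
  mutate(
    name = case_when(
      plays == "guitar" ~ paste0(name, "!"),
      plays == "bass"   ~ paste0(name, "@"),
      TRUE              ~ name   # keep everything else unchanged
    )
  )
out
```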

romainfrancois avatar May 17 '21 08:05 romainfrancois