
Odd behaviour with `extract()` when some elements are absent

Open olivroy opened this issue 2 years ago • 0 comments

When some elements do not match the pattern, extract() handles them poorly: for any row without a full match, every new column silently becomes NA. I think it would be better to at least emit a warning in that case.


I wonder if extract() could get arguments similar to the fill and extra arguments of separate().

I know it is easy to solve with dplyr and a couple more steps, but I think that extract() provides a clean way to perform this task.
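For illustration, one such dplyr workaround might look like the sketch below. It is only a sketch under stated assumptions: the anchored version of the pattern, the `matched` helper vector, and the wording of the warning are made up here and are not part of tidyr.

```r
dat <- tibble::tibble(
  x = c("foo (1)", "foo1", "foo2 (2)", "foo3 (3)")
)

# Same pattern as in the reprex below, anchored so that sub() replaces
# the whole string (assumption: full-string matches are what is wanted).
pattern <- "^(.+) \\((\\d+)\\)$"

# Hypothetical helper: which rows match the full pattern?
matched <- grepl(pattern, dat$x, perl = TRUE)
if (!all(matched)) {
  warning("Pattern not matched in rows: ", toString(which(!matched)))
}

res <- dat |>
  dplyr::transmute(
    # sub() returns non-matching strings unchanged, so "foo1" is kept as-is.
    x1 = sub(pattern, "\\1", x, perl = TRUE),
    # Only matching rows get a number; the others stay NA.
    x2 = as.integer(
      ifelse(matched, sub(pattern, "\\2", x, perl = TRUE), NA_character_)
    )
  )
res
```

This keeps the unmatched string in x1 and leaves only x2 missing, which is roughly what a fill = "right" option for extract() would be expected to do.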

dat <- tibble::tibble(
  x = c("foo (1)", "foo1", "foo2 (2)", "foo3 (3)")
)
dat |> 
    tidyr::extract(
        x, 
        into = c("x1", "x2"),
        regex = "(.+) \\((\\d+)\\)",
        convert = TRUE
    )
#> # A tibble: 4 x 2
#>   x1       x2
#>   <chr> <int>
#> 1 foo       1
#> 2 <NA>     NA
#> 3 foo2      2
#> 4 foo3      3

# separate() handles the same input and warns about the missing piece.
dat |> 
  tidyr::separate(
    x, 
    into = c("x1", "x2"),
    sep = " "
  )
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [2].
#> # A tibble: 4 x 2
#>   x1    x2   
#>   <chr> <chr>
#> 1 foo   (1)  
#> 2 foo1  <NA> 
#> 3 foo2  (2)  
#> 4 foo3  (3)

Created on 2022-06-29 by the reprex package (v2.0.1)

You can silence the separate() warning by specifying fill = "right". Maybe a similar argument could be useful in extract().
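As a minimal sketch of that existing separate() behaviour, using the same data as in the reprex (the `res` name is only for illustration):

```r
dat <- tibble::tibble(
  x = c("foo (1)", "foo1", "foo2 (2)", "foo3 (3)")
)

# fill = "right" pads rows that have too few pieces with NA on the right
# and does not emit the "Expected 2 pieces" warning.
res <- dat |>
  tidyr::separate(
    x,
    into = c("x1", "x2"),
    sep = " ",
    fill = "right"
  )
res
```

The second row keeps "foo1" in x1 and gets NA in x2, without a warning.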

Expected output

# With a warning
#> # A tibble: 4 x 2
#>   x1    x2   
#>   <chr> <chr>
#> 1 foo   1    
#> 2 foo1  <NA> 
#> 3 foo2  2    
#> 4 foo3  3
# with code that would look like this:
dat |> 
    tidyr::extract(
        x, 
        into = c("x1", "x2"),
        regex = "(.+) \\((\\d+)\\)",
        convert = TRUE, 
        fill = "right"
    )

olivroy, Jun 29 '22 16:06