feat: Implement separate()

Open etiennebacher opened this issue 2 years ago • 7 comments

This PR implements separate() to split one column into several, based either on a regex or on character position.

@nathaneastwood this PR is not complete. I'm opening it as a draft so that the work is saved somewhere and so that you can help with the TODO list if you have some time.

TODO:

  • [ ] fix the extra = "merge" case (1 test failing so far)
  • [ ] implement the fill argument (how this argument works is not very clear to me)

Some examples:

suppressPackageStartupMessages(library(poorman))

df <- data.frame(x = c(NA, "x.y", "x.z", "y.z"))
df
#>      x
#> 1 <NA>
#> 2  x.y
#> 3  x.z
#> 4  y.z
df %>% separate(x, c("A", "B"))
#>      A    B
#> 1 <NA> <NA>
#> 2    x    y
#> 3    x    z
#> 4    y    z

df <- data.frame(x = c(NA, "a1b", "c4d", "e9g"))
df
#>      x
#> 1 <NA>
#> 2  a1b
#> 3  c4d
#> 4  e9g
df %>% separate(x, c("A","B"), sep = "[0-9]")
#>      A    B
#> 1 <NA> <NA>
#> 2    a    b
#> 3    c    d
#> 4    e    g

df <- data.frame(x = c("x", "x y", "x y z", NA))
df
#>       x
#> 1     x
#> 2   x y
#> 3 x y z
#> 4  <NA>
df %>% separate(x, c("a", "b"))
#> Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [3].
#>      a    b
#> 1    x <NA>
#> 2    x    y
#> 3    x    y
#> 4 <NA> <NA>

Created on 2022-08-03 by the reprex package (v2.0.1)

etiennebacher avatar Aug 03 '22 12:08 etiennebacher

Thanks, this looks nice. I'm going away until the end of the week, starting from tonight. I'll try to look properly when I'm back.

nathaneastwood avatar Aug 03 '22 15:08 nathaneastwood

I took a look at the extra = "merge" case. I think we could use the following to split up the strings:

    n_max <- length(into)
    m <- gregexpr(sep, as.character(data[[col]]), perl = TRUE)
    if (n_max > 0) {
      m <- lapply(m, function(x) {
        i <- seq_along(x) < n_max
        structure(
          x[i],
          match.length = attr(x, "match.length")[i],
          index.type = attr(x, "index.type"),
          useBytes = attr(x, "useBytes")
        )
      })
    }
    regmatches(as.character(data[[col]]), m, invert = TRUE)

The problem is that this doesn't discard the "extra" pieces:

df <- data.frame(x = c("x", "x y", "x y z", NA))
#      a    b
# 1    x <NA>
# 2    x    y
# 3    x  y z
# 4 <NA> <NA>

Row 3 should be x y, with a warning. This differs from the approach you took, which uses strsplit().
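One possible way to reconcile the two approaches, as a rough sketch only: use the gregexpr()/regmatches() route when extra = "merge" and fall back to strsplit() otherwise, truncating surplus pieces. The helper name split_extra() and its signature are hypothetical and not part of poorman; the default sep and the warning wording are assumptions modelled on the output shown earlier in this thread.

```r
# Hypothetical helper, not part of poorman: split `x` into at most
# length(into) pieces and handle surplus pieces according to `extra`.
split_extra <- function(x, into, sep = "[^[:alnum:]]+",
                        extra = c("warn", "drop", "merge")) {
  extra <- match.arg(extra)
  x <- as.character(x)
  n_max <- length(into)

  if (extra == "merge") {
    # Keep only the first n_max - 1 separator matches so the tail stays whole.
    m <- gregexpr(sep, x, perl = TRUE)
    m <- lapply(m, function(mi) {
      i <- seq_along(mi) < n_max
      structure(
        mi[i],
        match.length = attr(mi, "match.length")[i],
        index.type = attr(mi, "index.type"),
        useBytes = attr(mi, "useBytes")
      )
    })
    return(regmatches(x, m, invert = TRUE))
  }

  # "warn" and "drop": split on every separator, then discard surplus pieces.
  pieces <- strsplit(x, sep, perl = TRUE)
  too_many <- which(lengths(pieces) > n_max)
  if (extra == "warn" && length(too_many) > 0) {
    warning(
      sprintf(
        "Expected %d pieces. Additional pieces discarded in %d rows [%s].",
        n_max, length(too_many), paste(too_many, collapse = ", ")
      ),
      call. = FALSE
    )
  }
  lapply(pieces, function(p) p[seq_len(min(length(p), n_max))])
}

x <- c("x", "x y", "x y z", NA)
split_extra(x, c("a", "b"), extra = "merge")[[3]]
split_extra(x, c("a", "b"), extra = "drop")[[3]]
```

The result is still a list of character vectors of varying length, so padding short rows with NA (the job of fill) would remain a separate step.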

nathaneastwood avatar Aug 14 '22 16:08 nathaneastwood

Here is an example of what fill is supposed to do (taken from the tidyr tests):

r$> df                                                                 
# A tibble: 2 × 1
  x    
  <chr>
1 a b  
2 a b c

r$> tidyr::separate(df, x, c("x", "y", "z"), fill = "left")            
# A tibble: 2 × 3
  x     y     z    
  <chr> <chr> <chr>
1 NA    a     b    
2 a     b     c    

r$> tidyr::separate(df, x, c("x", "y", "z"), fill = "right")           
# A tibble: 2 × 3
  x     y     z    
  <chr> <chr> <chr>
1 a     b     NA   
2 a     b     c

r$> tidyr::separate(df, x, c("x", "y", "z"), fill = "warn")            
# A tibble: 2 × 3
  x     y     z    
  <chr> <chr> <chr>
1 a     b     NA   
2 a     b     c    
Warning message:
Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].
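
Based on the tidyr output above, fill could probably be handled as a padding step after the split. The following is a minimal sketch, assuming the pieces are already available as a list of character vectors; pad_pieces() is a hypothetical name, and skipping all-NA rows when counting "missing pieces" is an assumption about tidyr's behaviour rather than something confirmed in this thread.

```r
# Hypothetical helper, not part of poorman: pad each element of `pieces`
# (a list of character vectors) with NA so that every row has `n` pieces.
pad_pieces <- function(pieces, n, fill = c("warn", "right", "left")) {
  fill <- match.arg(fill)

  # Assumption: rows that are entirely NA are not reported as "missing pieces".
  all_na <- vapply(pieces, function(p) all(is.na(p)), logical(1))
  short <- which(lengths(pieces) < n & !all_na)
  if (fill == "warn" && length(short) > 0) {
    warning(
      sprintf(
        "Expected %d pieces. Missing pieces filled with `NA` in %d rows [%s].",
        n, length(short), paste(short, collapse = ", ")
      ),
      call. = FALSE
    )
  }

  padded <- lapply(pieces, function(p) {
    p <- head(p, n)  # guard against rows that still have surplus pieces
    pad <- rep(NA_character_, n - length(p))
    # "left" puts the NAs first; "right" and "warn" put them last.
    if (fill == "left") c(pad, p) else c(p, pad)
  })
  do.call(rbind, padded)
}

pieces <- list(c("a", "b"), c("a", "b", "c"))
pad_pieces(pieces, 3, fill = "left")
pad_pieces(pieces, 3, fill = "right")
```

The character matrix returned here would still need to be converted to columns named after into and bound back onto the data.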

nathaneastwood avatar Aug 14 '22 16:08 nathaneastwood

Passing thought: it might be worth implementing tidyr's new functions separate_wider_delim(), separate_wider_position(), and separate_wider_regex(). separate() would then simply call one of these, depending on the type of input.

etiennebacher avatar Jan 26 '23 20:01 etiennebacher

I saw those. I may give them a miss. At some point I need to make a cut-off, and dplyr and tidyr 1.0.0 make sense to me as that point.

nathaneastwood avatar Jan 26 '23 23:01 nathaneastwood

I understand that you can't cover all new things in dplyr and tidyr. What I meant is just that even from the developer's point of view, it might be easier/cleaner to create these 3 functions separately and then call them in separate(). And then, since those functions will exist, it won't cost much to export them.

etiennebacher avatar Jan 27 '23 12:01 etiennebacher

Ah I see what you mean. Yeah that seems like a good point, actually.
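
To make the idea concrete, here is a rough sketch of how the dispatch could look, assuming that a numeric sep means character positions and a character sep means a regex. The helper names split_by_regex(), split_by_position() and split_column() are placeholders invented for illustration; they are not the tidyr functions and do not exist in poorman.

```r
# Hypothetical internal helpers; names and signatures are placeholders.
split_by_regex <- function(x, sep) {
  strsplit(as.character(x), sep, perl = TRUE)
}

split_by_position <- function(x, sep) {
  lapply(as.character(x), function(s) {
    if (is.na(s)) return(NA_character_)
    n <- nchar(s)
    # Assumption: negative positions count from the right-hand end.
    pos <- ifelse(sep < 0, n + sep, sep)
    pos <- pmin(pmax(pos, 0), n)
    substring(s, c(1, pos + 1), c(pos, n))
  })
}

# The part separate() would call: choose a splitter from the type of `sep`.
split_column <- function(x, sep) {
  if (is.numeric(sep)) {
    split_by_position(x, sep)
  } else if (is.character(sep)) {
    split_by_regex(x, sep)
  } else {
    stop("`sep` must be either numeric or character", call. = FALSE)
  }
}

split_column(c("a1b", "c4d"), "[0-9]")
split_column(c("abc", "defg"), 1)
split_column("abc", -1)
```

With this layout, the handling of extra and fill could live in one place after the split, whichever splitter is used.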

nathaneastwood avatar Jan 27 '23 13:01 nathaneastwood