poorman
feat: Implement separate()
This PR implements separate() to split a column into several columns, based either on a regex or on character positions.
@nathaneastwood this PR is not complete. I opened it as a draft so that it is saved somewhere and so that you can help with the TODO list if you have some time.
TODO:
- [ ] fix when extra = "merge" (1 test failing so far)
- [ ] implement argument fill (the way this argument works is not very clear to me)
Some examples:
suppressPackageStartupMessages(library(poorman))
df <- data.frame(x = c(NA, "x.y", "x.z", "y.z"))
df
#> x
#> 1 <NA>
#> 2 x.y
#> 3 x.z
#> 4 y.z
df %>% separate(x, c("A", "B"))
#> A B
#> 1 <NA> <NA>
#> 2 x y
#> 3 x z
#> 4 y z
df <- data.frame(x = c(NA, "a1b", "c4d", "e9g"))
df
#> x
#> 1 <NA>
#> 2 a1b
#> 3 c4d
#> 4 e9g
df %>% separate(x, c("A","B"), sep = "[0-9]")
#> A B
#> 1 <NA> <NA>
#> 2 a b
#> 3 c d
#> 4 e g
df <- data.frame(x = c("x", "x y", "x y z", NA))
df
#> x
#> 1 x
#> 2 x y
#> 3 x y z
#> 4 <NA>
df %>% separate(x, c("a", "b"))
#> Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [3].
#> a b
#> 1 x <NA>
#> 2 x y
#> 3 x y
#> 4 <NA> <NA>
Created on 2022-08-03 by the reprex package (v2.0.1)
Thanks, this looks nice. I'm going away until the end of the week, starting from tonight. I'll try to look properly when I'm back.
I took a look into some of this regarding extra = "merge". I think we could use the following to split up the strings:
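n_max <- length(into)
# Locate every separator match in each string
m <- gregexpr(sep, as.character(data[[col]]), perl = TRUE)
if (n_max > 0) {
  # Keep only the first n_max - 1 matches so that at most n_max pieces result
  m <- lapply(m, function(x) {
    i <- seq_along(x) < n_max
    structure(
      x[i],
      match.length = attr(x, "match.length")[i],
      index.type = attr(x, "index.type"),
      useBytes = attr(x, "useBytes")
    )
  })
}
# invert = TRUE returns the text between the retained matches
regmatches(as.character(data[[col]]), m, invert = TRUE)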
The problem is this doesn't get rid of "extra" information.
df <- data.frame(x = c("x", "x y", "x y z", NA))
#      a    b
# 1    x <NA>
# 2    x    y
# 3    x  y z
# 4 <NA> <NA>
Row 3 should be x y with a warning. This differs from the approach you took, which uses strsplit().
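One possible way to combine the two approaches, as a rough sketch only: use the truncated regmatches() matches when extra = "merge" (tidyr documents that option as splitting at most length(into) times), and fall back to strsplit() plus truncation for "warn" and "drop". The helper name split_pieces(), its arguments, and the warning wording are assumptions made for illustration, not code from this PR.

```r
# Hypothetical helper (not part of poorman): split each string into at most
# `n` pieces, handling surplus pieces according to `extra`.
split_pieces <- function(x, sep = "[^[:alnum:]]+", n,
                         extra = c("warn", "drop", "merge")) {
  extra <- match.arg(extra)
  x <- as.character(x)

  if (extra == "merge") {
    # Keep only the first n - 1 separator matches, so everything after the
    # last retained separator stays together in the final piece.
    m <- gregexpr(sep, x, perl = TRUE)
    m <- lapply(m, function(pos) {
      i <- seq_along(pos) < n
      structure(
        pos[i],
        match.length = attr(pos, "match.length")[i],
        index.type = attr(pos, "index.type"),
        useBytes = attr(pos, "useBytes")
      )
    })
    return(regmatches(x, m, invert = TRUE))
  }

  # "warn" and "drop": split on every separator, then discard the surplus.
  pieces <- strsplit(x, sep, perl = TRUE)
  too_long <- which(lengths(pieces) > n)
  if (extra == "warn" && length(too_long) > 0) {
    warning(
      sprintf(
        "Expected %d pieces. Additional pieces discarded in %d rows [%s].",
        n, length(too_long), paste(too_long, collapse = ", ")
      ),
      call. = FALSE
    )
  }
  lapply(pieces, function(p) p[seq_len(min(length(p), n))])
}

x <- c("x", "x y", "x y z", NA)
split_pieces(x, sep = " ", n = 2, extra = "merge")[[3]]
split_pieces(x, sep = " ", n = 2, extra = "drop")[[3]]
```

With this split, the regmatches() route never needs to drop anything, and the strsplit() route never needs to glue pieces back together.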
Here is an example of what fill is supposed to do (taken from the tidyr tests):
r$> df
# A tibble: 2 × 1
x
<chr>
1 a b
2 a b c
r$> tidyr::separate(df, x, c("x", "y", "z"), fill = "left")
# A tibble: 2 × 3
x y z
<chr> <chr> <chr>
1 NA a b
2 a b c
r$> tidyr::separate(df, x, c("x", "y", "z"), fill = "right")
# A tibble: 2 × 3
x y z
<chr> <chr> <chr>
1 a b NA
2 a b c
r$> tidyr::separate(df, x, c("x", "y", "z"), fill = "warn")
# A tibble: 2 × 3
x y z
<chr> <chr> <chr>
1 a b NA
2 a b c
Warning message:
Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].
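In base R, the padding shown in these tidyr outputs could be sketched roughly as follows. The helper name pad_pieces() and its arguments are hypothetical, and skipping all-NA rows in the warning is an assumption about the desired behaviour, not something confirmed in this thread.

```r
# Hypothetical helper (not part of poorman): pad each vector of pieces with
# NA up to length `n`, on the side selected by `fill`.
pad_pieces <- function(pieces, n, fill = c("warn", "right", "left")) {
  fill <- match.arg(fill)

  # Rows that produced too few pieces. Rows that were NA to begin with are
  # skipped here (assumption: they should not trigger the warning).
  all_na <- vapply(pieces, function(p) all(is.na(p)), logical(1))
  too_short <- which(lengths(pieces) < n & !all_na)

  if (fill == "warn" && length(too_short) > 0) {
    warning(
      sprintf(
        "Expected %d pieces. Missing pieces filled with `NA` in %d rows [%s].",
        n, length(too_short), paste(too_short, collapse = ", ")
      ),
      call. = FALSE
    )
  }

  lapply(pieces, function(p) {
    pad <- rep(NA_character_, max(n - length(p), 0))
    # "warn" pads on the right, exactly like "right", after warning.
    if (fill == "left") c(pad, p) else c(p, pad)
  })
}

pieces <- strsplit(c("a b", "a b c"), " ", fixed = TRUE)
out <- pad_pieces(pieces, n = 3, fill = "left")
res <- as.data.frame(do.call(rbind, out), stringsAsFactors = FALSE)
names(res) <- c("x", "y", "z")
res
```

Since each vector comes back with exactly n elements, binding them row-wise into a data frame is straightforward.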
Passing thought: it might be worth implementing tidyr's new functions separate_wider_delim(), separate_wider_position(), and separate_wider_regex(). separate() would then only call one of these depending on the type of input.
I saw those. I may give them a miss. At some point I need to make a cut-off, and dplyr and tidyr 1.0.0 make sense to me.
I understand that you can't cover all the new things in dplyr and tidyr. What I meant is just that, even from the developer's point of view, it might be easier/cleaner to create these 3 functions separately and then call them in separate(). And then, since those functions will exist, it won't cost much to export them.
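To make the idea concrete, here is a minimal sketch of such a dispatch. All function names below are hypothetical and the bodies are deliberately simplified; they are not tidyr's implementations. Only two branches are shown, because separate() can distinguish a numeric sep (positions) from a character sep (regex) by type alone; a fixed-delimiter helper would be a third function of the same shape.

```r
# Hypothetical internal helper: split on a regular expression.
split_by_regex <- function(x, pattern) {
  strsplit(x, pattern, perl = TRUE)
}

# Hypothetical internal helper: split after the given character positions.
# Assumes `at` holds positive, increasing positions (negative positions are
# not handled in this sketch).
split_by_position <- function(x, at) {
  lapply(x, function(s) {
    starts <- c(1, at + 1)
    ends <- c(at, nchar(s))
    substring(s, starts, ends)
  })
}

# Dispatcher: choose the helper from the type of `sep`.
split_dispatch <- function(x, sep) {
  x <- as.character(x)
  if (is.numeric(sep)) {
    split_by_position(x, sep)
  } else if (is.character(sep)) {
    split_by_regex(x, sep)
  } else {
    stop("`sep` must be either numeric or character.", call. = FALSE)
  }
}

split_dispatch(c("a1b", "c4d"), "[0-9]")
split_dispatch("abcd", 2)
```

Each helper can then be tested on its own, and exporting them later only requires adding the user-facing argument handling.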
Ah I see what you mean. Yeah that seems like a good point, actually.