stringr
Performance improvements with factor levels
Base sub() and gsub() contain a check for factors that can drastically improve performance: when x is a factor with fewer levels than elements, the substitution runs once per level and the result is indexed back out by x. Below is an example of the improvement that {stringr} could get from the same check. I'd imagine the check would live inside the functions themselves rather than in an additional function (as it is below).
# contains a check for factor & levels
sub
#> function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
#>     fixed = FALSE, useBytes = FALSE)
#> {
#>     if (is.factor(x) && length(levels(x)) < length(x)) {
#>         sub(pattern, replacement, levels(x), ignore.case, perl,
#>             fixed, useBytes)[x]
#>     }
#>     else {
#>         if (!is.character(x))
#>             x <- as.character(x)
#>         .Internal(sub(as.character(pattern), as.character(replacement),
#>             x, ignore.case, perl, fixed, useBytes))
#>     }
#> }
#> <bytecode: 0x0000021504f21ee8>
#> <environment: namespace:base>
foo <- function() sample(letters[1:5], 1e4, TRUE)
x <- paste0(foo(), foo(), foo())
fx <- factor(x)
library(stringr)
str_remove_fct <- function(string, pattern) {
  str_remove(levels(string), pattern)[string]
}
res <- bench::mark(
  sub("a", "", x),
  sub("a", "", fx),
  str_remove(x, "a"),
  str_remove(fx, "a"),
  str_remove_fct(fx, "a")
)
res
#> # A tibble: 5 × 6
#>   expression                   min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 sub("a", "", x)            2.4ms   2.67ms      353.    78.2KB     0
#> 2 sub("a", "", fx)          86.8µs    145µs     6154.    79.2KB     8.58
#> 3 str_remove(x, "a")        2.34ms    3.7ms      237.   180.5KB     0
#> 4 str_remove(fx, "a")       2.28ms   2.69ms      297.   158.5KB     0
#> 5 str_remove_fct(fx, "a")  171.3µs 281.45µs     3299.    79.2KB     6.35
ggplot2::autoplot(res)
#> Loading required namespace: tidyr
Created on 2022-09-08 with reprex v2.0.2
Related issue: https://github.com/gagolews/stringi/issues/435
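For concreteness, here is a rough sketch of what the check might look like when folded into the function itself, mirroring the branch in base sub() shown above. The name str_remove_checked is hypothetical and not part of stringr; this is only one way such a check could be written, not how stringr would necessarily implement it.

```r
library(stringr)

# Hypothetical wrapper (not a stringr function): run the replacement on the
# factor levels only when that is less work than processing every element.
str_remove_checked <- function(string, pattern) {
  if (is.factor(string) && nlevels(string) < length(string)) {
    # Indexing a character vector by a factor uses its integer codes,
    # so each element picks up the processed version of its own level.
    str_remove(levels(string), pattern)[string]
  } else {
    str_remove(string, pattern)
  }
}

x <- c("abc", "bca", "abc", NA, "cab")
# Both calls should return the same character vector
str_remove_checked(factor(x), "a")
str_remove(x, "a")
```

As with base sub(), the factor branch returns a character vector, so callers would see the same result type whether or not the shortcut is taken.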
I think this is out of scope for stringr. You can use forcats::fct_relabel() + stringr functions if this performance improvement is meaningful for your data.
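A minimal sketch of that suggestion, assuming both forcats and stringr are installed. The data here is made up for illustration. fct_relabel() calls the supplied function on the levels rather than on every element, which is the same saving as the manual levels(x)[x] trick.

```r
library(forcats)
library(stringr)

x <- c("abc", "bca", "abc", "cab", "bca")
fx <- factor(x)

# The function is applied to the levels of fx, not to each element
res <- fct_relabel(fx, ~ str_remove(.x, "a"))

# The result is still a factor; convert it if a character vector is needed
identical(as.character(res), str_remove(x, "a"))
```

One difference from str_remove(): the result stays a factor, and levels that become identical after the replacement are merged, so an as.character() call is needed when a plain character vector is expected.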