
Performance improvements with factor levels

Open jmbarbone opened this issue 2 years ago • 1 comments

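Base sub and gsub contain a check for factors that can drastically improve performance: when the input is a factor with fewer levels than elements, the replacement is applied to the levels only and the result is then indexed by the factor. Below is an example of the improvement that {stringr} could get from the same check. I'd imagine that the check would live inside the functions themselves, and not be implemented as an additional function (as it is below).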

# contains a check for factor & levels
sub
#> function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE, 
#>     fixed = FALSE, useBytes = FALSE) 
#> {
#>     if (is.factor(x) && length(levels(x)) < length(x)) {
#>         sub(pattern, replacement, levels(x), ignore.case, perl, 
#>             fixed, useBytes)[x]
#>     }
#>     else {
#>         if (!is.character(x)) 
#>             x <- as.character(x)
#>         .Internal(sub(as.character(pattern), as.character(replacement), 
#>             x, ignore.case, perl, fixed, useBytes))
#>     }
#> }
#> <bytecode: 0x0000021504f21ee8>
#> <environment: namespace:base>

foo <- function() sample(letters[1:5], 1e4, TRUE)
x <- paste0(foo(), foo(), foo())
fx <- factor(x)

library(stringr)
str_remove_fct <- function(string, pattern) {
  str_remove(levels(string), pattern)[string]
}

res <- bench::mark(
  sub("a", "", x),
  sub("a", "", fx),
  str_remove(x, "a"),
  str_remove(fx, "a"),
  str_remove_fct(fx, "a")
)

res
#> # A tibble: 5 × 6
#>   expression                   min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 sub("a", "", x)            2.4ms   2.67ms      353.    78.2KB     0   
#> 2 sub("a", "", fx)          86.8µs    145µs     6154.    79.2KB     8.58
#> 3 str_remove(x, "a")        2.34ms    3.7ms      237.   180.5KB     0   
#> 4 str_remove(fx, "a")       2.28ms   2.69ms      297.   158.5KB     0   
#> 5 str_remove_fct(fx, "a")  171.3µs 281.45µs     3299.    79.2KB     6.35

ggplot2::autoplot(res)
#> Loading required namespace: tidyr

Created on 2022-09-08 with reprex v2.0.2
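
For illustration, here is a minimal sketch of what the check could look like if it lived inside a single function, mirroring the condition base sub uses. The name str_remove_lvls is hypothetical and not part of stringr; the fallback branch converts with as.character() explicitly, which is an assumption about the desired behaviour rather than what stringr does internally.

```r
library(stringr)

# Hypothetical helper (not part of stringr): apply the same check as base::sub().
str_remove_lvls <- function(string, pattern) {
  if (is.factor(string) && nlevels(string) < length(string)) {
    # Work on the unique levels only, then expand back to full length.
    # Indexing a character vector by a factor uses the factor's integer codes.
    str_remove(levels(string), pattern)[string]
  } else {
    # Fewer elements than levels, or not a factor: process every element.
    str_remove(as.character(string), pattern)
  }
}

fx_small <- factor(c("ab", "ba", "ab", NA, "cc"))
res_small <- str_remove_lvls(fx_small, "a")
```

As with sub on a factor, the result is a character vector, and missing values in the factor stay missing.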

jmbarbone avatar Sep 08 '22 16:09 jmbarbone

Related issue: https://github.com/gagolews/stringi/issues/435

gagolews avatar Sep 18 '22 05:09 gagolews

I think this is out of scope for stringr. You can use forcats::fct_relabel + stringr functions if this performance improvement is meaningful for your data.
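
A minimal sketch of that suggestion, assuming forcats is installed. fct_relabel() calls the supplied function on the levels only, so the string operation runs once per level instead of once per element. Unlike the str_remove_fct() helper in the original post, the result here is a factor, and levels that become identical after the replacement are collapsed.

```r
library(stringr)
library(forcats)

set.seed(1)
foo <- function() sample(letters[1:5], 1e4, TRUE)
fx <- factor(paste0(foo(), foo(), foo()))

# str_remove() receives levels(fx) as its first argument;
# pattern = "a" is passed through by fct_relabel().
out <- fct_relabel(fx, str_remove, pattern = "a")

# Convert back if a character vector is needed.
out_chr <- as.character(out)
```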

hadley avatar Oct 01 '22 13:10 hadley