
More `fixed_regex_linter` features

Open AshesITR opened this issue 2 years ago • 6 comments

There are some more cases where regexes can be optimized away:

For regex detection functions, we can safely replace

  • grepl("^static_rx", x) by startsWith(x, "static_rx")
  • grepl("static_rx$", x) by endsWith(x, "static_rx")
  • grepl("^static_rx$", x) by x == "static_rx"
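A quick check of the three rewrites, as a sketch on made-up data. One caveat to keep in mind: the results only match on character input without missing values, because grepl() returns FALSE for NA while startsWith(), endsWith() and == return NA.

```r
x <- c("static_rx", "static_rx_suffix", "prefix_static_rx", "other")

# Each pair agrees on NA-free character input.
stopifnot(
  identical(grepl("^static_rx", x), startsWith(x, "static_rx")),
  identical(grepl("static_rx$", x), endsWith(x, "static_rx")),
  identical(grepl("^static_rx$", x), x == "static_rx")
)

# Missing values are where the results differ.
grepl("^static_rx", NA_character_)      # FALSE
startsWith(NA_character_, "static_rx")  # NA
NA_character_ == "static_rx"            # NA
```

Whether that NA difference matters depends on the calling code, for example inside if() or when subsetting.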

For substitution functions, we can replace

  • gsub("^static_rx$", "replacement", x) by (function(.) { .[. == "static_rx"] <- "replacement"; . })(x)
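The anonymous-function form is compact but hard to read. As an alternative sketch (not something proposed in this thread), base R's replace() performs the same subassignment; `replace_exact` is a hypothetical helper name. It assumes x is a character vector and that the replacement is a plain string without backreferences such as \\1.

```r
# Hypothetical helper: swap elements that are exactly equal to `target`.
replace_exact <- function(x, target, replacement) {
  # replace() does x[index] <- replacement on a copy and returns the copy
  replace(x, x == target, replacement)
}

x <- c("static_rx", "static_rx2", NA, "other")

# Same result as the anchored gsub(), including the untouched NA.
stopifnot(identical(
  replace_exact(x, "static_rx", "replacement"),
  gsub("^static_rx$", "replacement", x)
))
```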

AshesITR, May 21 '22 08:05

Benchmarks:

> x <- sample(letters, 1e3, TRUE)

> bench::mark(grepl("^a", x), startsWith(x, "a"))
# A tibble: 2 × 13
  expression              min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result        memory time                gc                   
  <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>        <list> <list>              <list>               
1 grepl("^a", x)      29.86µs  31.15µs    31909.        NA     3.19  9999     1    313.4ms <lgl [1,000]> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>
2 startsWith(x, "a")   2.22µs   2.28µs   426737.        NA    42.7   9999     1     23.4ms <lgl [1,000]> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>

> bench::mark(grepl("a$", x), endsWith(x, "a"))
# A tibble: 2 × 13
  expression            min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result        memory time                gc                   
  <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>        <list> <list>              <list>               
1 grepl("a$", x)    29.81µs  30.27µs    32644.        NA     3.26  9999     1    306.3ms <lgl [1,000]> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>
2 endsWith(x, "a")   3.55µs   3.61µs   273457.        NA    54.7   9998     2     36.6ms <lgl [1,000]> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>

> bench::mark(grepl("^a$", x), x == "a")
# A tibble: 2 × 13
  expression           min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result        memory time                gc                   
  <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>        <list> <list>              <list>               
1 grepl("^a$", x)  30.06µs  30.54µs    32414.        NA     6.48  9998     2    308.4ms <lgl [1,000]> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>
2 x == "a"          2.54µs   2.58µs   383640.        NA    38.4   9999     1     26.1ms <lgl [1,000]> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>

> bench::mark(gsub("^a$", "", x), ifelse(x == "a", "", x))
# A tibble: 2 × 13
  expression                   min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result        memory time               gc                  
  <bch:expr>              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>        <list> <list>             <list>              
1 gsub("^a$", "", x)        85.3µs   86.2µs    11326.        NA     4.09  5542     2      489ms <chr [1,000]> <NULL> <bench_tm [5,544]> <tibble [5,544 × 3]>
2 ifelse(x == "a", "", x)  102.2µs  103.5µs     9609.        NA    17.3   4452     8      463ms <chr [1,000]> <NULL> <bench_tm [4,460]> <tibble [4,460 × 3]>

> bench::mark(gsub("^a$", "", x), dplyr::recode(x, "a" = ""))
# A tibble: 2 × 13
  expression                    min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result        memory time               gc                  
  <bch:expr>               <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>        <list> <list>             <list>              
1 gsub("^a$", "", x)         85.5µs   86.9µs    11181.        NA     2.02  5535     1      495ms <chr [1,000]> <NULL> <bench_tm [5,536]> <tibble [5,536 × 3]>
2 dplyr::recode(x, a = "")   66.2µs   68.5µs    14303.        NA    50.9   5341    19      373ms <chr [1,000]> <NULL> <bench_tm [5,360]> <tibble [5,360 × 3]>

> bench::mark(gsub("^a$", "", x), (\(.) {.[. == "a"] <- ""; .})(x))
# A tibble: 2 × 13
  expression                                    min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result        memory time                gc                   
  <bch:expr>                               <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>        <list> <list>              <list>               
1 gsub("^a$", "", x)                        82.84µs   85.9µs    11458.        NA     2.02  5673     1    495.1ms <chr [1,000]> <NULL> <bench_tm [5,674]>  <tibble [5,674 × 3]> 
2 (function(.) { .[. == "a"] <- "" . })(x)   6.33µs   6.79µs   145736.        NA    58.3   9996     4     68.6ms <chr [1,000]> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>

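Interestingly, gsub("^a$", "", x) is faster than the ifelse() version (median 86.2µs vs. 103.5µs), whereas the subassignment version is by far the fastest (median 6.79µs).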

AshesITR, May 21 '22 08:05

We actually have a different linter for this. It looks for grepl() and substr() usages that can become startsWith()/endsWith().

Because of the substr() part, it became a separate linter.
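For context, a sketch of the kind of substr() comparisons that have the same startsWith()/endsWith() rewrites. The data is made up, and this illustrates the idea rather than the linter's exact matching rules.

```r
x <- c("abcdef", "xyzabc", "ab")

# Prefix comparison via substr() versus startsWith()
stopifnot(identical(substr(x, 1L, 3L) == "abc", startsWith(x, "abc")))

# Suffix comparison via substr() versus endsWith()
n <- nchar(x)
stopifnot(identical(substr(x, n - 2L, n) == "abc", endsWith(x, "abc")))
```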

MichaelChirico, May 21 '22 14:05

Fine by me, although the regex for detecting static regexes would need to be duplicated in that case. Also for the == case?

AshesITR, May 21 '22 14:05

using the C version, we just reused is_not_regex after skipping the initial ^

MichaelChirico, May 21 '22 16:05

Oh, yeah. We can do the same. if (startsWith(x, "^") && is_not_regex(substr(x, 2L, nchar(x)))) ...
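Spelled out a little further, the check could look like the sketch below. `is_static` is a hypothetical, deliberately conservative stand-in for is_not_regex (whose real implementation is not shown in this thread), and `suggest_replacement` is an illustrative name. It works on the pattern's string value, not on source text as a linter would.

```r
# Regex metacharacters; a pattern core containing any of them is not static.
metachars <- c("\\", ".", "*", "+", "?", "^", "$", "(", ")", "[", "]", "{", "}", "|")

# Hypothetical stand-in for is_not_regex(): TRUE if `s` has no metacharacters.
is_static <- function(s) {
  !any(vapply(metachars, grepl, logical(1L), x = s, fixed = TRUE))
}

# Strip a leading ^ and/or trailing $, then test what is left.
suggest_replacement <- function(pattern) {
  anchored_start <- startsWith(pattern, "^")
  anchored_end <- endsWith(pattern, "$")
  core <- substr(pattern, 1L + anchored_start, nchar(pattern) - anchored_end)
  if (!nzchar(core) || !is_static(core)) return("keep regex")
  if (anchored_start && anchored_end) return("==")
  if (anchored_start) return("startsWith")
  if (anchored_end) return("endsWith")
  "fixed = TRUE"
}

suggest_replacement("^static_rx")   # "startsWith"
suggest_replacement("static_rx$")   # "endsWith"
suggest_replacement("^static_rx$")  # "=="
suggest_replacement("^a.c")         # "keep regex"
```

Treating every backslash as a metacharacter means an escaped dollar such as "a\\$" is left alone, which errs on the side of not suggesting a wrong rewrite.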

AshesITR, May 21 '22 19:05

This is mostly handled by string_boundary_linter.

What's left is to consider regexes like ^static$, but I am not sure how common they'll be.

MichaelChirico, Jun 01 '22 02:06

Following the above comment, I'll close this and replace it with a more focused issue extending string_boundary_linter for cases like ^static$.

MichaelChirico, Oct 04 '22 04:10