lintr
More `fixed_regex_linter` features
There are some more cases where regexes can be optimized away:
For regex detection functions, we can safely replace
- `grepl("^static_rx", x)` by `startsWith(x, "static_rx")`
- `grepl("static_rx$", x)` by `endsWith(x, "static_rx")`
- `grepl("^static_rx$", x)` by `x == "static_rx"`
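As a quick sanity check of the three detection replacements above, here is a minimal sketch in base R. The sample vector is made up for illustration. It also shows one difference worth keeping in mind: for missing values `grepl()` returns `FALSE`, while `startsWith()`, `endsWith()` and `==` return `NA`.

```r
# Illustrative input only; any character vector without NAs behaves the same way.
x <- c("static_rx", "static_rx_suffix", "prefix_static_rx", "other")

# Each regex call and its proposed replacement give identical results here.
prefix_same <- identical(grepl("^static_rx", x), startsWith(x, "static_rx"))
suffix_same <- identical(grepl("static_rx$", x), endsWith(x, "static_rx"))
exact_same  <- identical(grepl("^static_rx$", x), x == "static_rx")
stopifnot(prefix_same, suffix_same, exact_same)

# Caveat: missing values are treated differently.
na_regex  <- grepl("^a", NA_character_)      # FALSE
na_prefix <- startsWith(NA_character_, "a")  # NA
```

If the input can contain `NA`, the replacement may need an explicit `!is.na(x) &` guard to keep the original behaviour.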
For substitution functions, we can replace
- `gsub("^static_rx$", "replacement", x)` by `(function(.) { .[. == "static_rx"] <- "replacement"; . })(x)`
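A minimal sketch checking that the substitution replacement above agrees with `gsub()` on a made-up character vector. The `replace()` call is an assumption on my part: it is a base R helper that performs the same subset assignment and may read better than the anonymous function.

```r
# Illustrative input only.
x <- c("static_rx", "static_rx_suffix", "other", "static_rx")

via_gsub    <- gsub("^static_rx$", "replacement", x)
via_assign  <- (function(.) { .[. == "static_rx"] <- "replacement"; . })(x)
# base::replace(x, list, values) does `x[list] <- values` and returns x.
via_replace <- replace(x, x == "static_rx", "replacement")

stopifnot(
  identical(via_gsub, via_assign),
  identical(via_gsub, via_replace)
)
```

Only elements that match the whole string are changed; `"static_rx_suffix"` is left alone by all three variants.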
Benchmarks:
> x <- sample(letters, 1e3, TRUE)
> bench::mark(grepl("^a", x), startsWith(x, "a"))
# A tibble: 2 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 grepl("^a", x) 29.86µs 31.15µs 31909. NA 3.19 9999 1 313.4ms <lgl [1,000]> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>
2 startsWith(x, "a") 2.22µs 2.28µs 426737. NA 42.7 9999 1 23.4ms <lgl [1,000]> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>
> bench::mark(grepl("a$", x), endsWith(x, "a"))
# A tibble: 2 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 grepl("a$", x) 29.81µs 30.27µs 32644. NA 3.26 9999 1 306.3ms <lgl [1,000]> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>
2 endsWith(x, "a") 3.55µs 3.61µs 273457. NA 54.7 9998 2 36.6ms <lgl [1,000]> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>
> bench::mark(grepl("^a$", x), x == "a")
# A tibble: 2 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 grepl("^a$", x) 30.06µs 30.54µs 32414. NA 6.48 9998 2 308.4ms <lgl [1,000]> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>
2 x == "a" 2.54µs 2.58µs 383640. NA 38.4 9999 1 26.1ms <lgl [1,000]> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>
> bench::mark(gsub("^a$", "", x), ifelse(x == "a", "", x))
# A tibble: 2 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 gsub("^a$", "", x) 85.3µs 86.2µs 11326. NA 4.09 5542 2 489ms <chr [1,000]> <NULL> <bench_tm [5,544]> <tibble [5,544 × 3]>
2 ifelse(x == "a", "", x) 102.2µs 103.5µs 9609. NA 17.3 4452 8 463ms <chr [1,000]> <NULL> <bench_tm [4,460]> <tibble [4,460 × 3]>
> bench::mark(gsub("^a$", "", x), dplyr::recode(x, "a" = ""))
# A tibble: 2 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 gsub("^a$", "", x) 85.5µs 86.9µs 11181. NA 2.02 5535 1 495ms <chr [1,000]> <NULL> <bench_tm [5,536]> <tibble [5,536 × 3]>
2 dplyr::recode(x, a = "") 66.2µs 68.5µs 14303. NA 50.9 5341 19 373ms <chr [1,000]> <NULL> <bench_tm [5,360]> <tibble [5,360 × 3]>
> bench::mark(gsub("^a$", "", x), (\(.) {.[. == "a"] <- ""; .})(x))
# A tibble: 2 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 gsub("^a$", "", x) 82.84µs 85.9µs 11458. NA 2.02 5673 1 495.1ms <chr [1,000]> <NULL> <bench_tm [5,674]> <tibble [5,674 × 3]>
2 (function(.) { .[. == "a"] <- "" . })(x) 6.33µs 6.79µs 145736. NA 58.3 9996 4 68.6ms <chr [1,000]> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>
Interestingly, `gsub("^a$", "", x)` is faster than `ifelse()`.
We actually have a different linter for this. It looks for `grepl` and `substr` usages that can become `startsWith`/`endsWith`. It ended up as a separate linter because of the `substr` part.
Fine by me, although the regex that detects static regexes would need to be duplicated in that case.
Also for the `==` case?
Using the C version, we just reused `is_not_regex` after skipping the initial `^`.
Oh, yeah. We can do the same: `if (startsWith(x, "^") && is_not_regex(substr(x, 2L, nchar(x)))) ...`
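To make the idea above concrete, here is a rough sketch of such a check. `is_not_regex_sketch()` is a hypothetical stand-in written for this example and is not lintr's actual `is_not_regex`; it simply treats a string as static when it contains none of the usual regex metacharacters. `is_anchored_static_prefix()` is likewise a made-up name.

```r
# Hypothetical stand-in for is_not_regex: TRUE when `s` has no regex metacharacters.
is_not_regex_sketch <- function(s) {
  metachars <- c("\\", "^", "$", ".", "|", "?", "*", "+", "(", ")", "[", "]", "{", "}")
  !any(vapply(metachars, function(m) grepl(m, s, fixed = TRUE), logical(1L)))
}

# Hypothetical helper: is `pattern` a leading "^" followed by a static string?
is_anchored_static_prefix <- function(pattern) {
  startsWith(pattern, "^") &&
    is_not_regex_sketch(substr(pattern, 2L, nchar(pattern)))
}

is_anchored_static_prefix("^static_rx")  # TRUE
is_anchored_static_prefix("^a.b")        # FALSE: "." is a metacharacter
is_anchored_static_prefix("static_rx")   # FALSE: no leading anchor
```

A real implementation would also have to decide how to treat escaped metacharacters such as `\\.`, which this sketch conservatively rejects.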
This is mostly handled by `string_boundary_linter`.
What's left is to consider regexes like `^static$`, but I am not sure how common they'll be.
^ Following the above comment, I'll close this and replace it with a more focused issue extending `string_boundary_linter` to cases like `^static$`.