parzer Benchmarking replacing C++ by R code

Goal

In an effort to make parzer easier to maintain, all C++ code was translated to base R code and stringi functions.

Methods

Compared versions are aacd37 for the C++ version and 6be82b5 for the R version.

Results

The C++ functions are consistently faster than the R versions, by roughly 1.2x to 4.7x on the median timings below, to the point that the burden of maintaining a C++ package seems justified.

R> microbenchmark::microbenchmark(
        R = parzeR::parse_lat("32 4 46"),
        cpp = parzer::parse_lat("32 4 46")
      )
Unit: microseconds
 expr   min    lq  mean median    uq    max neval
    R 86.22 87.37 90.44  88.38 90.30 212.05   100
  cpp 21.69 22.55 24.59  23.62 24.15  93.28   100
R> 
R> microbenchmark::microbenchmark(
        R = parzeR::parse_lon("32 4 46"),
        cpp = parzer::parse_lon("32 4 46")
      )
Unit: microseconds
 expr   min    lq  mean median    uq   max neval
    R 90.45 91.68 94.42  92.64 93.95 201.1   100
  cpp 24.19 24.89 26.96  25.87 26.40 107.8   100
R> 
R> microbenchmark::microbenchmark(
        R = parzeR::parse_latlon("32 4 46N, 37 02 34.5E"),
        cpp = parzer::parse_llstr("32 4 46N, 37 02 34.5E")
      )
Unit: microseconds
 expr   min    lq  mean median    uq   max neval
    R 250.5 257.8 272.6  263.9 273.5 488.5   100
  cpp 197.3 207.3 221.9  213.2 222.2 349.9   100
R> microbenchmark::microbenchmark(
        times = 15,
        R = parzeR::parse_latlon(rep("32 4 46N, 37 02 34.5E", 10^4)),
        cpp = parzer::parse_llstr(rep("32 4 46N, 37 02 34.5E", 10^4))
      )
Unit: milliseconds
 expr    min     lq mean median   uq    max neval
    R 1908.3 1911.5 1926 1915.2 1923 1978.7    15
  cpp  405.4  406.4  409  407.7  410  418.4    15
R> 
R> microbenchmark::microbenchmark(
        times = 15,
        R = parzeR::parse_lon_lat(rep("32 4 46N", 10^4), rep("37 02 34.5E", 10^4)),
        cpp = parzer::parse_lon_lat(rep("32 4 46N", 10^4), rep("37 02 34.5E", 10^4))
      )
Unit: milliseconds
 expr   min    lq  mean median    uq   max neval
    R 574.7 578.6 583.1  582.6 587.3 594.3    15
  cpp 200.4 201.7 209.5  203.8 208.2 270.4    15
There were 50 or more warnings (use warnings() to see the first 50)
R> 
R> microbenchmark::microbenchmark(
        times = 15,
        R = parzeR::parse_hemisphere(rep("32 4 46N", 10^4), rep("37 02 34.5E", 10^4)),
        cpp = parzer::parse_hemisphere(rep("32 4 46N", 10^4), rep("37 02 34.5E", 10^4))
      )
Unit: milliseconds
 expr   min    lq  mean median    uq   max neval
    R 580.6 587.9 594.0  590.2 592.3 650.4    15
  cpp 199.6 200.0 205.5  200.9 203.2 261.6    15
There were 50 or more warnings (use warnings() to see the first 50)
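
To make the comparison easier to read, the medians above can be turned into speed-up ratios. The snippet below is only a convenience sketch: the numbers are copied by hand from the microbenchmark output above, and the labels in the `call` column are my own shorthand, not package identifiers.

```r
# Median timings copied from the microbenchmark output above.
# The first three rows are in microseconds, the last three in milliseconds;
# the ratio R / cpp is unitless, so mixing units across rows is harmless.
medians <- data.frame(
  call = c("parse_lat (1 string)",
           "parse_lon (1 string)",
           "parse_latlon / parse_llstr (1 string)",
           "parse_latlon / parse_llstr (10^4 strings)",
           "parse_lon_lat (10^4 strings)",
           "parse_hemisphere (10^4 strings)"),
  R   = c(88.38, 92.64, 263.9, 1915.2, 582.6, 590.2),
  cpp = c(23.62, 25.87, 213.2, 407.7, 203.8, 200.9)
)

# How many times slower the R version is than the C++ version
medians$speedup <- round(medians$R / medians$cpp, 2)
medians
```

The smallest ratio is for the single-string parse_latlon / parse_llstr call and the largest for the same call on 10^4 strings.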

Apr 07 '25 16:04 AlbanSagouis

@AlbanSagouis Sorry for having dropped this for so long, but I have still been pondering it. In the meantime I had suspected that internal R or stringi routines would be just as fast, so it is great that you've confirmed that they're not. The other approach I was considering was https://github.com/kokke/tiny-regex-c, which looks promising and should be faster than the C++ std::regex implementation.

Note also that https://github.com/tdhock/nc exposes several different regex engines, including RE2 via https://github.com/girishji/re2. Would it help your current efforts if I looked into tiny-regex-c, or explored those other options a bit?

Apr 08 '25 13:04 mpadge

Hi @mpadge

I'm interested :)

I just ran a quick comparison of very simple functions written with nc (2025.1.21), stringi (1.8.7), and parzer aacd37 (Rcpp + the std::regex library). The function parzer:::extract_nsew is not normally exported, so I added the Rcpp export attribute and rebuilt the package.

Results

R> # One row
R> test_data <- data.table::data.table(
        lat = rnorm(1, mean = 45, sd =10) |> abs() |> paste0("N"),
        lon = rnorm(1, mean = 90, sd =10) |> abs() |> paste0("E"))
R> 
R> microbenchmark::microbenchmark(
        times = 200,
        ncPCRE = nc::capture_first_vec(test_data$lat,
                                       cardinal = "[NSEW]",
                                       nomatch.error = FALSE,
                                       engine = "PCRE"),
        ncRE2 = nc::capture_first_vec(test_data$lat,
                                      cardinal = "[NSEW]",
                                      nomatch.error = FALSE,
                                      engine = "RE2"),
        ncICU = nc::capture_first_vec(test_data$lat,
                                      cardinal = "[NSEW]",
                                      nomatch.error = FALSE,
                                      engine = "ICU"),
        stringi = {stringi::stri_extract_first_regex(test_data$lat, "[NSEW]") |> 
            data.table::as.data.table() |> 
            stats::setNames("cardinal")},
        parzerVapply = {vapply(test_data$lat, function(x) parzer:::extract_nsew(x, "[NSEW]"),
                               character(1)) |> 
            data.table::as.data.table() |> 
            stats::setNames("cardinal")},
        parzerDT = test_data[j = .(cardinal = parzer:::extract_nsew(lat, "[NSEW]")),
                             by = .I]
      )
Unit: microseconds
         expr    min     lq   mean median     uq    max neval
       ncPCRE 166.91 179.60 195.30 190.08 202.31  374.5   200
        ncRE2 172.32 185.20 240.32 195.08 208.73 4191.3   200
        ncICU 160.60 175.28 187.71 185.12 194.83  264.4   200
      stringi  79.58  88.87  96.35  93.58  98.99  297.4   200
 parzerVapply  79.25  89.40  97.09  94.07 100.82  314.3   200
     parzerDT 183.64 204.04 221.14 213.51 229.21  408.7   200
R> 
R> 
R> # Many rows
R> test_data <- data.table::data.table(
        lat = rnorm(10^5, mean = 45, sd =10) |> abs() |> paste0("N"),
        lon = rnorm(10^5, mean = 90, sd =10) |> abs() |> paste0("E"))
R> 
R> microbenchmark::microbenchmark(
        times = 20,
        ncPCRE = nc::capture_first_vec(test_data$lat,
                                       cardinal = "[NSEW]",
                                       nomatch.error = FALSE,
                                       engine = "PCRE"),
        ncRE2 = nc::capture_first_vec(test_data$lat,
                                      cardinal = "[NSEW]",
                                      nomatch.error = FALSE,
                                      engine = "RE2"),
        ncICU = nc::capture_first_vec(test_data$lat,
                                      cardinal = "[NSEW]",
                                      nomatch.error = FALSE,
                                      engine = "ICU"),
        stringi = {stringi::stri_extract_first_regex(test_data$lat, "[NSEW]") |> 
            data.table::as.data.table() |> 
            stats::setNames("cardinal")},
        parzerVapply = {vapply(test_data$lat, function(x) parzer:::extract_nsew(x, "[NSEW]"),
                               character(1)) |> 
            data.table::as.data.table() |> 
            stats::setNames("cardinal")},
        parzerDT = test_data[j = .(cardinal = parzer:::extract_nsew(lat, "[NSEW]")),
                             by = .I]
      )
Unit: milliseconds
         expr    min     lq   mean median     uq    max neval
       ncPCRE  12.67  12.80  13.08  12.98  13.45  13.68    20
        ncRE2  27.44  27.59  29.15  27.82  28.51  39.91    20
        ncICU  23.44  23.73  24.65  23.92  24.46  33.62    20
      stringi  13.04  13.35  13.49  13.45  13.65  13.95    20
 parzerVapply 618.50 626.69 644.84 640.00 660.08 706.30    20
     parzerDT 646.43 652.28 659.47 657.60 662.09 684.82    20
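
One thing to keep in mind when reading the parzerVapply and parzerDT rows: both expressions call extract_nsew once per element (through vapply and through by = .I), whereas the nc and stringi expressions receive the whole vector in a single call. The base R sketch below illustrates how much a per-element call pattern alone can cost. It uses regmatches() and regexpr() as a stand-in for the regex engine, and extract_one is a hypothetical helper, not parzer code, so it says nothing about std::regex itself.

```r
# Stand-in data shaped like test_data$lat above
# (assumption: a smaller n than in the benchmark, to keep the run short)
set.seed(1)
lat <- paste0(abs(rnorm(1e4, mean = 45, sd = 10)), "N")

# Hypothetical scalar extractor, standing in for a function called per element
extract_one <- function(x) regmatches(x, regexpr("[NSEW]", x))

# One call per element, as in the parzerVapply expression
per_element <- function(v) {
  vapply(v, extract_one, character(1), USE.NAMES = FALSE)
}

# One call on the whole vector, as in the stringi expression
vectorised <- function(v) regmatches(v, regexpr("[NSEW]", v))

# Both strategies must return the same result
stopifnot(identical(per_element(lat), vectorised(lat)))

# Rough timings; microbenchmark would give finer resolution
t_loop <- system.time(per_element(lat))[["elapsed"]]
t_vec  <- system.time(vectorised(lat))[["elapsed"]]
c(per_element = t_loop, vectorised = t_vec)
```

If the per-element version is much slower here as well, part of the gap in the table above may be call overhead rather than regex speed.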

For reference, I think this is an exhaustive list of the stringi functions used in the pure R / stringi version of the package: stringi::stri_replace_all_regex(), stringi::stri_split_regex(), stringi::stri_split_fixed(), stringi::stri_extract_all_regex(), stringi::stri_extract_first_regex(), stringi::stri_trans_tolower(), stringi::stri_length(), stringi::stri_trim_both(), stringi::stri_count_fixed().

I assume most of them would have an nc equivalent if needed.

Conclusions

The fact that stringi is as fast as (one row) or much faster than (10^5 rows) the Rcpp parzer implementation in this little example suggests that the speed difference between the R and C++ versions of parzer may come from computations unrelated to regex.
nc does not seem faster than stringi here, and it adds a significant dependency on data.table. That said, I like data.table, and nc would probably shine more when used together with data.table.
@mpadge, yes, I think it would be helpful if you looked into [tiny-regex-c](https://github.com/kokke/tiny-regex-c).
Do you still think cpp11 would be faster than Rcpp?

Apr 08 '25 17:04 AlbanSagouis