Benchmarking the replacement of C++ code with R code
Goal
In an effort to make parzer easier to maintain, all C++ code was translated to base R code and stringi functions.
Methods
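The compared versions are aacd37 for the C++ version and 6be82b5 for the R version. Timings were collected with microbenchmark::microbenchmark(), using its default of 100 evaluations for single strings and 15 evaluations for vectors of 10^4 strings. In the calls below, parzeR is the R build and parzer is the C++ build.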
Results
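The C++ functions are consistently faster than their R counterparts: by median time, between roughly 1.2 and 4.7 times faster depending on the function and input size. The gap is large enough that the burden of maintaining a C++ package seems justified.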
R> microbenchmark::microbenchmark(
R = parzeR::parse_lat("32 4 46"),
cpp = parzer::parse_lat("32 4 46")
)
Unit: microseconds
expr min lq mean median uq max neval
R 86.22 87.37 90.44 88.38 90.30 212.05 100
cpp 21.69 22.55 24.59 23.62 24.15 93.28 100
R>
R> microbenchmark::microbenchmark(
R = parzeR::parse_lon("32 4 46"),
cpp = parzer::parse_lon("32 4 46")
)
Unit: microseconds
expr min lq mean median uq max neval
R 90.45 91.68 94.42 92.64 93.95 201.1 100
cpp 24.19 24.89 26.96 25.87 26.40 107.8 100
R>
R> microbenchmark::microbenchmark(
R = parzeR::parse_latlon("32 4 46N, 37 02 34.5E"),
cpp = parzer::parse_llstr("32 4 46N, 37 02 34.5E")
)
Unit: microseconds
expr min lq mean median uq max neval
R 250.5 257.8 272.6 263.9 273.5 488.5 100
cpp 197.3 207.3 221.9 213.2 222.2 349.9 100
R> microbenchmark::microbenchmark(
times = 15,
R = parzeR::parse_latlon(rep("32 4 46N, 37 02 34.5E", 10^4)),
cpp = parzer::parse_llstr(rep("32 4 46N, 37 02 34.5E", 10^4))
)
Unit: milliseconds
expr min lq mean median uq max neval
R 1908.3 1911.5 1926 1915.2 1923 1978.7 15
cpp 405.4 406.4 409 407.7 410 418.4 15
R>
R> microbenchmark::microbenchmark(
times = 15,
R = parzeR::parse_lon_lat(rep("32 4 46N", 10^4), rep("37 02 34.5E", 10^4)),
cpp = parzer::parse_lon_lat(rep("32 4 46N", 10^4), rep("37 02 34.5E", 10^4))
)
Unit: milliseconds
expr min lq mean median uq max neval
R 574.7 578.6 583.1 582.6 587.3 594.3 15
cpp 200.4 201.7 209.5 203.8 208.2 270.4 15
There were 50 or more warnings (use warnings() to see the first 50)
R>
R> microbenchmark::microbenchmark(
times = 15,
R = parzeR::parse_hemisphere(rep("32 4 46N", 10^4), rep("37 02 34.5E", 10^4)),
cpp = parzer::parse_hemisphere(rep("32 4 46N", 10^4), rep("37 02 34.5E", 10^4))
)
Unit: milliseconds
expr min lq mean median uq max neval
R 580.6 587.9 594.0 590.2 592.3 650.4 15
cpp 199.6 200.0 205.5 200.9 203.2 261.6 15
There were 50 or more warnings (use warnings() to see the first 50)
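As a quick summary of the output above, the speed-up of each C++ function can be computed from the reported medians. This is only arithmetic on the numbers already shown; the data frame and its column names are made up for illustration.

```r
# Median timings copied from the benchmark output above
# (microseconds for single strings, milliseconds for the 10^4 vectors).
medians <- data.frame(
  benchmark = c("parse_lat", "parse_lon",
                "parse_latlon / parse_llstr",
                "parse_latlon / parse_llstr (10^4)",
                "parse_lon_lat (10^4)",
                "parse_hemisphere (10^4)"),
  R   = c(88.38, 92.64, 263.9, 1915.2, 582.6, 590.2),
  cpp = c(23.62, 25.87, 213.2, 407.7, 203.8, 200.9)
)

# A ratio above 1 means the C++ version is faster by that factor.
medians$ratio <- round(medians$R / medians$cpp, 2)
print(medians)
```

The ratios range from about 1.24 (single-string parse_latlon / parse_llstr) to about 4.7 (the same functions on 10^4 strings).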
@AlbanSagouis Sorry for having dropped this for so long, but I have still been pondering it. In the meantime I had suspected that internal R or stringi routines would be just as fast, so it's great that you've confirmed they're not. The other approach I was considering was https://github.com/kokke/tiny-regex-c, which looks promising and should definitely be faster than C++.
Note also that https://github.com/tdhock/nc exposes several different regex engines, including RE2 via https://github.com/girishji/re2. Would it help your current efforts if I looked into tiny-regex-c, or explored those other options a bit?
Hi @mpadge
I'm interested :)
I just ran a quick comparison of very simple functions written with nc (2025.1.21), stringi (1.8.7) and parzer aacd37 (Rcpp + the std::regex library). The function parzer:::extract_nsew is not normally exposed to R, so I added the Rcpp export attribute (// [[Rcpp::export]]) and rebuilt the package.
Results
R> # One row
R> test_data <- data.table::data.table(
lat = rnorm(1, mean = 45, sd =10) |> abs() |> paste0("N"),
lon = rnorm(1, mean = 90, sd =10) |> abs() |> paste0("E"))
R>
R> microbenchmark::microbenchmark(
times = 200,
ncPCRE = nc::capture_first_vec(test_data$lat,
cardinal = "[NSEW]",
nomatch.error = FALSE,
engine = "PCRE"),
ncRE2 = nc::capture_first_vec(test_data$lat,
cardinal = "[NSEW]",
nomatch.error = FALSE,
engine = "RE2"),
ncICU = nc::capture_first_vec(test_data$lat,
cardinal = "[NSEW]",
nomatch.error = FALSE,
engine = "ICU"),
stringi = {stringi::stri_extract_first_regex(test_data$lat, "[NSEW]") |>
data.table::as.data.table() |>
stats::setNames("cardinal")},
parzerVapply = {vapply(test_data$lat, function(x) parzer:::extract_nsew(x, "[NSEW]"),
character(1)) |>
data.table::as.data.table() |>
stats::setNames("cardinal")},
parzerDT = test_data[j = .(cardinal = parzer:::extract_nsew(lat, "[NSEW]")),
by = .I]
)
Unit: microseconds
expr min lq mean median uq max neval
ncPCRE 166.91 179.60 195.30 190.08 202.31 374.5 200
ncRE2 172.32 185.20 240.32 195.08 208.73 4191.3 200
ncICU 160.60 175.28 187.71 185.12 194.83 264.4 200
stringi 79.58 88.87 96.35 93.58 98.99 297.4 200
parzerVapply 79.25 89.40 97.09 94.07 100.82 314.3 200
parzerDT 183.64 204.04 221.14 213.51 229.21 408.7 200
R>
R>
R> # Many rows
R> test_data <- data.table::data.table(
lat = rnorm(10^5, mean = 45, sd =10) |> abs() |> paste0("N"),
lon = rnorm(10^5, mean = 90, sd =10) |> abs() |> paste0("E"))
R>
R> microbenchmark::microbenchmark(
times = 20,
ncPCRE = nc::capture_first_vec(test_data$lat,
cardinal = "[NSEW]",
nomatch.error = FALSE,
engine = "PCRE"),
ncRE2 = nc::capture_first_vec(test_data$lat,
cardinal = "[NSEW]",
nomatch.error = FALSE,
engine = "RE2"),
ncICU = nc::capture_first_vec(test_data$lat,
cardinal = "[NSEW]",
nomatch.error = FALSE,
engine = "ICU"),
stringi = {stringi::stri_extract_first_regex(test_data$lat, "[NSEW]") |>
data.table::as.data.table() |>
stats::setNames("cardinal")},
parzerVapply = {vapply(test_data$lat, function(x) parzer:::extract_nsew(x, "[NSEW]"),
character(1)) |>
data.table::as.data.table() |>
stats::setNames("cardinal")},
parzerDT = test_data[j = .(cardinal = parzer:::extract_nsew(lat, "[NSEW]")),
by = .I]
)
Unit: milliseconds
expr min lq mean median uq max neval
ncPCRE 12.67 12.80 13.08 12.98 13.45 13.68 20
ncRE2 27.44 27.59 29.15 27.82 28.51 39.91 20
ncICU 23.44 23.73 24.65 23.92 24.46 33.62 20
stringi 13.04 13.35 13.49 13.45 13.65 13.95 20
parzerVapply 618.50 626.69 644.84 640.00 660.08 706.30 20
parzerDT 646.43 652.28 659.47 657.60 662.09 684.82 20
For reference, I think this is an exhaustive list of stringi functions used in the pure R / stringi version of the package.
stringi::stri_replace_all_regex()
stringi::stri_split_regex(), stringi::stri_split_fixed()
stringi::stri_extract_all_regex(), stringi::stri_extract_first_regex()
stringi::stri_trans_tolower()
stringi::stri_length()
stringi::stri_trim_both()
stringi::stri_count_fixed()
And I assume most would have an nc equivalent if needed.
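To give a feel for how a few of the functions listed above combine, here is a minimal sketch that converts a "degrees minutes seconds" string to decimal degrees. It is not the package's actual code: the helper name dms_to_decimal and its logic are hypothetical and only handle the simple space-separated format used in the benchmarks.

```r
library(stringi)

# Hypothetical helper, not taken from parzer: converts one
# "degrees minutes seconds" string to decimal degrees.
dms_to_decimal <- function(x) {
  x <- stri_trim_both(x)
  # Hemisphere letter, if any; S and W give negative coordinates.
  hemisphere <- stri_extract_first_regex(x, "[NSEW]")
  sign <- if (!is.na(hemisphere) && hemisphere %in% c("S", "W")) -1 else 1
  # Pull out the numeric fields: degrees, minutes, seconds.
  fields <- as.numeric(stri_extract_all_regex(x, "[0-9]+(\\.[0-9]+)?")[[1]])
  fields <- c(fields, 0, 0)[1:3]  # pad missing minutes/seconds with zero
  sign * sum(fields / c(1, 60, 3600))
}

input <- "  32 4 46N, 37 02 34.5E "
parts <- stri_split_fixed(stri_trim_both(input), ",")[[1]]
coords <- vapply(parts, dms_to_decimal, numeric(1), USE.NAMES = FALSE)
print(coords)
```

The helper works on one string at a time for readability. A vectorised version would pass the whole character vector to each stringi call.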
Conclusions
- The fact that `stringi` is as fast or faster than the `Rcpp` `parzer` implementation in this little example might indicate that the speed difference between the R `parzer` version and the C++ `parzer` might come from computations unrelated to regex?
- `nc` does not seem faster than `stringi` here and it adds a significant dependence on `data.table`, but I like `data.table` and `nc` would probably shine more if combined with `data.table`.
- @mpadge, yes, I think it would be helpful if you looked into [tiny-regex-c](https://github.com/kokke/tiny-regex-c).
- Do you still think `cpp11` would be faster than `Rcpp`?
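One way to probe the first point is to separate regex cost from R-level per-call overhead. In the benchmark above, parzerVapply calls extract_nsew once per element, whereas stringi is called once on the whole vector. The sketch below runs the same stringi regex both ways; the function names vectorised and per_element are made up for illustration, and it says nothing about how extract_nsew itself is implemented.

```r
library(stringi)

set.seed(42)
lat <- paste0(abs(rnorm(10^4, mean = 45, sd = 10)), "N")

# One call on the whole vector: the loop over elements runs in compiled code.
vectorised <- function(x) stri_extract_first_regex(x, "[NSEW]")

# One call per element: same regex work, plus 10^4 R-level function calls.
per_element <- function(x) {
  vapply(x, function(el) stri_extract_first_regex(el, "[NSEW]"),
         character(1), USE.NAMES = FALSE)
}

res_vectorised <- vectorised(lat)
res_per_element <- per_element(lat)
stopifnot(identical(res_vectorised, res_per_element))

# Timings are machine-dependent, so no expected values are given here.
print(system.time(for (i in 1:20) vectorised(lat)))
print(system.time(for (i in 1:20) per_element(lat)))
```

If the per-element variant is much slower even with the same regex engine, a good part of the parzerVapply and parzerDT timings may come from call overhead and not from std::regex.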