lintr
                                
                                 lintr copied to clipboard
                                
                                    lintr copied to clipboard
                            
                            
                            
                        Avoid //expr XPaths
As seen in #1353, #1340, #1310, there are subtle performance implications to the way we write our XPaths.
One thing that became clear is that writing //NODE1[NODE2] is slower than //NODE2[parent::NODE1] if NODE2 is far less frequent than NODE1.
And <expr> is by far the most common node; here's a guess at the general frequency by tabulating across r-devel and my local R packages/scripts:
library(data.table)
library(knitr)
get_r_files = function(dir) list.files(dir, pattern = "\\.R$", recursive = TRUE, full.names = TRUE)
get_freq = function(f) {
  pc = tryCatch(parse(f), error = identity, warning = identity)
  if (inherits(pc, "condition")) return(NULL)
  setDT(getParseData(pc))[, .N, by = token]
}
r_files = c(get_r_files("~/github"), get_r_files("~/svn"))
token_freq = rbindlist(lapply(r_files, get_freq))
token_freq = token_freq[, .(N = sum(N)), by = token][order(-N)]
token_freq[, pct := round(100 * N/sum(N), 1)]
knitr::kable(token_freq[pct > 1])
| token | N | pct | 
|---|---|---|
| expr | 5713829 | 36.4 | 
| ',' | 1514487 | 9.6 | 
| SYMBOL | 1357662 | 8.6 | 
| '(' | 1119392 | 7.1 | 
| ')' | 1119392 | 7.1 | 
| NUM_CONST | 1086271 | 6.9 | 
| SYMBOL_FUNCTION_CALL | 917895 | 5.8 | 
| STR_CONST | 375598 | 2.4 | 
| LEFT_ASSIGN | 334988 | 2.1 | 
| COMMENT | 307772 | 2.0 | 
| EQ_SUB | 263926 | 1.7 | 
| SYMBOL_SUB | 254903 | 1.6 | 
i.e., //expr eliminates at most 2/3 of tokens, while other tokens typically eliminate >90% of the tree.
The trade-off here is for readability. XPaths with a lot of parent::/preceding-sibling::/following-sibling:: axes tend to be less readable -- our current XPaths are fairly readable IMO. Moreover, most of our linters are built around expression-level lints, and having a comparatively small tree is the norm in that case -- I guess the overhead of iterating over expressions is usually higher than the savings from fine-tuning XPaths, and that in the presence of cacheing, performance gains will be unnoticeable in all but edge cases.
So we should proceed gently on this issue. Some ideas:
- Write some helpers that have high readability but do translation to more performant XPaths "under the hood"
- Prioritize fixing it on file-level linters
- Wait for #778 to help quantify which linters are performing worse and prioritize those
~copied into the main issue body for tracking progress at a glance~
Just wanted to also include here a benchmark.
I am assuming the performance benefits scale with the complexity of the code tree.
library(xml2)
library(xmlparsedata)
x <- "switch(stat,
      o = {
        x <- 0.01
      },
      b = {
        x <- 0.05
      },
      # else
      {
        x <- 0.001
      }
    )"
xml <- xml_parse_data(getParseData(parse(text = x)), pretty = TRUE)
xpath1 <- "//expr[FUNCTION and @line1 != @line2 and not(expr[OP-LEFT-BRACE])]"
xpath2 <- "//FUNCTION/parent::expr[@line1 != @line2 and not(expr[OP-LEFT-BRACE])]"
xml <- as_xml_document(xml)
bench::mark(
  "with //expr"    = xml_find_all(xml, xpath1),
  "without //expr" = xml_find_all(xml, xpath2)
)
#> # A tibble: 2 × 6
#>   expression          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 with //expr      20.3µs   21.1µs    46530.    26.4KB     18.6
#> 2 without //expr   15.2µs   15.9µs    61757.        0B     30.9
Created on 2022-10-02 with reprex v2.0.2
see some of the other cited issues, e.g. #1348. the issue gets way worse on complex R files.