lintr Avoid //expr XPaths

As seen in #1353, #1340, #1310, there are subtle performance implications to the way we write our XPaths.

One thing that became clear is that writing //NODE1[NODE2] is slower than //NODE2[parent::NODE1] if NODE2 is far less frequent than NODE1.

And <expr> is by far the most common node; here's a guess at the general frequency by tabulating across r-devel and my local R packages/scripts:

library(data.table)
library(knitr)

get_r_files = function(dir) list.files(dir, pattern = "\\.R$", recursive = TRUE, full.names = TRUE)
get_freq = function(f) {
  pc = tryCatch(parse(f), error = identity, warning = identity)
  if (inherits(pc, "condition")) return(NULL)
  setDT(getParseData(pc))[, .N, by = token]
}

r_files = c(get_r_files("~/github"), get_r_files("~/svn"))
token_freq = rbindlist(lapply(r_files, get_freq))
token_freq = token_freq[, .(N = sum(N)), by = token][order(-N)]
token_freq[, pct := round(100 * N/sum(N), 1)]
knitr::kable(token_freq[pct > 1])

token	N	pct
expr	5713829	36.4
','	1514487	9.6
SYMBOL	1357662	8.6
'('	1119392	7.1
')'	1119392	7.1
NUM_CONST	1086271	6.9
SYMBOL_FUNCTION_CALL	917895	5.8
STR_CONST	375598	2.4
LEFT_ASSIGN	334988	2.1
COMMENT	307772	2.0
EQ_SUB	263926	1.7
SYMBOL_SUB	254903	1.6

i.e., //expr eliminates at most 2/3 of tokens, while other tokens typically eliminate >90% of the tree.

The trade-off here is for readability. XPaths with a lot of parent::/preceding-sibling::/following-sibling:: axes tend to be less readable -- our current XPaths are fairly readable IMO. Moreover, most of our linters are built around expression-level lints, and having a comparatively small tree is the norm in that case -- I guess the overhead of iterating over expressions is usually higher than the savings from fine-tuning XPaths, and that in the presence of cacheing, performance gains will be unnoticeable in all but edge cases.

So we should proceed gently on this issue. Some ideas:

Write some helpers that have high readability but do translation to more performant XPaths "under the hood"
Prioritize fixing it on file-level linters
Wait for #778 to help quantify which linters are performing worse and prioritize those

Jun 04 '22 17:06 MichaelChirico

~copied into the main issue body for tracking progress at a glance~

Jun 19 '22 13:06 AshesITR

Just wanted to also include here a benchmark.

I am assuming the performance benefits scale with the complexity of the code tree.

library(xml2)
library(xmlparsedata)

x <- "switch(stat,
      o = {
        x <- 0.01
      },
      b = {
        x <- 0.05
      },
      # else
      {
        x <- 0.001
      }
    )"

xml <- xml_parse_data(getParseData(parse(text = x)), pretty = TRUE)

xpath1 <- "//expr[FUNCTION and @line1 != @line2 and not(expr[OP-LEFT-BRACE])]"
xpath2 <- "//FUNCTION/parent::expr[@line1 != @line2 and not(expr[OP-LEFT-BRACE])]"

xml <- as_xml_document(xml)

bench::mark(
  "with //expr"    = xml_find_all(xml, xpath1),
  "without //expr" = xml_find_all(xml, xpath2)
)
#> # A tibble: 2 × 6
#>   expression          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 with //expr      20.3µs   21.1µs    46530.    26.4KB     18.6
#> 2 without //expr   15.2µs   15.9µs    61757.        0B     30.9

^{Created on 2022-10-02 with reprex v2.0.2}

Oct 02 '22 03:10 IndrajeetPatil

see some of the other cited issues, e.g. #1348. the issue gets way worse on complex R files.

Oct 02 '22 04:10 MichaelChirico

lintr lintr copied to clipboard

Avoid //expr XPaths

lintr
lintr copied to clipboard