lintr Integration with roxygen2: data structure for get_source

There are a number of linters it would be nice to write regarding roxygen2 documentation, e.g. ensuring @param entries are in sync with the declaration (of course @inheritParams makes this hard to to completely, but certainly @param X should only happen when X is indeed a parameter.

We could also apply the rest of the lint suite directly to code in @examples.

Parsing the roxygen markup is easy enough: roxygen2::parse_file(). The key will be integrating this result with our existing data structure from get_source_expressions(), which I don't see a super clean way to do while also ensuring we keep the link between which doc entry corresponds to which source expression.

As a reminder, the current structure is laid out here: https://lintr.r-lib.org/reference/get_source_expressions.html#value.

We also likely want to keep the data structure back-compatible unless we want this to become a version 4.0.0 thing.

So filing this issue first for design discussion.

Jul 13 '23 18:07 MichaelChirico

I like both suggested use cases. We should think about how to convey the linters that an expression is from parsed Rd so that the linter can be smart about if and what to lint. e.g. line_length_linter() should ignore these expressions since the possible lints are already covered by the raw comment expressions.

For linters that want to cross-reference code and docs, we need to extend the file-level expressions.

Jul 14 '23 19:07 AshesITR

Another idea would be to just lint the man/ directory by adding support for Rd files to get_source_expressions.

That will let us lint non-roxygen2 packages' examples and should be pretty straightforward to implement and integrate to the existing package.

OTOH I suspect many roxygen2 authors have never opened the man/ directory, and it would mean lints showing up in a different directory from where the examples are written and maintained.

We also generate Rd files "on the fly" at build time, not storing the Rd files in source control. Examples could not be linted in similar setups.

Aug 09 '23 15:08 MichaelChirico

Good ideas, just want to mention here the recent prototype project https://openpharma.github.io/roxylint by @dgkf

Aug 09 '23 15:08 danielinteractive

Nice! That looks more general since i don't think rules for titles etc are in scope & we'll stick to linting just the examples.

Aug 09 '23 15:08 MichaelChirico

We could definitely add a default lintr for examples in roxylint that uses lintr, though I do think it makes sense as well to make this a lintr-native feature, especially if it's used by languageserver for real-time lint hints.

Having them as part of a roxygen2::roxygenize call is nice, but slow, and the machinery just isn't there for giving real-time feedback in the same way.

Just some other ideas that further complicate this, @examples may contain lots of additional Rd-tex such as \dontrun{}, \dontshow{}, \donttest{} and \testonly{}, which probably means that any linting would first have to render the code output of the example and then lint it, possibly with the added complexity of mapping rendered example code lint locations back to roxygen2 source code locations.

I did a bit of this in https://github.com/dgkf/testex to report test failure locations, but it's definitely non-trivial.

Aug 09 '23 15:08 dgkf

A thought for evolving the data structure. get_source_expressions() (ref here for current return object) to return:

list(
  expressions = expressions_,
  error = error_
  #, lines = character() < remove as redundant
)

expressions_ = list(
  file = file_object_,
  chunk = chunk_object_,
  expression = expression_object_,
  roxygen = roxygen_object_
)

error = Lint(...) # same as now

{file,chunk,expression,roxygen}_object_ = list(
  filename = character(),
  # line = integer(), < remove as redundant
  # column = integer(), < remove as redundant
  source_lines = named_character(),
  source_content = character(),
  parsed_content = data.frame(),
  xml_parsed_content = xml_document(),
  terminal_newline = logical()
)

This would be a breaking change. It would introduce more consistency (e.g. no obscure difference in full_xml_parsed_content vs. xml_parsed_content), and pare off some more redundant bits (expressions[[ii]]$line is apparently only used in cyclocomp_linter() and IINM could easily be worked around).

I am a bit apprehensive about roxygen not being directly linked to expression, we probably need some more careful though about how to make sure we can easily write linters about issues involving both the roxygen header & the linked code.

Sep 14 '23 17:09 MichaelChirico

For the reason you mentioned, I'd suggest instead extending the expression level to allow for roxygen metadata.

All roxygen blocks would have to belong to a top-level expression, which is currently the case IINM.

We need to be aware that these might not be contiguous though, cf. roxygen for R6 classes.

Sep 14 '23 21:09 AshesITR

Another idea: get_source_expressions() could replace <COMMENT> nodes with <ROXYGEN2> nodes that include an {roxygen2}-specific sub-tree. @examples become "normal" xml_parse_data() sub-trees:

<ROXYGEN2>
  <roxy-examples>
    <exprlist>
      ...
    </exprlist>
  </roxy-examples>
</ROXYGEN2>

That way:

Normal XPaths can basically "just work", e.g. //SYMBOL_FUNCTION_CALL will find parsed calls in {roxygen2} @examples
Issues like #1908 should go away as the <COMMENT> node throwing the lint is gone
We can handle concerns about which expression a given #' comment is attaching to with appropriate choice of parent in the tree. Things are still a bit wonky because the data structure of roxygen2 comments is fundamentally dual -- the comment gets parsed as part of a potentially non-contiguous block. So people might want a linter that expects contiguity but it'll be hard to maintain that. It also means expression caching may not work "as expected" because the whole block will be cached alongside the expression to which it attaches, and editing one expression in an example will invalidate the cache for the whole function. I think this is fine but something to keep an eye out for.

It might make sense for {xmlparsedata} to handle this, so cc @gaborcsardi. But we'll have to edit the getParseData() data.frame as well, so my first instinct is it's best for us to handle the implementation here.

Dec 01 '23 02:12 MichaelChirico

Concrete example:

#' A function
#' @param x An input
#' @export
foo <- function(x) x

Current parse tree:

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<exprlist>
  <COMMENT line1="1" col1="1" line2="1" col2="13" start="22" end="34">#' A function</COMMENT>
  <COMMENT line1="2" col1="1" line2="2" col2="20" start="43" end="62">#' @param x An input</COMMENT>
  <COMMENT line1="3" col1="1" line2="3" col2="10" start="64" end="73">#' @export</COMMENT>
  <expr line1="4" col1="1" line2="4" col2="20" start="85" end="104">
    <expr line1="4" col1="1" line2="4" col2="3" start="85" end="87">
      <SYMBOL line1="4" col1="1" line2="4" col2="3" start="85" end="87">foo</SYMBOL>
    </expr>
    <LEFT_ASSIGN line1="4" col1="5" line2="4" col2="6" start="89" end="90">&lt;-</LEFT_ASSIGN>
    <expr line1="4" col1="8" line2="4" col2="20" start="92" end="104">
      <FUNCTION line1="4" col1="8" line2="4" col2="15" start="92" end="99">function</FUNCTION>
      <OP-LEFT-PAREN line1="4" col1="16" line2="4" col2="16" start="100" end="100">(</OP-LEFT-PAREN>
      <SYMBOL_FORMALS line1="4" col1="17" line2="4" col2="17" start="101" end="101">x</SYMBOL_FORMALS>
      <OP-RIGHT-PAREN line1="4" col1="18" line2="4" col2="18" start="102" end="102">)</OP-RIGHT-PAREN>
      <expr line1="4" col1="20" line2="4" col2="20" start="104" end="104">
        <SYMBOL line1="4" col1="20" line2="4" col2="20" start="104" end="104">x</SYMBOL>
      </expr>
    </expr>
  </expr>
</exprlist>

roxygen2::parse_text() gives us the roxygen2 AST:

<roxy_block> [<text>:4]
  $tag
    [line:  1] @title 'A function' {parsed}
    [line:  2] @param 'x An input' {parsed}
    [line:  3] @export '' {parsed}
    [line:  4] @usage '<generated>' {parsed}
    [line:  4] @.formals '<generated>' {parsed}
    [line:  4] @backref '<generated>' {parsed}
  $call   foo <- function(x) x
  $object <function> 
    $topic foo
    $alias foo

Possible adjusted parse tree:

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<exprlist>
  <roxy_block>
    <roxy-tag-title>A function</roxy-tag-title>
    <roxy-tag-param>
      @param
      <roxy-tag-param-name>x</roxy-tag-param-name>
      <roxy-tag-param-desc>An input</roxy-tag-param-desc>
    <roxy-tag-param>
    <roxy-tag-export>@export</roxy-tag-export>
    <expr line1="4" col1="1" line2="4" col2="20" start="85" end="104">
      <expr line1="4" col1="1" line2="4" col2="3" start="85" end="87">
        <SYMBOL line1="4" col1="1" line2="4" col2="3" start="85" end="87">foo</SYMBOL>
      </expr>
      <LEFT_ASSIGN line1="4" col1="5" line2="4" col2="6" start="89" end="90">&lt;-</LEFT_ASSIGN>
      <expr line1="4" col1="8" line2="4" col2="20" start="92" end="104">
        <FUNCTION line1="4" col1="8" line2="4" col2="15" start="92" end="99">function</FUNCTION>
        <OP-LEFT-PAREN line1="4" col1="16" line2="4" col2="16" start="100" end="100">(</OP-LEFT-PAREN>
        <SYMBOL_FORMALS line1="4" col1="17" line2="4" col2="17" start="101" end="101">x</SYMBOL_FORMALS>
        <OP-RIGHT-PAREN line1="4" col1="18" line2="4" col2="18" start="102" end="102">)</OP-RIGHT-PAREN>
        <expr line1="4" col1="20" line2="4" col2="20" start="104" end="104">
          <SYMBOL line1="4" col1="20" line2="4" col2="20" start="104" end="104">x</SYMBOL>
        </expr>
      </expr>
    </expr>
  </roxy_block>
</exprlist>

Dec 01 '23 03:12 MichaelChirico

That looks pretty good to me. IDK where it should be handled, I don't mind putting it into xmlparsedata, either, with having roxygen2 an optional dependency, or vendoring the roxygen2 parser, which is pretty stable at this point.

Dec 01 '23 10:12 gaborcsardi

Why have the roxy_block be a parent of the expr? Shouldn't a //roxy_block/following-sibling::expr suffice if they need to be linked?

Dec 01 '23 23:12 AshesITR

Why have the roxy_block be a parent of the expr? Shouldn't a //roxy_block/following-sibling::expr suffice if they need to be linked?

This does match the roxy_block structure (which has element $call). I don't have a strong sense ex ante of whether one or the other is preferable. One thing to look out for is code like:

#' @export
# TODO: deprecate this
foo <- function(x) x

If <roxy_block> and foo's <expr> are neighbors, it's easier to write (mistaken) logic finding the # TODO comment that's excluded from the roxygen structure.

A bit speculative but the following comes to mind: following-sibling::expr works in the simple case but it might not generalize? Would we need following-sibling::*[self::expr or self::expr_or_help_or_assign] to handle = assignment? And any logic needing following-sibling::* risks pulling in dozens or hundreds of nodes in a long R source file.

Dec 03 '23 06:12 MichaelChirico

The roxygen2 comments may appear up to the end of the expression, and this is actually used, e.g. for most documented R6 classes. Here is a non-R6 example with a lot of inline docs: https://github.com/r-lib/cli/blob/ef77a1606543c9fe135fadd8faba52573a8c922e/R/num-ansi-colors.R#L36

Dec 03 '23 20:12 gaborcsardi

Integration with roxygen2: data structure for get_source_expressions()