languageserver icon indicating copy to clipboard operation
languageserver copied to clipboard

Use `tree-sitter-r` to parse the document

Open lebensterben opened this issue 4 years ago • 8 comments
trafficstars

Currently, the document is parsed manually in a very inefficient way.

For example, find_unbalanced_bracket painstakingly reads a line byte by byte and keeps track of how many open/close brackets it has encountered.

But with tree-sitter API, the parser would fill in missing closing brackets but label it as missing. So to find out whether there's unbalanced bracket, and if there is what is its byte offset, you can simply iterate over all child nodes of a syntax tree. And if a node with name }, ], or ) but labelled as missing, then it tells you it's unbalanced.

Similarly, to find out whether a token is enclosed_by_quotes, is just simple locate the token in the tree, using the token's byte offset, and if it's in a node named 'string', it's enclosed in quotes.

detect_comments is simply to find a node named "comment".

Some other example: To provide semantic highlighting is simply finding all identifier node with same name in a node's parent node or ancestor nodes. To find all references of a variable is to find all identifier node with same name in its sibling nodes (after it) and its child nodes. (Yes the search should end when the variable is shadowed. But you get the idea here.)


A more severe performance issue is that, the current implementation calculate byte offset by converting strings to UTF-16 BE. Well this is okay by itself, but in the meantime the C program does is to convert a line to a R String (from SEXP) and encode it in UTF-8 if it's not already.

This back-and-forth conversion just looks awful.

Using tree-sitter, you just need to parse the document to get a Tree and don't need to worry about the byte offset, row index, or column index as they are all included in a Node.

When the document is changed, you simply update the end byte offset and end row/column index. And have the parser parse the document again. Note that tree-sitter would reuse nodes of the old tree and only create new nodes when it's necessary.

There's no need to worry about encoding. tree-sitter accepts both UTF-8 and UTF-16 input.

lebensterben avatar Sep 17 '21 22:09 lebensterben

Thanks for the suggestion and it looks interesting. For sure that much of our static analysing code are not tidy. We rely pretty much on xmlparsedata and some hacks methods.

But it remains unclear to me about a few things.

  1. Create a standalone package. It seems that there is no R binding to the tree sitter API currently.

  2. What happen if there are multiple syntax errors? Will it be able to pick up the last unmatched bracket before a cursor? The way that find_unbalanced_bracket works only checks a few lines before the cursor so it is not affected by non related syntax errors.

  3. Does tree sitter also return row and column information based on UTF16 code points? It will ultimately resolve the need of the hacks conversion between Unicode code units and UTF16 code points.

We only support UTF-8 documents to there should be no translation from SEXP to chat buffer. I was using the translate variant of C code just to play safe.

randy3k avatar Sep 18 '21 16:09 randy3k

  1. tree-sitter indeed doesn't have R bindings (available bindings). But it has C bindings while this project is already using C. Also very importantly, tree-sitter-r is an official R language parser of r-lib. We can be confident about the correctness of the parser.
  2. I'll post some examples in next comment.
  3. tree-sitter yields a full parse tree, where each node contains all information of the raw token, such as syntactical type(call, identifier, arguments, etc.) , start/end byte, start/end row and column index, etc.

Also, I've not mentioned in last comment. It's cheap to copy a parse tree because it's wrapped behind an atomic reference counted pointer, which also means it's thread-safe if you've multiple tasks that needs access to the tree.

lebensterben avatar Sep 18 '21 20:09 lebensterben

I made a simple Rust program that prints the following R program.

x <- list(c(1, 2, 3)
y <- "bar

There are two errors.

  • The parentheses are not balanced in the first assignment.
  • Quotes are not balanced in the second assignment.

The Rust program is:

use tree_sitter::Parser;
use tree_sitter_r::language;

fn main() {
    let mut parser = Parser::new();
    parser.set_language(language()).unwrap();

    let input = br#"
x <- list(c(1, 2, 3)
y <- "bar"#;
    let tree = parser.parse(&input, None).unwrap();

    let mut cursor = tree.walk();
    let mut indent = vec![];
    'outer: loop {
        let node = cursor.node();
        println!(
            "{}{{{:?} {} - {}}}{}",
            indent.concat(),
            node.kind(),
            node.start_position(),
            node.end_position(),
            if node.is_missing() { ", MISSING" } else { "" }
        );
        if cursor.goto_first_child() {
            // depth-first traversal
            indent.push("\t");
            continue 'outer;
        } else if cursor.goto_next_sibling() {
            // when no child left, go to sibling
            continue 'outer;
        } else {
            // end of subtree, backtrack
            'inner: while cursor.goto_parent() {
                indent.pop();
                // parent is already printed before
                if cursor.goto_next_sibling() {
                    // sibling is not printed
                    continue 'outer;
                } else {
                    // no sibling, then go to parent
                    continue 'inner;
                }
            }
            // no parent (at root), break
            break 'outer;
        }
    }
}

The output is

{"program" (1, 0) - (2, 9)}
	{"left_assignment" (1, 0) - (1, 9)}
		{"identifier" (1, 0) - (1, 1)}
		{"<-" (1, 2) - (1, 4)}
		{"identifier" (1, 5) - (1, 9)}
	{"ERROR" (1, 9) - (2, 9)}
		{"(" (1, 9) - (1, 10)}
		{"call" (1, 10) - (1, 20)}
			{"identifier" (1, 10) - (1, 11)}
			{"(" (1, 11) - (1, 12)}
			{"arguments" (1, 12) - (1, 19)}
				{"float" (1, 12) - (1, 13)}
				{"," (1, 13) - (1, 14)}
				{"float" (1, 15) - (1, 16)}
				{"," (1, 16) - (1, 17)}
				{"float" (1, 18) - (1, 19)}
			{")" (1, 19) - (1, 20)}
		{"identifier" (2, 0) - (2, 1)}
		{"<-" (2, 2) - (2, 4)}
		{"\"" (2, 5) - (2, 6)}

It correctly finds the first assignment, and parsed the LHS identifier x and the operator <- correctly. But for the RHS list(c(1, 2, 3), due to the missing close parenthesis, it keeps parsing till the end of the input. Since there're multiple errors, rather than one missing parenthesis, it's not able to complete the tree by inserting ). Therefore the rest of the input is marked as ERROR. Within the Error

  • It sees the ( after list.
  • It "incorrectly" parsed the c(1, 2, 3) as a function call. Note that this is "incorrect" because the first assignment could, for example, be x <- list(c(1), 2, 3) instead. But I don't think static analysis is able to figure out the correct intent of the coder here.
  • Normally, if the parentheses are balanced, this is the end of funcall of list (and <-). But here it keeps reading and sees y, <-, and ".
  • The last token it parsed is from line 2 col 5 to line 2 col 6, i.e. the opening quote ". And tokens from line 2 col 6 to line 2 col 9, i.e. bar and " is included in the tree as shown from the root node, but not parsed.

The information we can get from the syntax tree is:

  • When traversing the tree, we found an ERROR. Similar to other parsers, it guarantees that all tokens before ERROR are lexically correct. So we can only need the tokens after the ERROR to figure out what's going on.
  • The first child of ERROR is (. In a correct input, all brackets or quotes must be balanced, and each pair of opening/closing parentheses/quotes would be the sibling child node of a parent. So we can check whether ( has a sibling ) to figure out whether it's balanced.
  • In summary, the parse tree tells us something went wrong from line 1 col 9. And we figured out that's because ( is not paired with a closing parenthesis.
    • If we want to display some error, e.g. mark an unbalanced bracket in different color, that's directly viable. Just mark ( at line 1 col 9 as unpaired parenthesis.
    • Inserting ) at line 2 col 6 does not necessarily fix the error. But if you want to provide some suggestions via code lens, you may try to insert ) after each of the siblings of (, and let tree-sitter to parse it again.

lebensterben avatar Sep 18 '21 21:09 lebensterben

Here's another example with slightly different input

x <- list(c(1, 2, 3)
y <- "bar"

Note that the only error is missing ) in the first assignment.

This time the parse tree is:

{"program" (1, 0) - (2, 10)}
	{"left_assignment" (1, 0) - (2, 10)}
		{"identifier" (1, 0) - (1, 1)}
		{"<-" (1, 2) - (1, 4)}
		{"call" (1, 5) - (2, 10)}
			{"identifier" (1, 5) - (1, 9)}
			{"(" (1, 9) - (1, 10)}
			{"arguments" (1, 10) - (2, 10)}
				{"call" (1, 10) - (1, 20)}
					{"identifier" (1, 10) - (1, 11)}
					{"(" (1, 11) - (1, 12)}
					{"arguments" (1, 12) - (1, 19)}
						{"float" (1, 12) - (1, 13)}
						{"," (1, 13) - (1, 14)}
						{"float" (1, 15) - (1, 16)}
						{"," (1, 16) - (1, 17)}
						{"float" (1, 18) - (1, 19)}
					{")" (1, 19) - (1, 20)}
				{"left_assignment" (2, 0) - (2, 10)}
					{"identifier" (2, 0) - (2, 1)}
					{"<-" (2, 2) - (2, 4)}
					{"string" (2, 5) - (2, 10)}
						{"\"" (2, 5) - (2, 6)}
						{"\"" (2, 9) - (2, 10)}
			{")" (2, 10) - (2, 10)}, MISSING

This time we didn't see ERROR. Instead tree-sitter is smart enough to insert the missing ) and marked it as missing. (But again not that this is only lexically correct and not necessarily what programmer want.)

What we get from the tree is:

  • When traversing the tree, we see the ) at line 2 col 10 is in fact missing. tree-sitter has API to go back to previous sibling. Via that, we find ( at line 1 col 9 to line 1 col 10. And we conclude that this is the point where we have unbalanced bracket.

lebensterben avatar Sep 18 '21 21:09 lebensterben

I saw the tree-sitter project a while ago, but didn't have time to investigate. I'm always interested to see if we could use a more error-tolerant, performant, and informative parser.

Thanks for providing the information and examples. I'll play a bit with tree-sitter in R.

renkun-ken avatar Sep 19 '21 00:09 renkun-ken

Just googled this https://github.com/jimhester/rtreesitter/ it may be helpful.

randy3k avatar Sep 19 '21 07:09 randy3k

this is incomplete. there's no cursor API.

lebensterben avatar Sep 19 '21 11:09 lebensterben

this is incomplete. there's no cursor API.

There is now a TreeCursor API https://docs.rs/tree-sitter/latest/tree_sitter/struct.TreeCursor.html

JosiahParry avatar Mar 02 '24 01:03 JosiahParry