languageserver
languageserver copied to clipboard
Use `tree-sitter-r` to parse the document
Currently, the document is parsed manually in a very inefficient way.
For example, find_unbalanced_bracket painstakingly reads a line byte by byte and keeps track of how many open/close brackets it has encountered.
But with tree-sitter API, the parser would fill in missing closing brackets but label it as missing. So to find out whether there's unbalanced bracket, and if there is what is its byte offset, you can simply iterate over all child nodes of a syntax tree. And if a node with name }, ], or ) but labelled as missing, then it tells you it's unbalanced.
Similarly, to find out whether a token is enclosed_by_quotes, is just simple locate the token in the tree, using the token's byte offset, and if it's in a node named 'string', it's enclosed in quotes.
detect_comments is simply to find a node named "comment".
Some other example:
To provide semantic highlighting is simply finding all identifier node with same name in a node's parent node or ancestor nodes.
To find all references of a variable is to find all identifier node with same name in its sibling nodes (after it) and its child nodes. (Yes the search should end when the variable is shadowed. But you get the idea here.)
A more severe performance issue is that, the current implementation calculate byte offset by converting strings to UTF-16 BE. Well this is okay by itself, but in the meantime the C program does is to convert a line to a R String (from SEXP) and encode it in UTF-8 if it's not already.
This back-and-forth conversion just looks awful.
Using tree-sitter, you just need to parse the document to get a Tree and don't need to worry about the byte offset, row index, or column index as they are all included in a Node.
When the document is changed, you simply update the end byte offset and end row/column index. And have the parser parse the document again. Note that tree-sitter would reuse nodes of the old tree and only create new nodes when it's necessary.
There's no need to worry about encoding. tree-sitter accepts both UTF-8 and UTF-16 input.
Thanks for the suggestion and it looks interesting. For sure that much of our static analysing code are not tidy. We rely pretty much on xmlparsedata and some hacks methods.
But it remains unclear to me about a few things.
-
Create a standalone package. It seems that there is no R binding to the tree sitter API currently.
-
What happen if there are multiple syntax errors? Will it be able to pick up the last unmatched bracket before a cursor? The way that find_unbalanced_bracket works only checks a few lines before the cursor so it is not affected by non related syntax errors.
-
Does tree sitter also return row and column information based on UTF16 code points? It will ultimately resolve the need of the hacks conversion between Unicode code units and UTF16 code points.
We only support UTF-8 documents to there should be no translation from SEXP to chat buffer. I was using the translate variant of C code just to play safe.
tree-sitterindeed doesn't have R bindings (available bindings). But it has C bindings while this project is already using C. Also very importantly,tree-sitter-ris an official R language parser ofr-lib. We can be confident about the correctness of the parser.- I'll post some examples in next comment.
tree-sitteryields a full parse tree, where each node contains all information of the raw token, such as syntactical type(call, identifier, arguments, etc.) , start/end byte, start/end row and column index, etc.
Also, I've not mentioned in last comment. It's cheap to copy a parse tree because it's wrapped behind an atomic reference counted pointer, which also means it's thread-safe if you've multiple tasks that needs access to the tree.
I made a simple Rust program that prints the following R program.
x <- list(c(1, 2, 3)
y <- "bar
There are two errors.
- The parentheses are not balanced in the first assignment.
- Quotes are not balanced in the second assignment.
The Rust program is:
use tree_sitter::Parser;
use tree_sitter_r::language;
fn main() {
let mut parser = Parser::new();
parser.set_language(language()).unwrap();
let input = br#"
x <- list(c(1, 2, 3)
y <- "bar"#;
let tree = parser.parse(&input, None).unwrap();
let mut cursor = tree.walk();
let mut indent = vec![];
'outer: loop {
let node = cursor.node();
println!(
"{}{{{:?} {} - {}}}{}",
indent.concat(),
node.kind(),
node.start_position(),
node.end_position(),
if node.is_missing() { ", MISSING" } else { "" }
);
if cursor.goto_first_child() {
// depth-first traversal
indent.push("\t");
continue 'outer;
} else if cursor.goto_next_sibling() {
// when no child left, go to sibling
continue 'outer;
} else {
// end of subtree, backtrack
'inner: while cursor.goto_parent() {
indent.pop();
// parent is already printed before
if cursor.goto_next_sibling() {
// sibling is not printed
continue 'outer;
} else {
// no sibling, then go to parent
continue 'inner;
}
}
// no parent (at root), break
break 'outer;
}
}
}
The output is
{"program" (1, 0) - (2, 9)}
{"left_assignment" (1, 0) - (1, 9)}
{"identifier" (1, 0) - (1, 1)}
{"<-" (1, 2) - (1, 4)}
{"identifier" (1, 5) - (1, 9)}
{"ERROR" (1, 9) - (2, 9)}
{"(" (1, 9) - (1, 10)}
{"call" (1, 10) - (1, 20)}
{"identifier" (1, 10) - (1, 11)}
{"(" (1, 11) - (1, 12)}
{"arguments" (1, 12) - (1, 19)}
{"float" (1, 12) - (1, 13)}
{"," (1, 13) - (1, 14)}
{"float" (1, 15) - (1, 16)}
{"," (1, 16) - (1, 17)}
{"float" (1, 18) - (1, 19)}
{")" (1, 19) - (1, 20)}
{"identifier" (2, 0) - (2, 1)}
{"<-" (2, 2) - (2, 4)}
{"\"" (2, 5) - (2, 6)}
It correctly finds the first assignment, and parsed the LHS identifier x and the operator <- correctly.
But for the RHS list(c(1, 2, 3), due to the missing close parenthesis, it keeps parsing till the end of the input. Since there're multiple errors, rather than one missing parenthesis, it's not able to complete the tree by inserting ). Therefore the rest of the input is marked as ERROR. Within the Error
- It sees the
(afterlist. - It "incorrectly" parsed the
c(1, 2, 3)as a function call. Note that this is "incorrect" because the first assignment could, for example, bex <- list(c(1), 2, 3)instead. But I don't think static analysis is able to figure out the correct intent of the coder here. - Normally, if the parentheses are balanced, this is the end of funcall of
list(and<-). But here it keeps reading and seesy,<-, and". - The last token it parsed is from line 2 col 5 to line 2 col 6, i.e. the opening quote
". And tokens from line 2 col 6 to line 2 col 9, i.e.barand"is included in the tree as shown from the root node, but not parsed.
The information we can get from the syntax tree is:
- When traversing the tree, we found an
ERROR. Similar to other parsers, it guarantees that all tokens beforeERRORare lexically correct. So we can only need the tokens after theERRORto figure out what's going on. - The first child of
ERRORis(. In a correct input, all brackets or quotes must be balanced, and each pair of opening/closing parentheses/quotes would be the sibling child node of a parent. So we can check whether(has a sibling)to figure out whether it's balanced. - In summary, the parse tree tells us something went wrong from line 1 col 9. And we figured out that's because
(is not paired with a closing parenthesis.- If we want to display some error, e.g. mark an unbalanced bracket in different color, that's directly viable. Just mark
(at line 1 col 9 as unpaired parenthesis. - Inserting
)at line 2 col 6 does not necessarily fix the error. But if you want to provide some suggestions via code lens, you may try to insert)after each of the siblings of(, and lettree-sitterto parse it again.
- If we want to display some error, e.g. mark an unbalanced bracket in different color, that's directly viable. Just mark
Here's another example with slightly different input
x <- list(c(1, 2, 3)
y <- "bar"
Note that the only error is missing ) in the first assignment.
This time the parse tree is:
{"program" (1, 0) - (2, 10)}
{"left_assignment" (1, 0) - (2, 10)}
{"identifier" (1, 0) - (1, 1)}
{"<-" (1, 2) - (1, 4)}
{"call" (1, 5) - (2, 10)}
{"identifier" (1, 5) - (1, 9)}
{"(" (1, 9) - (1, 10)}
{"arguments" (1, 10) - (2, 10)}
{"call" (1, 10) - (1, 20)}
{"identifier" (1, 10) - (1, 11)}
{"(" (1, 11) - (1, 12)}
{"arguments" (1, 12) - (1, 19)}
{"float" (1, 12) - (1, 13)}
{"," (1, 13) - (1, 14)}
{"float" (1, 15) - (1, 16)}
{"," (1, 16) - (1, 17)}
{"float" (1, 18) - (1, 19)}
{")" (1, 19) - (1, 20)}
{"left_assignment" (2, 0) - (2, 10)}
{"identifier" (2, 0) - (2, 1)}
{"<-" (2, 2) - (2, 4)}
{"string" (2, 5) - (2, 10)}
{"\"" (2, 5) - (2, 6)}
{"\"" (2, 9) - (2, 10)}
{")" (2, 10) - (2, 10)}, MISSING
This time we didn't see ERROR. Instead tree-sitter is smart enough to insert the missing ) and marked it as missing. (But again not that this is only lexically correct and not necessarily what programmer want.)
What we get from the tree is:
- When traversing the tree, we see the
)at line 2 col 10 is in factmissing.tree-sitterhas API to go back to previous sibling. Via that, we find(at line 1 col 9 to line 1 col 10. And we conclude that this is the point where we have unbalanced bracket.
I saw the tree-sitter project a while ago, but didn't have time to investigate. I'm always interested to see if we could use a more error-tolerant, performant, and informative parser.
Thanks for providing the information and examples. I'll play a bit with tree-sitter in R.
Just googled this https://github.com/jimhester/rtreesitter/ it may be helpful.
this is incomplete. there's no cursor API.
this is incomplete. there's no cursor API.
There is now a TreeCursor API https://docs.rs/tree-sitter/latest/tree_sitter/struct.TreeCursor.html