tree-sitter-cpp

Content of string literals not captured

Open spaanse opened this issue 4 years ago • 5 comments

I am experimenting with using tree-sitter for reformatting C++

I wrote a Python script (I couldn't make it work in C++) that takes the syntax tree generated by tree-sitter and uses it to output "reformatted" code. Currently it traverses the syntax tree in order and, for every leaf node (child_count == 0), prints the part of the input between start_byte and end_byte. I also add whitespace here and there (whitespace is, and should be, lost in parsing), but that is not perfect yet.

While the majority of the code survives, the content of string literals disappears. See the example below:

Input:

#include <iostream>

int main() {
	cout << "Hello World!" << endl;
}

Output:

#include<iostream>

int main(){
	cout<<""<<endl;
	}

spaanse avatar Mar 12 '21 17:03 spaanse

Yeah, this is expected. Some text can belong to a parent node, and not be part of any leaf node. We could change the C/C++ grammar so that this wouldn't be the case in the specific case of string literals, but I think in general, you'd want to change your Python code to not drop all such text.
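
A minimal sketch of that suggestion, not taken from the thread: while walking the tree, also emit the bytes that lie between a parent's children, since that text belongs to the parent only. The code assumes nodes expose start_byte, end_byte and children, as the Python bindings do; FakeNode and spans are hypothetical names used here so the example runs without tree-sitter installed.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class FakeNode:
    """Hypothetical stand-in for a tree-sitter node (not part of any library)."""
    type: str
    start_byte: int
    end_byte: int
    children: List["FakeNode"] = field(default_factory=list)


def spans(node, source: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield ("leaf", text) for leaf nodes and ("gap", text) for text owned only by a parent."""
    if not node.children:
        yield ("leaf", source[node.start_byte:node.end_byte])
        return
    cursor = node.start_byte
    for child in node.children:
        # Bytes between the previous child and this one are in no leaf node.
        if child.start_byte > cursor:
            yield ("gap", source[cursor:child.start_byte])
        yield from spans(child, source)
        cursor = child.end_byte
    # Bytes after the last child but still inside the parent.
    if node.end_byte > cursor:
        yield ("gap", source[cursor:node.end_byte])


# The string literal from the report: only the two quotes are leaf nodes.
source = b'"Hello World!"'
literal = FakeNode("string_literal", 0, 14, [
    FakeNode('"', 0, 1),
    FakeNode('"', 13, 14),
])
pieces = list(spans(literal, source))
print(pieces)
```

Joining every piece reproduces the input byte for byte, so nothing is dropped. A formatter built on this would presumably normalize gaps that contain only whitespace and print all other gaps verbatim.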

maxbrunsfeld avatar Mar 12 '21 17:03 maxbrunsfeld

Yeah, if you dump the AST for this code:

`- {translation_unit:228} from (1, 1) to (6, 1) 
   |- {preproc_include:229} from (1, 1) to (3, 1) 
   |  |- {#include:2} from (1, 1) to (1, 9) : #include 
   |  |- {system_lib_string:122} from (1, 10) to (1, 20) : <iostream> 
   |  `- {\n:3} from (1, 20) to (3, 1) 
   `- {function_definition:249} from (3, 1) to (5, 2) 
      |- {primitive_type:74} from (3, 1) to (3, 4) : int 
      |- {function_declarator:273} from (3, 5) to (3, 11) : main() 
      |  |- {identifier:1} from (3, 5) to (3, 9) : main 
      |  `- {parameter_list:296} from (3, 9) to (3, 11) : () 
      |     |- {(:5} from (3, 9) to (3, 10) : ( 
      |     `- {):8} from (3, 10) to (3, 11) : ) 
      `- {compound_statement:282} from (3, 12) to (5, 2) 
         |- {{:56} from (3, 12) to (3, 13) : { 
         |- {expression_statement:299} from (4, 2) to (4, 33) : cout << "Hello World!" << endl; 
         |  |- {binary_expression:316} from (4, 2) to (4, 32) : cout << "Hello World!" << endl 
         |  |  |- {binary_expression:316} from (4, 2) to (4, 24) : cout << "Hello World!" 
         |  |  |  |- {identifier:1} from (4, 2) to (4, 6) : cout 
         |  |  |  |- {<<:37} from (4, 7) to (4, 9) : << 
         |  |  |  `- {string_literal:333} from (4, 10) to (4, 24) : "Hello World!" 
         |  |  |     |- {":119} from (4, 10) to (4, 11) : " 
         |  |  |     `- {":119} from (4, 23) to (4, 24) : " 
         |  |  |- {<<:37} from (4, 25) to (4, 27) : << 
         |  |  `- {identifier:1} from (4, 28) to (4, 32) : endl 
         |  `- {;:39} from (4, 32) to (4, 33) : ; 
         `- {}:57} from (5, 1) to (5, 2) : } 

You should probably write specific callbacks depending on the type of the node.
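
One way such callbacks could look, as a sketch under assumptions rather than anything posted in the thread: a table maps node types to handlers, and the handler for string_literal prints the node's whole byte range instead of descending into its children. The names node, emit, emit_default, emit_verbatim and HANDLERS are hypothetical, and the SimpleNamespace objects only mimic the type, start_byte, end_byte and children attributes of real nodes.

```python
from types import SimpleNamespace


def node(type_, start, end, children=()):
    """Build a hypothetical stand-in for a tree-sitter node."""
    return SimpleNamespace(type=type_, start_byte=start, end_byte=end,
                           children=list(children))


def emit_verbatim(n, source):
    """Print the node's full byte range, including text no child covers."""
    return source[n.start_byte:n.end_byte].decode()


def emit_default(n, source):
    """Print leaves as-is and concatenate the output of the children otherwise."""
    if not n.children:
        return source[n.start_byte:n.end_byte].decode()
    return "".join(emit(child, source) for child in n.children)


# Callbacks keyed by node type; any type not listed uses emit_default.
HANDLERS = {"string_literal": emit_verbatim}


def emit(n, source):
    return HANDLERS.get(n.type, emit_default)(n, source)


# Hand-built tree for: cout << "Hello World!"
source = b'cout << "Hello World!"'
tree = node("binary_expression", 0, 22, [
    node("identifier", 0, 4),
    node("<<", 5, 7),
    node("string_literal", 8, 22, [node('"', 8, 9), node('"', 21, 22)]),
])
print(emit(tree, source))
```

Whitespace between tokens is still dropped, as in the original script, but the string content survives because the string_literal callback never looks at the children.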

calixteman avatar Mar 12 '21 17:03 calixteman

It looks like it isn't captured as its own leaf node because it is an immediate token. Take the following example:

#include <iostream>

int main() {
	cout << "\x1b[32mHello World!\x1b[0m" << endl;
}

Which goes to

#include<iostream>

int main(){
	cout<<"\x1b\x1b"<<endl;
	}

As the name suggests, it is a token and therefore should be captured. However, that seems to be less an issue with the C++ grammar than with tree-sitter itself or with the Python bindings for tree-sitter.

spaanse avatar Mar 12 '21 17:03 spaanse

I looked a bit into the grammar of C (as that is where string_literal comes from) and found no reason why the node is left out. Am I missing something, or is there something wrong in tree-sitter?

The string_literal grammar rule is as follows:

string_literal: $ => seq(
  choice('L"', 'u"', 'U"', 'u8"', '"'),
  repeat(choice(
    token.immediate(prec(1, /[^\\"\n]+/)),
    $.escape_sequence
  )),
  '"',
)

The content(s) of the string are an immediate token. From the documentation:

Tokens: token(rule) - This function marks the given rule as producing only a single token [...] Each token is matched separately by the lexer and returned as its own leaf node in the tree [...]

Immediate Tokens: token.immediate(rule) - Usually, whitespace (and any other extras, such as comments) is optional before each token. This function means that the token will only match if there is no whitespace.

This suggests that the content of the string should become a node (or several, if the string has escape sequences). Nothing in the list of modifiers suggests that the node should be removed from the tree. I suspect that the combination of an immediate token and a regex is the cause (the two rules that have this combination both break), though I don't know why.

spaanse avatar Mar 12 '21 20:03 spaanse

It's working as expected. The only nodes that show up in the tree are:

  • named rules
  • string literals

Right now, regex patterns and arbitrary token rules are not visible in the tree, unless you give them a name. Otherwise, there's no obvious name to give them.

maxbrunsfeld avatar Mar 12 '21 20:03 maxbrunsfeld

string content is exposed now

amaanq avatar Jul 25 '23 10:07 amaanq