gumtree
gumtree copied to clipboard
No syntax exception raised in some cases for JDT
String literals seem to be ignored in some cases.
(tested with build 6584227)
Hi! well spotted! Can you attach the test files? Thanks!
Here they are. a0.java.txt a1.java.txt
Great!
Ok with a quick gumtree parse a0.java
I get
CompilationUnit [0,114]
TypeDeclaration [0,113]
TYPE_DECLARATION_KIND: class [0,5]
SimpleName: Foo [6,9]
MethodDeclaration [14,111]
Modifier: public [14,20]
Modifier: static [21,27]
Modifier: final [28,33]
Modifier: public [49,55]
Modifier: static [56,62]
PrimitiveType: void [63,67]
SimpleName: main [68,72]
Block [75,111]
ExpressionStatement [81,107]
MethodInvocation [81,106]
METHOD_INVOCATION_RECEIVER [81,91]
QualifiedName: System.out [81,91]
SimpleName: println [92,99]
METHOD_INVOCATION_ARGUMENTS [100,105]
StringLiteral: "bar" [100,105]
The field declaration is strangely completely ignored... This is strange. I have to look in the jdt generator code.
OK by opening the file I realize that there is no return type for the attribute which is a syntax error. If you add the type String to both files it works. I guess the big question is why there is no SyntaxError raised by JDT in this case ;-)
Confirmed. Thanks! How about comparing another pair of files with python-treesitter? a0.py.txt a1.py.txt String literals are missing in the parse trees...
Working on my side, no?
module [0,113]
expression_statement [0,9]
assignment [0,9]
identifier: s [0,1]
=: = [2,3]
string: 'bar' [4,9]
function_definition [12,112]
identifier: match [16,21]
parameters [21,30]
identifier: command [22,29]
block [36,112]
return_statement [36,112]
parenthesized_expression [43,112]
boolean_operator [44,111]
comparison_operator [44,74]
string: 'push' [44,50]
attribute [54,74]
identifier: command [54,61]
identifier: script_parts [62,74]
and: and [75,78]
comparison_operator [79,111]
string: 'set-upstream' [79,93]
attribute [97,111]
identifier: command [97,104]
identifier: output [105,111]⏎
Parsing results in the following in my environment...
module [0,113]
expression_statement [0,9]
assignment [0,9]
identifier: s [0,1]
=: = [2,3]
string [4,9]
": ' [4,5]
": ' [8,9]
function_definition [12,112]
def: def [12,15]
identifier: match [16,21]
parameters [21,30]
(: ( [21,22]
identifier: command [22,29]
): ) [29,30]
:: : [30,31]
block [36,112]
return_statement [36,112]
return: return [36,42]
parenthesized_expression [43,112]
(: ( [43,44]
boolean_operator [44,111]
comparison_operator [44,74]
string [44,50]
": ' [44,45]
": ' [49,50]
in: in [51,53]
attribute [54,74]
identifier: command [54,61]
.: . [61,62]
identifier: script_parts [62,74]
and: and [75,78]
comparison_operator [79,111]
string [79,93]
": " [79,80]
": " [92,93]
in: in [94,96]
attribute [97,111]
identifier: command [97,104]
.: . [104,105]
identifier: output [105,111]
): ) [111,112]
@jrfaller I can confirm that JDT Parser skips parts of the AST that are syntactically invalid. Missing type in a field declaration makes it syntactically invalid. JDT Parser will continue parsing the rest of the file.
@tsantalis yes just what I realized. On top of that, neither the flag MALFORMED nor the flag RECOVERED seems to work in these cases. The only thing I get is that I have one "problem" in the problem list when casting the node to a CompilationUnit. I guess it could work better to detect syntax errors but I fear that it will report more than just syntax errors 😢 WDYT?
@jrfaller Updating tree-sitter-parser resolved the problem. Thanks a lot!
@tsantalis yes just what I realized. On top of that, neither the flag MALFORMED nor the flag RECOVERED seems to work in these cases. The only thing I get is that I have one "problem" in the problem list when casting the node to a CompilationUnit. I guess it could work better to detect syntax errors but I fear that it will report more than just syntax errors 😢 WDYT?
For commit analysis, which cannot be guaranteed to be syntactically valid and compilable, I think what JDT Parser is doing is fine. Instead of failing the whole file, it just skips the syntactically invalid sub-trees. There should be a way to isolate syntax errors from the problem list. I will look into it.
That reminds me of something.
I have seen conflict markers >>>>>>>, =======, and <<<<<<< left even in release versions of some projects.
@jrfaller
We discovered that JDT ASTParser recommends a version to setup the parser, when it finds a parsing problem. For example, when parsing a record declaration with 1.8 compatibility, it generates a problem suggesting the use of Java 16 compatibility to parse record declarations.
We basically process the reported problems and re-parse the file with the recommended compatibility version. Here is our implementation in RefactoringMiner.
https://github.com/tsantalis/RefactoringMiner/blob/master/src/main/java/gr/uom/java/xmi/UMLModelASTReader.java#L160-L177
Hi @tsantalis and thanks a lot for the tips. In the last commit I switched to private final static String JAVA_VERSION = JavaCore.latestSupportedJavaVersion(); for compiler settings and ASTParser parser = ASTParser.newParser(AST.getJLSLatest()); for the parser initialization. Do you think anything can go wrong like this or parsers are backward compatible? Cheers!
@jrfaller
The parser initialization ASTParser parser = ASTParser.newParser(AST.getJLSLatest());
is totally fine and creates no problems.
The problem is with the compiler settings.
For example, since Java 16 record is a keyword reserved for record type declarations.
But before Java 16, you could use record as a variable name in your code.
(The same thing happened many years ago when enum was introduced as a keyword in Java 1.5,
but now it is very rare to find a commit using enum as a variable. The code should be older than 2004.)
If you put the latest Java version in the compiler settings, and parse a code using record as a variable
the parser will not generate a complete AST and will skip the parts using record as a variable name.
The best strategy is to start with a lower version compiler setting.
We start from JavaCore.VERSION_14 because we found that since this version the parser generates recommendations about what version to use to parse some code, which was not successfully parsed.
If we get such a recommendation in the problems, we re-parse the file using the recommended version in the compiler settings.
@tsantalis thanks a lot for the detailed explanation (and for pointing me the fix!)