tree-sitter-c
tree-sitter-c copied to clipboard
Error parsing preprocessor statements with a space in front after stubbed macros
I'm noticing a misparse when a #define is preceded by a space.
#define MACRO_STUB
#define MACRO x
yields this parse:
(translation_unit (preproc_def name: (identifier)) (preproc_def name: (identifier) value: (preproc_arg)))
but if I insert a space before the second #define like this:
#define MACRO_STUB
#define MACRO x
yields this (incorrect) parse:
(translation_unit (preproc_def name: (identifier) value: (preproc_arg)))
Notably, the parser still produces correct output if the first macro is not stubbed:
#define MACRO_NOSTUB x
#define MACRO x
(translation_unit (preproc_def name: (identifier) value: (preproc_arg)) (preproc_def name: (identifier) value: (preproc_arg)))
I tried like one change so far that didn't work (since I don't understand the parser well), changing preproc_arg to:
preproc_arg: $ => token(prec(-1, repeat1(/[^\n]|\\\r?\n/))),
Thanks for the report.
Have you consider implementing the preproc directives as extras
? They can be legally used in uncountable ways.
Another example that ts-c fails to parse:
const char foo[] {
#include "filename"
};
From what I read about extras, is that they live outside of the grammar. So anything classified under extras would not appear in the parsed tree.
Implementing the preproc as extras would mean that a syntax highlighter using ts-c could not highlight any preproc statement, since the preproc wouldn't appear in the parsed tree.
There are also cases where implementing as extras would cause trouble:
if (cond_a ||
#if X
cond_b
#else
cond_c
#endif
) {}
Which is not my favorite code to look at, but I've seen it before.
So parsing c plus the preproc correctly is not easy, since you really need full knowledge of the compile in order to do it correctly. But I still think an attempt should be made to parse the preproc in order to handle common cases where people aren't getting too crazy with the preprocessor.
Now I'm taking a second look, the extras field is how comments are handled in the parser, so it might work to add the preprocessor there.
I did some more testing and changing to token immediate seems to handle the issue and pass all the tests:
--- a/grammar.js
+++ b/grammar.js
@@ -113,7 +115,8 @@ module.exports = grammar({
...preprocIf('_in_field_declaration_list', $ => $._field_declaration_list_item),
preproc_directive: $ => /#[ \t]*[a-zA-Z]\w*/,
- preproc_arg: $ => token(prec(-1, repeat1(/.|\\\r?\n/))),
+ preproc_arg: $ => token.immediate(prec(-1, repeat1(/[^\n]|\\\r?\n/))),
_preproc_expression: $ => choice(
$.identifier,
However, I'm almost certain it won't have the correct behaviour for a macro followed by a comment. I also need to check out if this adds spaces to the preproc_arg
token when parsing (I'm guessing it will).
So there is more work to be done, but I think changing to use token.immediate is the way to go, because the c grammar's extras are causing issues in the preproc.
I'm guessing the root cause is that stubbed macros consume the first newline because it's counted as an extras
, and then eats the rest of the next macro because it hasn't encountered a newline yet. Then a parsing tie with the next line is broken by tree-sitter using match length see here. I'm not sure why I wasn't able to resolve the issue using precedence though.
In the extreme case, a preproc directive can occur between any two tokens. I think putting them in extras
is the only correct solution.
@pluick -- It seems comment tokens are returned by tree-sitter, there is even a test case (extra_non_terminals
) in the tree-sitter repo which shows them being returned despite being in extras
. Otherwise, every editor using tree-sitter for syntax highlighting must be using language-specific hacks for highlighting comments...
@maxbrunsfeld are you with the macros (and attributes?) being described as extra
? I can make a PR if approve the idea.
Another option is to implement the directives as a external language, since there are a few languages that share the same syntax (yacc, bison, gcc's md, etc).
It's useful to include some preprocessor directives (like #if
/#else
, etc) in the main grammar, because we want those constructs represented structurally in the syntax tree, which allow you to do things like code folding of #if
in a text editor.
Maybe some other preprocessor directives would be better handled as non-terminal extras though.
I've come up with 2 possible solutions. The original solution I posted above still fixes my issue, and I couldn't find any regressions with it:
preproc_arg: $ => token.immediate(prec(-1, repeat1(/[^\n]|\\\r?\n/))),
Then I tried to come up with something to handle comments after macros (they are currently parsed as a part of the macro).
preproc_arg: $ => token.immediate(prec(-1, repeat1(/[^\n\/]|\/[^*\/]|\\\r?\n/))),
Which parses the examples in #59 correctly, except for one case that I created:
#define MACRO \
{ \
statment = y; /* comment */ \
}
(ERROR (identifier) (preproc_arg) (comment))
I guess the decision goes to @maxbrunsfeld , which change do you want? The one that fixes only this issue or the one with a regression?
I'd just like to point out an example:
void f() {
#if A
if(x) {
#else
if(y) {
#endif
blech();
}
}
There are two versions of the file -- both are syntactically valid, but it's impossible to combine both versions in a way that fits the grammar. I suppose this could be handled by treating the different branches as ambiguities but tree-sitter isn't designed to parse ambiguous grammars, is it?
If a 'perfect' parse is impossible, maybe the preprocessor aspect of the C/C++ grammars should be designed with that in mind.
This conversation looks like it went in another direction, but does #133 fix the original issue?
I'll assume yes as the given example parses fine, do comment if it doesn't though