Handling of preprocessor macros is not general enough
While looking into how one could tackle zeek/tree-sitter-zeek#6, I looked at this grammar for inspiration and noticed that it has similar issues. In C or C++, preprocessor directives can appear around pretty much any token of the language, while this grammar only allows them in a couple of places. I wonder what the best approach to this would be.
As an example, the following source file
int
#if 0
foo
#else
main
#endif
(void) {}
produces this AST
(translation_unit
(ERROR
(primitive_type))
(preproc_if
(number_literal)
(ERROR
(identifier))
(preproc_else)
(ERROR
(identifier)))
(expression_statement
(compound_literal_expression
(type_descriptor
One could come up with nastier examples where, e.g., an opening parenthesis is inside a preprocessor block. I am not even sure what the resulting AST should look like, but I feel like I might want something that supports preprocessor directives anywhere, with more structure than what `extras` is typically used for. Would there be a way to support this with an external scanner?
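For instance, a nastier case along these lines might look like the sketch below. The `USE_WIDE` macro and the `sum` function are made up for illustration. The opening parenthesis of the parameter list sits inside the `#if`/`#else` branches while the closing one comes after `#endif`, so neither branch is a complete construct on its own, yet the file is valid C whichever branch is taken.

```c
/* Hypothetical example: USE_WIDE and sum() are illustrative names only. */
#ifndef USE_WIDE
#define USE_WIDE 0
#endif

/* Each branch opens the parameter list; the list is closed after #endif. */
#if USE_WIDE
long sum(long a,
#else
int sum(int a,
#endif
        int b) { return a + b; }
```

A tree that nests the function definition under the `#if` node, or the `#if` node under the function definition, does not obviously fit this input, which is why I am unsure what shape the AST should take.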
There is also already #13, but it seems to be more focussed on improving the handling of the currently supported special cases.
Here's a typical scenario where we come across this problem.
#ifdef __cplusplus
extern "C" {
#endif
#ifdef __cplusplus
}
#endif
AST: https://tree-sitter.github.io/tree-sitter/playground#
translation_unit [0, 0] - [8, 0]
preproc_ifdef [0, 0] - [6, 6]
name: identifier [0, 7] - [0, 18]
linkage_specification [1, 0] - [5, 1]
value: string_literal [1, 7] - [1, 10]
string_content [1, 8] - [1, 9]
body: declaration_list [1, 11] - [5, 1]
preproc_call [2, 0] - [3, 0] <<<<<<<<< 🧐
directive: preproc_directive [2, 0] - [2, 6]
preproc_ifdef [4, 0] - [4, 18]
name: identifier [4, 7] - [4, 18]
MISSING #endif [4, 18] - [4, 18] <<<<<<<<< 🧐
Another example from the CMakeCXXCompilerId.cpp file that CMake generates during a build (yes, C++ source file, but also valid C):
char const info_version[] = {
'I', 'N', 'F', 'O', ':',
'c','o','m','p','i','l','e','r','_','v','e','r','s','i','o','n','[',
COMPILER_VERSION_MAJOR,
# ifdef COMPILER_VERSION_MINOR
'.', COMPILER_VERSION_MINOR,
# ifdef COMPILER_VERSION_PATCH
'.', COMPILER_VERSION_PATCH,
# ifdef COMPILER_VERSION_TWEAK
'.', COMPILER_VERSION_TWEAK,
# endif
# endif
# endif
']','\0'};
which gives error nodes:
translation_unit [0, 0] - [15, 0]
declaration [0, 0] - [13, 12]
type: primitive_type [0, 0] - [0, 4]
type_qualifier [0, 5] - [0, 10]
declarator: init_declarator [0, 11] - [13, 11]
declarator: array_declarator [0, 11] - [0, 25]
declarator: identifier [0, 11] - [0, 23]
value: initializer_list [0, 28] - [13, 11]
char_literal [1, 2] - [1, 5]
character [1, 3] - [1, 4]
char_literal [1, 7] - [1, 10]
character [1, 8] - [1, 9]
char_literal [1, 12] - [1, 15]
character [1, 13] - [1, 14]
char_literal [1, 17] - [1, 20]
character [1, 18] - [1, 19]
char_literal [1, 22] - [1, 25]
character [1, 23] - [1, 24]
char_literal [2, 2] - [2, 5]
character [2, 3] - [2, 4]
char_literal [2, 6] - [2, 9]
character [2, 7] - [2, 8]
char_literal [2, 10] - [2, 13]
character [2, 11] - [2, 12]
char_literal [2, 14] - [2, 17]
character [2, 15] - [2, 16]
char_literal [2, 18] - [2, 21]
character [2, 19] - [2, 20]
char_literal [2, 22] - [2, 25]
character [2, 23] - [2, 24]
char_literal [2, 26] - [2, 29]
character [2, 27] - [2, 28]
char_literal [2, 30] - [2, 33]
character [2, 31] - [2, 32]
char_literal [2, 34] - [2, 37]
character [2, 35] - [2, 36]
char_literal [2, 38] - [2, 41]
character [2, 39] - [2, 40]
char_literal [2, 42] - [2, 45]
character [2, 43] - [2, 44]
char_literal [2, 46] - [2, 49]
character [2, 47] - [2, 48]
char_literal [2, 50] - [2, 53]
character [2, 51] - [2, 52]
char_literal [2, 54] - [2, 57]
character [2, 55] - [2, 56]
char_literal [2, 58] - [2, 61]
character [2, 59] - [2, 60]
char_literal [2, 62] - [2, 65]
character [2, 63] - [2, 64]
char_literal [2, 66] - [2, 69]
character [2, 67] - [2, 68]
identifier [3, 2] - [3, 24]
ERROR [4, 0] - [4, 30]
identifier [4, 8] - [4, 30]
char_literal [5, 2] - [5, 5]
character [5, 3] - [5, 4]
identifier [5, 7] - [5, 29]
ERROR [6, 0] - [6, 31]
identifier [6, 9] - [6, 31]
char_literal [7, 3] - [7, 6]
character [7, 4] - [7, 5]
identifier [7, 8] - [7, 30]
ERROR [8, 0] - [8, 32]
identifier [8, 10] - [8, 32]
char_literal [9, 4] - [9, 7]
character [9, 5] - [9, 6]
identifier [9, 9] - [9, 31]
ERROR [10, 0] - [12, 7]
preproc_directive [10, 0] - [10, 9]
char_literal [13, 2] - [13, 5]
character [13, 3] - [13, 4]
char_literal [13, 6] - [13, 10]
escape_sequence [13, 7] - [13, 9]
Here is another example in the same vein. This code
if (true)
#define BLAH
return;
produces
(translation_unit [0, 0] - [3, 0]
(if_statement [0, 0] - [0, 9]
condition: (parenthesized_expression [0, 3] - [0, 9]
(true [0, 4] - [0, 8]))
consequence: (expression_statement [0, 9] - [0, 9]))
(preproc_def [1, 4] - [2, 0]
name: (identifier [1, 12] - [1, 16]))
(return_statement [2, 4] - [2, 11]))
But both the preproc_def and the return_statement should be children of the if_statement.