tree-sitter-c Support parsing unterminated statements

Currently the parser can only parse terminated statements successfully, such as:

const char * myarray[25];

But if you feed it something like

const char * [25]

or

const char *

it emits an error. It would be beneficial to support parsing such fragments too. @thestr4ng3r proposed the following change to the grammar:

diff --git a/grammar.js b/grammar.js
index 6a5fa25..5dc99a3 100644
--- a/grammar.js
+++ b/grammar.js
@@ -51,6 +51,7 @@ module.exports = grammar({
   word: $ => $.identifier,

   rules: {
+    the_actual_root: $ => $.type_descriptor,
     translation_unit: $ => repeat($._top_level_item),

     _top_level_item: $ => choice(

Apr 27 '21 04:04 XVilka

To clarify: in addition to parsing a full translation_unit like int a() { x = (const char *[25])y; }, we need to parse, in the same application, only the part inside the cast, like const char *[25], which is a type_descriptor.

So essentially, we would need a way to choose the root rule at runtime, which isn't really tree-sitter-c specific. I wonder if that is even theoretically possible given how the code generator works.

Alternatively, we will have to use two grammars: one is the original tree-sitter-c, and the other is conceptually what is shown in the issue description (which of course breaks parsing of regular translation_units).

Apr 27 '21 05:04 thestr4ng3r

Maybe it is worth transferring the issue to the tree-sitter repository, then? @maxbrunsfeld

May 06 '21 08:05 XVilka

When you need to parse a fragment of incomplete source code (like a type_descriptor), can you just surround the fragment with a "context" that turns it into a valid C translation unit, and then extract out the piece of the syntax tree that you're interested in?

For example, to parse a type_descriptor, take the input string, append the suffix string x;, parse that combined string, and then take the subtree for the relevant byte range.

There is a long-standing Tree-sitter issue about selecting alternative root rules at runtime, but that is going to be complex to implement. This workaround seems quite straightforward, and it scales to cases where you have many different rules that you want to try.
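The wrap-and-extract idea above can be sketched as a small helper that builds the combined string and records where the original fragment sits inside it. This is only an illustration: wrapFragment, utf8Length, and the result fields are hypothetical names, and the offsets assume the parser is fed UTF-8. The recorded range is what one would then hand to a lookup such as ts_node_descendant_for_byte_range in the Tree-sitter C API.

```javascript
// Hypothetical helper, not part of tree-sitter: all names here are illustrative.

// Length of a string in UTF-8 bytes (assumes the parser input is UTF-8).
const utf8Length = (s) => new TextEncoder().encode(s).length;

// Surround `fragment` with a context and remember the fragment's byte range
// inside the combined source, so the matching subtree can be looked up later.
function wrapFragment(fragment, prefix, suffix) {
  const startByte = utf8Length(prefix);
  return {
    source: prefix + fragment + suffix,
    startByte,
    endByte: startByte + utf8Length(fragment),
  };
}

// The suggestion from this comment: no prefix, and " x;" as the suffix.
const wrapped = wrapFragment('const char *', '', ' x;');
console.log(wrapped.source, wrapped.startByte, wrapped.endByte);
```

Keeping the prefix and suffix as parameters means the same helper can be reused with a different context for each rule one wants to try.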

May 13 '21 16:05 maxbrunsfeld

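Appending x; would not work for type_descriptor, because the result parses as a declaration with a pointer_declarator and the tree contains no type_descriptor node: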

char *x;

(translation_unit [0, 0] - [1, 0]
  (declaration [0, 0] - [0, 8]
    type: (primitive_type [0, 0] - [0, 4])
    declarator: (pointer_declarator [0, 5] - [0, 7]
      declarator: (identifier [0, 6] - [0, 7]))))

But we could in theory use a cast, so assuming we want to parse const char *[42], wrap it like so:

void a() { (const char *[42])x; }

(translation_unit [0, 0] - [1, 0]
  (function_definition [0, 0] - [0, 33]
    type: (primitive_type [0, 0] - [0, 4])
    declarator: (function_declarator [0, 5] - [0, 8]
      declarator: (identifier [0, 5] - [0, 6])
      parameters: (parameter_list [0, 6] - [0, 8]))
    body: (compound_statement [0, 9] - [0, 33]
      (expression_statement [0, 11] - [0, 31]
        (cast_expression [0, 11] - [0, 30]
          type: (type_descriptor [0, 12] - [0, 28]
            (type_qualifier [0, 12] - [0, 17])
            type: (primitive_type [0, 18] - [0, 22])
            declarator: (abstract_pointer_declarator [0, 23] - [0, 28]
              declarator: (abstract_array_declarator [0, 24] - [0, 28]
                size: (number_literal [0, 25] - [0, 27]))))
          value: (identifier [0, 29] - [0, 30]))))))

The reason we can't do this in practice is that the string we want to parse could perform a kind of injection and easily escape our wrapping. For example, when we try to parse int)0; now_i_have_escaped(); //, we want to get a meaningful error rather than a well-parsed int type_descriptor with some garbage in the wrapped tree.

But I think the first workaround proposed in https://github.com/tree-sitter/tree-sitter/issues/870, which is to always prepend some magic string that tells the parser how to proceed, could work very well for us.
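To make the injection concern with the cast wrapper concrete, here is a sketch of one conceivable guard: accept the result only when the type_descriptor node spans exactly the bytes of the fragment that was wrapped. Nobody in this thread proposed it, and it is not claimed to be sufficient. wrapInCast, coversWholeFragment, and the plain node objects with type, startByte, and endByte fields are hypothetical stand-ins rather than a real binding API. The second node range is an assumption based on the description above of a well-parsed int type_descriptor.

```javascript
// Hypothetical guard for the cast wrapper; the node objects below are plain
// stand-ins with an assumed shape, not objects returned by a real parser.
const CAST_PREFIX = 'void a() { (';
const CAST_SUFFIX = ')x; }';
const byteLen = (s) => new TextEncoder().encode(s).length;

// Wrap the (trimmed) fragment in a cast and record its byte range.
function wrapInCast(fragment) {
  const trimmed = fragment.trim();
  const startByte = byteLen(CAST_PREFIX);
  return {
    source: CAST_PREFIX + trimmed + CAST_SUFFIX,
    startByte,
    endByte: startByte + byteLen(trimmed),
  };
}

// Accept only a type_descriptor that covers the whole fragment.
function coversWholeFragment(node, wrapped) {
  return node.type === 'type_descriptor' &&
    node.startByte === wrapped.startByte &&
    node.endByte === wrapped.endByte;
}

// Range taken from the tree printed above: type_descriptor [0, 12] - [0, 28].
const good = wrapInCast('const char *[42]');
const goodNode = { type: 'type_descriptor', startByte: 12, endByte: 28 };

// Assumed outcome for the injection string: a type_descriptor over "int" only.
const evil = wrapInCast('int)0; now_i_have_escaped(); //');
const evilNode = { type: 'type_descriptor', startByte: 12, endByte: 15 };

console.log(coversWholeFragment(goodNode, good)); // true
console.log(coversWholeFragment(evilNode, evil)); // false
```

A check like this would still need to be combined with a check for error nodes inside the matched subtree, which is part of why a dedicated entry point in the grammar looks more robust.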

May 13 '21 17:05 thestr4ng3r

Just for the record, this is what I came up with:

    [$._type_specifier, $._expression],
    [$._type_specifier, $._expression, $.macro_type_specifier],
    [$._type_specifier, $.macro_type_specifier],
+   [$.type_expression, $._abstract_declarator],
+   [$.type_expression],
    [$.sized_type_specifier],
  ],

  word: $ => $.identifier,

  rules: {
-    translation_unit: $ => repeat($._top_level_item),
+    translation_unit: $ => choice(
+            repeat1($.type_expression),
+            repeat1($._top_level_item)
+    ),
+
+    type_expression: $ => seq(
+       '__TYPE_EXPRESSION',
+       repeat($.type_qualifier),
+       field('type', $._type_specifier),
+       repeat($.abstract_pointer_declarator),
+       repeat($.abstract_array_declarator),
+       repeat($.abstract_pointer_declarator),
+    ),

Examples of what it can parse, taken from https://github.com/XVilka/tree-sitter-c/commit/fed7bd082a234d2aad76942210571f6361c32697:

__TYPE_EXPRESSION const int* [5]
__TYPE_EXPRESSION volatile uint8_t* [2]
__TYPE_EXPRESSION const uintptr_t* []
__TYPE_EXPRESSION struct s1 *

__TYPE_EXPRESSION struct s2 {
  int x;
  float y : 5;
} [5]
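On the caller side, inputs like the ones above could be produced by a small helper that prepends the marker keyword. This is a sketch only: toTypeExpressionInput is a hypothetical name. Rejecting fragments that already contain the marker is an extra precaution assumed here, because the modified translation_unit rule accepts repeat1($.type_expression), so a second marker inside the fragment could start another type_expression.

```javascript
// Hypothetical caller-side helper for the forked grammar shown above.
const TYPE_MARKER = '__TYPE_EXPRESSION';

// Build parser input for a bare type fragment by prepending the marker.
function toTypeExpressionInput(fragment) {
  // Assumed precaution: a marker inside the fragment could begin a second
  // type_expression, since the root rule allows repeat1($.type_expression).
  if (fragment.includes(TYPE_MARKER)) {
    throw new Error('fragment must not contain ' + TYPE_MARKER);
  }
  return `${TYPE_MARKER} ${fragment}`;
}

console.log(toTypeExpressionInput('const int* [5]'));
```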

May 14 '21 09:05 XVilka
