tree-sitter-c-sharp icon indicating copy to clipboard operation
tree-sitter-c-sharp copied to clipboard

Error on conditional compilation preprocessor directive

Open Sjord opened this issue 4 years ago • 6 comments

var a = new ArrayList()
{
    #if DEBUG
        Capacity = 3
    #else
        Capacity = 4
    #endif
};

Actual:

(compilation_unit [0, 8] - [7, 10]
  (global_statement [0, 8] - [7, 10]
    (local_declaration_statement [0, 8] - [7, 10]
      (variable_declaration [0, 8] - [7, 9]
        type: (implicit_type [0, 8] - [0, 11])
        (variable_declarator [0, 12] - [7, 9]
          (identifier [0, 12] - [0, 13])
          (equals_value_clause [0, 14] - [7, 9]
            (object_creation_expression [0, 16] - [7, 9]
              type: (identifier [0, 20] - [0, 29])
              arguments: (argument_list [0, 29] - [0, 31])
              initializer: (initializer_expression [1, 8] - [7, 9]
                (preprocessor_call [2, 12] - [2, 21]
                  (preprocessor_directive [2, 12] - [2, 15])
                  (identifier [2, 16] - [2, 21]))
                (assignment_expression [3, 16] - [5, 28]
                  left: (identifier [3, 16] - [3, 24])
                  (assignment_operator [3, 25] - [3, 26])
                  right: (assignment_expression [3, 27] - [5, 28]
                    left: (integer_literal [3, 27] - [3, 28])
                    (preprocessor_call [4, 12] - [4, 17]
                      (preprocessor_directive [4, 12] - [4, 17]))
                    (ERROR [5, 16] - [5, 24])
                    (assignment_operator [5, 25] - [5, 26])
                    right: (integer_literal [5, 27] - [5, 28])))
                (preprocessor_call [6, 12] - [6, 18]
                  (preprocessor_directive [6, 12] - [6, 18]))))))))))
foo.cs	0 ms	(ERROR [5, 16] - [5, 24])

Expected:

(compilation_unit [0, 8] - [8, 0]
  (global_statement [0, 8] - [7, 10]
    (local_declaration_statement [0, 8] - [7, 10]
      (variable_declaration [0, 8] - [7, 9]
        type: (implicit_type [0, 8] - [0, 11])
        (variable_declarator [0, 12] - [7, 9]
          (identifier [0, 12] - [0, 13])
          (equals_value_clause [0, 14] - [7, 9]
            (object_creation_expression [0, 16] - [7, 9]
              type: (identifier [0, 20] - [0, 29])
              arguments: (argument_list [0, 29] - [0, 31])
              initializer: (initializer_expression [1, 8] - [7, 9]
                (preprocessor_call [2, 12] - [2, 21]
                  (preprocessor_directive [2, 12] - [2, 15])
                  (identifier [2, 16] - [2, 21]))
                (assignment_expression [3, 16] - [3, 28]
                  left: (identifier [3, 16] - [3, 24])
                  (assignment_operator [3, 25] - [3, 26])
                  right: (integer_literal [3, 27] - [3, 28]))
                (preprocessor_call [4, 12] - [4, 17]
                  (preprocessor_directive [4, 12] - [4, 17]))
                (assignment_expression [5, 16] - [5, 28]
                  left: (identifier [5, 16] - [5, 24])
                  (assignment_operator [5, 25] - [5, 26])
                  right: (integer_literal [5, 27] - [5, 28]))
                (preprocessor_call [6, 12] - [6, 18]
                  (preprocessor_directive [6, 12] - [6, 18]))))))))))

Sjord avatar May 25 '21 09:05 Sjord

We've run into this situation before and our options are limited (unless tree-sitter has changed since last time I discussed it with @maxbrunsfeld ) In fact, there are in fact 3 scenarios in the failing examples test for Newtonsoft's Json.NET that follow a similar pattern.

Basically in order to provide the most amount of syntax highlighting and analysis we interpret both sides of an #if condition. Which means they need to be valid code as if the #if's were removed. The example above is not because there is no comma after the first condition.

This isn't just limited to tree-sitter, text-mate grammars and indeed anything that isn't hooked in to a full compiler or evaluator is going to mess up here. Even just analyzing the file on it's own isn't enough as the conditions are buried in the .csproj most of the time.

Some options I've thought about before:

  1. Ignore #else blocks - would get us further and this is what we used to do a while back. There would be no way for users to toggle or otherwise get insight/syntax highlighting/goto etc. on that second half. It could also easily be tricked/fail by simple things like #if DEBUG/#endif followed by #if !DEBUG/#endif producing an invalid combination.
  2. Have tree-sitter branch the parsing tree when it reached #if conditions to consider each part as a separate continuation of the part before until the #endif. This is non-trivial and tree-sitter would need a new type of node to indicate this branching node.
  3. Have tree-sitter be capable of exposing a list of found tokens and allowing them to be set/unset to see a single specific tree at a time. This would be most like a compiler without having to compile but it would require some evaluation of preprocessor instructions.
  4. Do nothing and provide guidance on how people should use #if conditions to avoid syntax and parsing issues with non-compiler based tools such as this, VSCode, TextMate, Sublime, Emacs, etc.

My favorite is 2 as it would just appear "as magic" and expose all code. Option 3 would feel more like a compiler but it would also be a bit manual/only useful for interactive scenarios and quite invasive.

I don't know if either 2 or 3 fits withing tree-sitters remit or goals.

Here's the PR I sent to the JSON.NET people to simplify their #ifs to avoid the problem entirely https://github.com/JamesNK/Newtonsoft.Json/pull/2400

damieng avatar May 25 '21 11:05 damieng

Yeah, great explanation of the current state @damieng.

@Sjord I'm curious what behavior you think makes sense for your use case at r2c. Our understanding of these preprocessor constructs is always going to be approximate.

In case the reference is helpful, we handle it slightly differently in the C grammar. There, we use separate nodes for modeling #if/#else blocks in a few specific constructs, so that Tree-sitter can understand things like this, parsing them into a full syntax tree.

struct S {
#if A
  int field_1;
#else
  size_t field_1;
#endif
}

int main() {
#ifdef B
  puts("B");
#else
  puts("no B");
#endif
}

This approach still fails in cases where the preprocessor is used in more uncommon ways, but in many cases we get a very good syntax tree that incorporates the preprocessor conditional.

maxbrunsfeld avatar May 25 '21 16:05 maxbrunsfeld

For r2c/semgrep, I think whatever tree-sitter can do without blowing up the grammar. On semgrep-side I think we would probably just keep one of the branch and drop the surrounding directives internally. That means people would not be able to match directives, but I think it's simpler that way for now. If there's really a need to match over directive, we can rediscuss, but in theory you can ifdef any part of the code, including unfinished expressions, so it's very hard to parse.

aryx avatar May 26 '21 10:05 aryx

On a related note, I was tracking down an issue with textmate grammar which GitHub is currently using, and noticed that tree-sitter also doesn't handle this case:

class A
{
    static bool B() => Condition
#if X
            ? XOne()
            : XTwo();
#else
            ? YOne()
            : YTwo();
#endif
}

the highlighting is off after the #else directive. tree-sitter produces:

compilation_unit [0, 0] - [11, 0]
  class_declaration [0, 0] - [10, 1]
    name: identifier [0, 6] - [0, 7]
    body: declaration_list [1, 0] - [10, 1]
      method_declaration [2, 4] - [5, 21]
        modifier [2, 4] - [2, 10]
        type: predefined_type [2, 11] - [2, 15]
        name: identifier [2, 16] - [2, 17]
        parameters: parameter_list [2, 17] - [2, 19]
        body: arrow_expression_clause [2, 20] - [5, 20]
          conditional_expression [2, 23] - [5, 20]
            condition: identifier [2, 23] - [2, 32]
            preprocessor_call [3, 0] - [3, 5]
              preprocessor_directive [3, 0] - [3, 3]
              identifier [3, 4] - [3, 5]
            consequence: invocation_expression [4, 14] - [4, 20]
              function: identifier [4, 14] - [4, 18]
              arguments: argument_list [4, 18] - [4, 20]
            alternative: invocation_expression [5, 14] - [5, 20]
              function: identifier [5, 14] - [5, 18]
              arguments: argument_list [5, 18] - [5, 20]
      preprocessor_call [6, 0] - [6, 5]
        preprocessor_directive [6, 0] - [6, 5]
      method_declaration [6, 5] - [8, 21]
        type: nullable_type [6, 5] - [7, 13]
          identifier [6, 5] - [6, 5]
        name: identifier [7, 14] - [7, 18]
        ERROR [7, 18] - [8, 18]
          parameter_list [7, 18] - [7, 20]
        parameters: parameter_list [8, 18] - [8, 20]
      preprocessor_call [9, 0] - [9, 6]
        preprocessor_directive [9, 0] - [9, 6]

am11 avatar Jul 27 '21 11:07 am11

Yes, this is the same underlying cause and limitation. There is no way anything other than a compiler can definitively resolve that kind of fragmented condition - you would need to evaluate all the preprocessor conditions to know which paths are valid.

damieng avatar Jul 27 '21 20:07 damieng

That said, it is my intention to adopt something similar to what the C grammar is doing that Max described above which will at least let us parse/recover even if we don't definitively know what was intended (I'm just wrapping up a paid project and hope to have some time to devote to tree-sitter-c-sharp again in the next week or so)

damieng avatar Jul 27 '21 20:07 damieng