Reqnroll Add initial lossless syntax tree implementation

🤔 What's changed?

This adds a new project Reqnroll.CodeAnalysis.Gherkin which provides a lossless syntax tree and a parser implementation intended for use with code generation, code-analysis and code-fix implementations operating over the Reqnroll flavour of the Gherkin language.

The syntax tree follows a fundamental rule that all input text is represented in the output tree:

Syntactically-significant text becomes syntax tokens.
Other text (such as whitespace) becomes syntax trivia and is attached to tokens.

Syntax tokens are grouped by syntax nodes which form the higher-order syntax structure of a feature file.

This initial implementation intends to fix the public API. The internal implementation will have room for performance improvements.

The most important detail is that all tokens, trivia and nodes are immutable and are represented in two flavours:

An internal "raw" flavour which contains the text of the token and its associated data.
A public flavour which wraps a raw token, adding references to parent tokens and explicit positional information.

This model makes it very efficient to create nodes outside of a syntax tree, then insert them into one or more syntax trees, or copy nodes between syntax trees - the raw nodes do not maintain a reference to their parents, so they can appear in multiple hierarchies. This will also make it considerably easier and efficient to implement any kind of incremental parsing in the future, where nodes that are unaffected by a source change can be copied into the new tree.

⚡️ What's your motivation?

This model will provide the specific positional information necessary to emitted code statements in Roslyn generators to their Gherkin source. It will also make it easier to implement more complex transforms, such as embedding the source document into Gherkin Messages. Finally, it will provide a simple and efficient model for driving incremental generation, which can be improved by adding incremental parsing as a future enhancement without needing to change the generator logic.

🏷️ What kind of change is this?

:zap: New feature (non-breaking change which adds new behaviour)

♻️ Anything particular you want feedback on?

📋 Checklist:

[ ] I've changed the behaviour of the code
- [ ] I have added/updated tests to cover my changes.
[ ] My change requires a change to the documentation.
- [ ] I have updated the documentation accordingly.
[ ] Users should know about my change
- [ ] I have added an entry to the "[vNext]" section of the CHANGELOG, linking to this pull request & included my GitHub handle to the release contributors list.

This text was originally taken from the template of the Cucumber project, then edited by hand. You can modify the template here.

Feb 11 '25 02:02 Code-Grump

I've created this PR at this stage for feedback on what exists so far. I would suggest the following types should be given the most attention:

SyntaxNode and RawSyntaxNode
SyntaxToken and RawSyntaxToken
SyntaxTrivia and RawSyntaxTrivia
SyntaxFactory and InternalSyntaxFactory
SyntaxTree
ParsedSyntaxTreeBuilder

The codebase intends to emulate the Microsoft.CodeAnalysis concepts for familiarity, using the same concepts of tokens, trivia and syntax nodes generally. The public API should not be considered complete (there are many [obvious additions that could be made) but should be considered representative of the core behaviour desired by the code generators.

The current implementation is able to parse a feature file with a feature declaration, but no scenarios. I've added some support for handling errors, but there's a lot more work that needs to be done in that space to handle every kind of failure.

Through creating the structures for representing a feature file and the single feature declaration, I can see a lot of opportunity to generate code. Once I get a sense that we're happy with this structure and would like to proceed, I'll investigate how we might generate all the desired structures following the same patterns.

Feb 11 '25 02:02 Code-Grump

Thank you. I like the approach of making small steps towards our generation goals. 🤘

To get a design overview of the change, do I see it correctly, that this change basically uses the Gherkin parser, but provides an alternative AST builder and token scanner/matcher that fits to the purpose, correct?

Also if I understand correctly this is now an alternative parser that is currently not being used yet, but would be used for Roslyn code gen and also potentially for the legacy CodeDom based generation as well. How do you see that?

I'm very bad in reviewing things that does not yet have a real usage. 😎 Would it be possible to add some simplistic tool (maybe a basic linter or whatever that is the easiest) that could demonstrate the use of the new parser? It does not have to be something we release. That would give probably a few good input for the design (at least for me).

Feb 11 '25 07:02 gasparnagy

To get a design overview of the change, do I see it correctly, that this change basically uses the Gherkin parser, but provides an alternative AST builder and token scanner/matcher that fits to the purpose, correct?

That's right: there's an implementation of IAstBuilder which composes a tree using a handler class for each rule type.

Also if I understand correctly this is now an alternative parser that is currently not being used yet, but would be used for Roslyn code gen and also potentially for the legacy CodeDom based generation as well. How do you see that?

It's intended to be used for Roslyn-based generators, but could be used in any generation space. It could also be very valuable in all IDE plug-ins, where performing code refactoring would be possible by modifying the syntax tree and writing the result back to the source file: because we preserve whitespace, changes respect the source document format and can be performed with a surgical precision.

I'm very bad in reviewing things that does not yet have a real usage. 😎 Would it be possible to add some simplistic tool (maybe a basic linter or whatever that is the easiest) that could demonstrate the use of the new parser? It does not have to be something we release. That would give probably a few good input for the design (at least for me).

Naturally, I could create a Roslyn-based generator that consumes the parsed result, but that only reads the tree and projects into another format. If you want something that performs more complex analysis or modification, I could imagine something up or would be happy to have some concrete tasks.

Feb 11 '25 10:02 Code-Grump

Naturally, I could create a Roslyn-based generator that consumes the parsed result, but that only reads the tree and projects into another format. If you want something that performs more complex analysis or modification, I could imagine something up or would be happy to have some concrete tasks.

I see. The goal would be to get an impression on how easy it is to work and manipulate the AST. But this sample usage is only throw-away code, so we should not put much effort into that. Maybe we could make one that transforms the feature file to a markdown document. It doesn't have to be wrapped to a Roslyn-based generator, but you have mentioned "incremental parsing" and I have no idea how we could verify that (I mean whether the design really supports incremental parsing).

But just guessing here. We can even drop this "sample usage" idea if it is too overwhelming.

Feb 11 '25 10:02 gasparnagy

I've dropped in a single example into the associated test project which demonstrates a hypothetical analysis: detecting a lack of a blank line between the feature headline and its associated description. It doesn't go into detail of how you could provide a "fix" for this error, but one could be provided if useful.

Feb 12 '25 01:02 Code-Grump

Regarding incremental parsing support: there's two things the design of this model does to support this future desire.

Firstly, the raw nodes do not have any concept of "ownership"; they can be aggregated by multiple parent nodes to appear in multiple valid syntax trees simultaneously. Combined with their immutability, this makes them safe to be re-used between parses.

Secondly, equality checks in the public syntax tree are made fast by doing reference-based checks on the raw nodes. This means that if we were to re-use nodes between incremental parses, a consumer is able to spot just the new nodes very cheaply. This is the primary benefit of incremental parsing for the consumer: an ability to detect changes and only process a subset of the total tree.

Additionally, because the tree includes all input text, it makes it trivial to understand which portion of the tree is directly affected by a change in source text: we just get the span that has been modified and find what nodes that span crosses; all nodes touched have to be parsed again and untouched nodes are safe to re-use.

Feb 12 '25 11:02 Code-Grump

Regarding incremental parsing support: there's two things the design of this model does to support this future desire.

Maybe it's not relevant, but how is this related to line positions? What happens if an extra line inserted to the top of the file (so everything is shifted)? Will everything else below become "invalid"?

Feb 12 '25 11:02 gasparnagy

Regarding incremental parsing support: there's two things the design of this model does to support this future desire.

Maybe it's not relevant, but how is this related to line positions? What happens if an extra line inserted to the top of the file (so everything is shifted)? Will everything else below become "invalid"?

Great question! This scenario invalidates all touched positions, but explicit positions are only in the public model: they are calculated from the width of raw tokens and trivia.

This means we can look at a change like this, process a content insertion, and re-use the raw nodes which have been "pushed around". This model can re-use individual tokens or whole subtrees whilst accepting surrounding content changes.

Feb 12 '25 14:02 Code-Grump

If possible, I'd like to know if there's any feedback on the overall design at this stage.

My next step is to implement diagnosis support and then I want to code-generate the syntax class (since the code is all boilerplate,) so it'd be great to understand if there's anything to correct or present alternatives for.

Feb 16 '25 23:02 Code-Grump

Thanks again for the quality feedback @gasparnagy - getting direction is what I need to move this forward.

Regarding the terminology you've inquired into, let me see if I can clarify:

Trivia - Anything that's not syntax is trivia. It's an attempt to separate the things that are the syntax structure "meat" of a code file from the noise around it. Comments and whitespace are the two biggest examples. Trivia is still considered important to the parser (because the presence or absence of whitespace changes the text and therefore the syntax) but it's considered less-important than the syntax itself.
Structured Trivia - Sometimes trivia is not just plain text, but has some structure with meaning. In C# something like a #pragma directive would be treated as structured trivia. For Gherkin, the language comment definitely counts. It also has a secondary purpose to contain skipped nodes: nodes (and their associated structure) that could not be interpreted by the parser. We must still include them (as all input text must appear in the output), but cannot include them as valid syntax.
Syntax Trivia - A more formal name for trivia.
Structured Trivia Syntax - The special syntax node which can be embedded as trivia.
Syntax Node vs Raw Node - SyntaxNode is the public face of structured nodes in the tree. All nodes which contain children (either more structured nodes or tokens) inherit from this class. RawNode is a type which represents a point in the internal tree structure. All internal tree points inherit from RawNode.
Syntax vs Internal.*Syntax - Each pair represents the public face and the internal implementation of some kind of structure in the tree. I struggled with naming the whole Internal namespace and would welcome alternative names.
*Syntax - As you said, these are following the naming conventions from the Roslyn tree. I'm not sold on them as being the "correct" way to do this, but there is a nice simplicity in seeing a type called FeatureDeclarationSyntax and being able to infer "this must represent the syntax of declaring a feature." Roslyn's error may be that SyntaxNode is incorrectly named, as it does not represent the base for all nodes in the tree, only those which are structured nodes. We are not required to follow this precedent and it's merely done here to help draw parallels.
Syntax Token - The type SyntaxToken is meant to represent an atomic span of text that is part of the core syntax. It wraps a RawNode like every other point in the tree, but it's a structure for performance reasons and that excludes it from inheriting from SyntaxNode.
Raw Syntax Token - The implementation of a syntax token in the internal tree.
Slot - This one is completely borrowed from the Roslyn implementation. In structured elements (that's structured syntax and lists), there needs to be common mechanism to gain access to the child items that "slot" into the spaces provided by those structures. It's naming is intended to make clear that if you ask for the contents of a slot, you will get a null value to indicate the slot is empty. This differs from asking for "children" which will only give you the children that actually exist. Additionally, most structures have a fixed number of slots determined by the type of structure, rather than being an expanding collection.
leading / trailing trivia - This one seems arbitrary, but makes sense the more you think about it. When parsing the source text, leading trivia is all the whitespace, comments and other trivia elements that exist before a token. Once the token is reached, that buffered trivia is attached to that token. The exception to this rule is that whitespace at the end of a line is appended as trailing trivia to the last token on that line. The intended effect is that if you were to remove the syntax from the tree, you would conveniently get all the line-breaks and comments that lead up to the syntax which typically exist to support the syntax, as well as the end-of-line that happens to terminate the syntax. Additionally, trivia is only attached to tokens, however you can ask any node for its leading and trailing trivia and it will get the trivia attached to the first and last token it contains in document order.

I hope this helps clarify and generates further feedback. 😁

Feb 20 '25 16:02 Code-Grump

I hope this helps clarify and generates further feedback.

That is definitely helpful. 🙏 I have just noted that you keep using the term "syntax" even in normal text for something that only makes sense in the Roslyn bubble, and I continuously need to transcribe. 😵‍💫 So for example when you say "Anything that's not syntax is trivia." that is in a "normal" language does not make sense, because trivia (whitespace, comments, etc) are do part of the syntax of a language! But this just shows how bad idea was from MS to shorten "normal language syntax node" to "syntax". We'll get used to it. (The normal language syntax term is taken from https://github.com/Microsoft/TypeScript/issues/1678, where they finally describe trivia as "Syntax trivia represent the parts of the source text that are largely insignificant for normal understanding of the code")

For naming the Internal/Not Internal "*Syntax" classes, my suggestion would be to use *InternalSyntax for the internal ones, like FeatueInternalSyntax. Not nice, but at least avoids always checking the proper namespaces.

Feb 21 '25 10:02 gasparnagy