tree-sitter-c-sharp
XML Doc comment parsing
First attempt at XML doc comment parsing referenced in #18
I'm pretty happy with how this has turned out so far, and it's quite in line with the Roslyn way of parsing the basic types shown in the corpus. Specifically:
- xml_element is working nicely identifying start, end and gobbling up intermediate lines
- xml_text consumes the text between the elements or outside them and can span multiple lines
- xml_empty_element matches Roslyn here in naming/being a separate element
- xml_cref_attribute matching for identity highlighting crefs.
- xml_text_attribute for the other types
- xml_name used as a wrapper for identifier
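To make the node kinds above concrete, here is a rough, regex-based illustration of how a single doc comment line breaks down. It is only a sketch in plain JavaScript and is not taken from the grammar in this PR: the function name `sketchNodes` and the `start_tag` / `end_tag` labels are hypothetical, and a real xml_element would wrap the start tag, content and end tag rather than listing them flat.

```javascript
// Illustrative only: approximates the node kinds with regular expressions.
// `sketchNodes`, "start_tag" and "end_tag" are made-up names for this sketch.
function sketchNodes(line) {
  // Drop the leading "///" doc comment prefix and one optional space.
  const body = line.replace(/^\s*\/\/\/ ?/, "");
  // A start tag, end tag, or empty element with double-quoted attributes.
  const tag = /<\/?[A-Za-z_][\w.-]*(?:\s+[\w:.-]+\s*=\s*"[^"]*")*\s*\/?>/g;
  const nodes = [];
  let last = 0;
  for (const m of body.matchAll(tag)) {
    // Anything between two tags is treated as text.
    if (m.index > last) nodes.push(["xml_text", body.slice(last, m.index)]);
    const kind = m[0].startsWith("</")
      ? "end_tag"
      : m[0].endsWith("/>")
        ? "xml_empty_element"
        : "start_tag";
    nodes.push([kind, m[0]]);
    last = m.index + m[0].length;
  }
  // Trailing text after the last tag, if any.
  if (last < body.length) nodes.push(["xml_text", body.slice(last)]);
  return nodes;
}

const nodes = sketchNodes('/// <summary>See <see cref="Foo"/> here</summary>');
console.log(nodes.map(([kind]) => kind).join(" "));
```

In this flat view the `<see cref="Foo"/>` part is the piece that corresponds to xml_empty_element with an xml_cref_attribute, and the surrounding words correspond to xml_text.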
I need to go through and consider a few more things before this is ready to review:
Essential
- Can not handle lines that are indented after a doc_comment... somehow xml_text is consuming them :(
- The grammar fails to recover from unclosed xml tags but Roslyn does. This could be a poor editing experience. Need to figure out a way to recover without losing the xml_element multi-line stuff.
- Actually check the Roslyn/grammar spec for the various matching and rules here as it's currently based on observation.
- xml_text_literal_token probably needs a much better rule
Nice to have
- xml_text appears multiple times in sequence where Roslyn emits a single instance. I suspect I need to make the xml_text rule greedier.
- Should probably support /** multi-line XML doc comments as they're technically a thing... I have never seen them actually used, though.
Known limitations
- We can't know what type the identifier in the cname element refers to as we have zero context. It could be a parameter name, a class name, etc. We just don't know so this won't be able to match Roslyn for lookup/syntax highlighting etc. without some kind of external symbol matching outside of this grammar.
- Will get confused with // non-doc comments inside some /// doc comment blocks such as those with an xml_element - funnily enough so does Visual Studio's syntax visualizer...
I know this is a draft, but keep in mind that technically it is only a doc comment if there are exactly 3 slashes. If there are 4 or more slashes in a row, it is just a normal single-line comment. The same logic applies to multi-line comments: exactly 2 stars make a doc comment, while 1 or 3+ stars make a normal comment.
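As a rough illustration of that rule, here is a minimal JavaScript sketch that classifies a raw comment by its opening delimiter. `classifyComment` is a hypothetical helper, not something from the grammar. The one detail beyond the rule as stated is an assumption of the sketch: `/**/` is treated as an empty regular comment, because there the second star belongs to the closing delimiter.

```javascript
// Hypothetical helper, not part of the grammar: classifies a raw C# comment
// by its opening delimiter only.
function classifyComment(text) {
  // Exactly three slashes: the character after "///" must not be another slash.
  if (/^\/\/\/(?!\/)/.test(text)) return "doc_comment";
  // Two slashes, or four and more, are a regular single-line comment.
  if (/^\/\//.test(text)) return "comment";
  // Exactly two stars: the character after "/**" must not be a third star.
  // A slash is excluded too, so that "/**/" counts as an empty regular comment.
  if (/^\/\*\*(?![*\/])/.test(text)) return "doc_comment";
  // One star, or three and more, are a regular multi-line comment.
  if (/^\/\*/.test(text)) return "comment";
  // Not a comment at all.
  return null;
}

for (const sample of ["/// <summary>", "//// four slashes", "/** doc */", "/*** three stars */"]) {
  console.log(sample, "=>", classifyComment(sample));
}
```

In a grammar the same distinction would more likely be expressed in the token rules themselves; the negative lookaheads here are just the shortest way to show the "exactly" part of the rule.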
But really cool to see you take a stab at this, I really had no clue how to approach this in the first place.
Thanks, I'll bear that in mind! Yeah, I had taken at least 2 previous stabs at this and got nowhere. I had forgotten enough of the previous attempts to come at it from a fresh angle.
I'm kinda new to tree-sitter, but shouldn't the XML doc be handled as an XML injection? Tree-sitter won't be able to parse more than the XML anyway, since it has no symbol awareness.
Apart from reducing duplication and having a simpler implementation, injecting the XML parser would allow tools that do things with XML to work, like auto-closing tags and so on.
The XML doc comments aren't really a full XML system, we don't need to support namespaces etc. A simple subset like this PR would work fine - it's literally just 34 lines.