XML Doc comment parsing

Open damieng opened this issue 4 years ago • 4 comments

First attempt at XML doc comment parsing referenced in #18

I'm pretty happy with how this has turned out so far, and it's quite in line with the Roslyn way of parsing the basic types shown in the corpus. Specifically:

  • xml_element is working nicely identifying start, end and gobbling up intermediate lines
  • xml_text consumes the text between the elements or outside them and can span multiple lines
  • xml_empty_element matches Roslyn here in naming/being a separate element
  • xml_cref_attribute matches cref attributes so the identifiers they reference can be highlighted
  • xml_text_attribute covers the other attribute types
  • xml_name used as a wrapper for identifier
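As a rough illustration of how the pieces listed above relate to each other, here is a toy Python splitter for a single doc comment line. It is only a sketch and not the grammar from this PR: the regular expressions, the function name label_pieces, and the labels xml_element_start and xml_element_end are assumptions made up for the example (in the PR, start and end tags are both part of xml_element).

```python
import re

# Hypothetical patterns for illustration only; the PR's real rules may differ.
PIECE = re.compile(
    r"(?P<empty><[A-Za-z_][\w.-]*[^<>]*/>)"   # <see cref="Foo"/>
    r"|(?P<end></[A-Za-z_][\w.-]*\s*>)"       # </summary>
    r"|(?P<start><[A-Za-z_][\w.-]*[^<>]*>)"   # <summary>
    r"|(?P<text>[^<]+)"                       # anything up to the next '<'
)

# Map regex group names to labels loosely modelled on the node names above.
KIND = {
    "empty": "xml_empty_element",
    "start": "xml_element_start",  # assumed label, not a node in the PR
    "end": "xml_element_end",      # assumed label, not a node in the PR
    "text": "xml_text",
}


def label_pieces(line):
    """Split one /// doc comment line into (kind, source_text) pairs."""
    body = line.lstrip()
    if body.startswith("///"):
        body = body[3:]
    return [(KIND[m.lastgroup], m.group()) for m in PIECE.finditer(body)]


if __name__ == "__main__":
    for kind, text in label_pieces('/// <summary>See <see cref="Foo"/>.</summary>'):
        print(f"{kind:20} {text!r}")
```

The empty-element pattern is tried before the start-tag pattern so that a self-closing tag is not mistaken for an opening one.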

I need to go through and consider a few more things before this is ready to review:

Essential

  1. Cannot handle lines that are indented after a doc_comment... somehow xml_text is consuming them :(
  2. The grammar fails to recover from unclosed XML tags, but Roslyn does. This could make for a poor editing experience. Need to figure out a way to recover without losing the xml_element multi-line stuff.
  3. Actually check the Roslyn/grammar spec for the various matching and rules here as it's currently based on observation.
  4. xml_text_literal_token probably needs a much better rule
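To make the unclosed-tag problem in item 2 concrete, here is a small Python sketch of one generic way to detect start tags that never get closed, using a stack. It is not Roslyn's recovery algorithm and not something a tree-sitter grammar can express directly; the regular expression and the function name find_unclosed are assumptions for the example.

```python
import re

# Simplified tag pattern (assumption): name, optional attributes, optional '/'.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^<>]*?(/?)>")


def find_unclosed(xml):
    """Return the names of start tags that never get a matching end tag."""
    stack = []
    unclosed = []
    for m in TAG.finditer(xml):
        closing, name, self_closing = m.group(1), m.group(2), m.group(3)
        if self_closing:
            continue  # <see cref="X"/> opens and closes itself
        if not closing:
            stack.append(name)
        elif name in stack:
            # Everything above the matching start tag was never closed.
            while stack[-1] != name:
                unclosed.append(stack.pop())
            stack.pop()
        # A stray end tag with no matching start tag is ignored in this sketch.
    # Whatever is still open at the end of the input is unclosed too.
    unclosed.extend(reversed(stack))
    return unclosed


if __name__ == "__main__":
    print(find_unclosed("<summary>Text <b>bold</summary>"))
```

A recovering parser would treat each reported name as an element that ends implicitly instead of failing on the whole comment.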

Nice to have

  1. xml_text appears multiple times in sequence where Roslyn emits a single instance. I suspect I need to make the xml_text rule greedier.
  2. Should probably support /** multi-line XML doc comments as they're technically a thing, though I have never actually seen them used.
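For item 1, the difference from Roslyn can be pictured as a merge of adjacent text nodes. The Python sketch below does that merge as a post-processing step over (kind, text) pairs. That is a different approach from making the xml_text rule greedier inside the grammar, and both the pair representation and the name merge_adjacent_text are assumptions for illustration.

```python
from itertools import groupby


def merge_adjacent_text(nodes):
    """Collapse runs of consecutive xml_text nodes into a single node.

    `nodes` is a list of (kind, text) pairs, a stand-in for real syntax nodes.
    """
    merged = []
    for kind, group in groupby(nodes, key=lambda node: node[0]):
        group = list(group)
        if kind == "xml_text":
            merged.append(("xml_text", "".join(text for _, text in group)))
        else:
            merged.extend(group)
    return merged


if __name__ == "__main__":
    nodes = [
        ("xml_text", "first line "),
        ("xml_text", "second line"),
        ("xml_empty_element", '<see cref="Foo"/>'),
        ("xml_text", "."),
    ]
    print(merge_adjacent_text(nodes))
```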

Known limitations

  1. We can't know what type the identifier in the cname element refers to as we have zero context. It could be a parameter name, a class name, etc. We just don't know so this won't be able to match Roslyn for lookup/syntax highlighting etc. without some kind of external symbol matching outside of this grammar.
  2. Gets confused by // non-doc comments inside some /// doc comment blocks, such as those with an xml_element. Funnily enough, so does Visual Studio's syntax visualizer...

damieng avatar Sep 09 '21 22:09 damieng

I know this is a draft, but just keep in mind that technically it is only a doc comment if there are exactly 3 slashes. If there are 4 or more slashes in a row, then it is just a normal single-line comment. The same logic applies to multi-line comments: an opener with exactly 2 stars is a doc comment, while 1 star or 3+ stars make it a normal comment.
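The counting rule can be written down as two regular expressions with negative lookaheads. This is a minimal Python sketch, not the grammar's actual rule; the function name classify_comment is made up for the example, and treating the empty comment /**/ as a normal comment is an assumption that goes beyond what is stated above.

```python
import re

# Exactly three slashes: '///' not followed by a fourth '/'.
_DOC_LINE = re.compile(r"///(?!/)")
# Exactly two stars: '/**' not followed by a third '*'.
# Also excluding '/' so that '/**/' counts as a normal comment (assumption).
_DOC_BLOCK = re.compile(r"/\*\*(?![*/])")


def classify_comment(text):
    """Return 'doc', 'normal', or 'none' (not a comment) for a C# snippet."""
    s = text.lstrip()
    if s.startswith("//"):
        return "doc" if _DOC_LINE.match(s) else "normal"
    if s.startswith("/*"):
        return "doc" if _DOC_BLOCK.match(s) else "normal"
    return "none"


if __name__ == "__main__":
    for sample in ["/// <summary>", "//// four slashes", "/** doc */", "/*** three stars */"]:
        print(f"{sample!r}: {classify_comment(sample)}")
```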

But really cool to see you take a stab at this, I really had no clue how to approach this in the first place.

initram avatar Sep 10 '21 12:09 initram

Thanks, I'll bear that in mind! Yeah, I had taken at least 2 previous stabs at this and got nowhere. I had forgotten enough of the previous attempts to come at it from a fresh angle.

damieng avatar Sep 10 '21 12:09 damieng

I'm kinda new to tree-sitter, but shouldn't the XML doc be treated as an XML injection? Tree-sitter won't be able to parse anything more than the XML anyway, since it has no symbol awareness.

Apart from reducing duplication and giving a simpler implementation, injecting the XML parser would allow tools that work with XML to function here, such as auto-closing tags and so on.

zoriya avatar Mar 30 '22 18:03 zoriya

The XML doc comments aren't really a full XML system; we don't need to support namespaces etc. A simple subset like this PR would work fine - it's literally just 34 lines.

damieng avatar Mar 30 '22 19:03 damieng