tree-sitter-html Leading whitespace is lost when it should be part of raw text.

When parsing an element with raw text content (e.g., a script element), leading whitespace ~~or anything else matching extras~~ is not captured as part of the raw text.

For example, an element such as

<script>   function foo() { return 1; }     </script>

results in a (sub)tree such as

  script_element [0, 0] - [0, 53]
    start_tag [0, 0] - [0, 8]
      tag_name [0, 1] - [0, 7]
    raw_text [0, 11] - [0, 44]
    end_tag [0, 44] - [0, 53]
      tag_name [0, 46] - [0, 52]

where the raw text doesn't contain the leading whitespace (i.e., it should be raw_text [0, 8] - [0, 44], starting right after the start tag). This seems contrary to the HTML spec, under which every character in the script data state, whitespace included, is emitted as a character token.
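To make the expected span concrete, here is a minimal sketch of what the script data state does with the example above. It is a simplification, not the spec's full algorithm: it assumes a lowercase `</script` end tag and ignores the escaped-script states. The names `CharacterRun` and `script_data_characters` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplified sketch of the "script data" tokenizer state: every character,
// whitespace included, becomes a character token until the end tag is seen.
// Assumptions: lowercase "</script" only, no escaped-script handling.
struct CharacterRun {
  std::size_t start;  // column of the first character token
  std::size_t end;    // column just past the last character token
  std::string text;   // the emitted characters
};

CharacterRun script_data_characters(const std::string &line,
                                    std::size_t after_start_tag) {
  const std::string end_tag = "</script";
  CharacterRun run{after_start_tag, after_start_tag, ""};
  std::size_t pos = after_start_tag;
  while (pos < line.size() &&
         line.compare(pos, end_tag.size(), end_tag) != 0) {
    run.text.push_back(line[pos]);  // emitted as-is; nothing is skipped
    ++pos;
  }
  run.end = pos;
  return run;
}

// Usage on the element from this report; the start tag ends at column 8.
void check_expected_span() {
  const std::string line =
      "<script>   function foo() { return 1; }     </script>";
  CharacterRun run = script_data_characters(line, 8);
  assert(run.start == 8);   // leading whitespace is part of the run
  assert(run.end == 44);    // trailing whitespace is kept as well
}
```

Under these assumptions the character run covers columns 8 to 44, which is the span I would expect raw_text to have.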

See

https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-incdata
https://html.spec.whatwg.org/multipage/parsing.html#script-data-state

Jun 08 '22 21:06 kopecs

I think you can have a rule that references a rule declared as an extra. Tree-sitter is fine with it but it won't work with ocaml-tree-sitter, which doesn't have a way to know if a CST node of that kind is an extra or not. What I think should work is to create a separate rule, e.g. important_whitespace: $ => /\s+/, distinct from the one declared in extras: [ ... ].

In the specific case of HTML parsing, I couldn't tell if or when it's reasonable to treat leading and trailing whitespace as significant.

Jun 09 '22 23:06 mjambon

In this case it would seem to me it ought to be part of the raw_text span, instead of some additional rule to match whitespace. In general it doesn't matter, but I believe it ought to be kept in the case of a raw text element, or at least a script element.

Jun 09 '22 23:06 kopecs

@kopecs It may be an issue with the lexer rather than the grammar. I don't know, or I forgot, how the lexer decides between the whitespace extra and raw_text. It chooses the former but we want the latter. The documentation is here. I don't know what precedence is associated with an extra or how to override it. My guess is that's what we want to do here.

I think this is a legitimate question that should be documented or at least answered in a broader context. I suggest asking on tree-sitter Discussions where @maxbrunsfeld or another knowledgeable person could answer.

Jun 10 '22 02:06 mjambon

I was misleading myself somewhat by looking directly at scan_raw_text; apparently external scanners get called as soon as possible (https://github.com/tree-sitter/tree-sitter/discussions/1771#discussioncomment-2926435).

Looking at the code for the external rules it looks like https://github.com/tree-sitter/tree-sitter-html/blob/29f53d8f4f2335e61bf6418ab8958dac3282077a/src/scanner.cc#L231-L241 is the problem. I think the first while loop should just be moved after the raw text case.
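A rough sketch of the reordering I have in mind. This is not the actual scanner code: `MockLexer`, `Span`, `scan_skip_first` and `scan_raw_text_first` are hypothetical stand-ins, the mock is not the real TSLexer API, and the end-tag match is simplified to a lowercase `</script`.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the lexer; NOT the real TSLexer API.
struct MockLexer {
  std::string input;
  std::size_t pos = 0;          // current read position
  std::size_t token_start = 0;  // where the current token begins

  bool eof() const { return pos >= input.size(); }
  char lookahead() const { return eof() ? '\0' : input[pos]; }
  // skip == true excludes the consumed character from the token.
  void advance(bool skip) {
    ++pos;
    if (skip) token_start = pos;
  }
};

struct Span {
  std::size_t start;
  std::size_t end;
};

// Consume everything up to (not including) the closing delimiter.
Span scan_raw_text(MockLexer &lexer, const std::string &end_delimiter) {
  while (!lexer.eof() && lexer.input.compare(lexer.pos, end_delimiter.size(),
                                             end_delimiter) != 0) {
    lexer.advance(false);
  }
  return {lexer.token_start, lexer.pos};
}

void skip_whitespace(MockLexer &lexer) {
  while (std::isspace(static_cast<unsigned char>(lexer.lookahead()))) {
    lexer.advance(true);
  }
}

// Order resembling the reported behaviour: whitespace is skipped first.
Span scan_skip_first(MockLexer lexer, bool raw_text_valid) {
  skip_whitespace(lexer);
  if (raw_text_valid) return scan_raw_text(lexer, "</script");
  return {lexer.token_start, lexer.pos};
}

// Proposed order: handle the raw text case before skipping whitespace.
Span scan_raw_text_first(MockLexer lexer, bool raw_text_valid) {
  if (raw_text_valid) return scan_raw_text(lexer, "</script");
  skip_whitespace(lexer);
  return {lexer.token_start, lexer.pos};
}

// Usage on the element from the report, positioned just after "<script>".
void check_scan_order() {
  MockLexer lexer;
  lexer.input = "<script>   function foo() { return 1; }     </script>";
  lexer.pos = 8;
  lexer.token_start = 8;
  Span skipped = scan_skip_first(lexer, true);
  Span kept = scan_raw_text_first(lexer, true);
  assert(skipped.start == 11 && skipped.end == 44);  // whitespace lost
  assert(kept.start == 8 && kept.end == 44);         // whitespace kept
}
```

With the whitespace loop first, the mock reproduces the [0, 11] - [0, 44] span from the report; with the raw text case first, the token starts at column 8. Whether the real scanner needs extra care for the other external tokens after the move is something I haven't checked.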

Jun 13 '22 17:06 kopecs
