tree-sitter-html
Leading whitespace is lost when it should be part of raw text.
When parsing an element with raw text content (e.g., a script element), leading whitespace ~~or anything else matching extras~~ is not captured as part of the raw text.
For example, an element such as
<script>   function foo() { return 1; }     </script>
results in a (sub)tree such as
script_element [0, 0] - [0, 53]
  start_tag [0, 0] - [0, 8]
    tag_name [0, 1] - [0, 7]
  raw_text [0, 11] - [0, 44]
  end_tag [0, 44] - [0, 53]
    tag_name [0, 46] - [0, 52]
where the raw text doesn't contain the leading whitespace (i.e., it should be raw_text [0, 8] - [0, 44], starting where start_tag ends). This seems contrary to the HTML spec, under which those whitespace characters should be emitted as character tokens.
See
- https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-incdata
- https://html.spec.whatwg.org/multipage/parsing.html#script-data-state
I think you can have a rule that references a rule declared as an extra. Tree-sitter is fine with it, but it won't work with ocaml-tree-sitter, which has no way to know whether a CST node of that kind is an extra or not. What I think should work is to create a separate rule, e.g. important_whitespace: $ => /\s+/, distinct from the one declared in extras: [ ... ].
In the specific case of HTML parsing, I couldn't tell if or when it's reasonable to treat leading and trailing whitespace as significant.
In this case it seems to me that the whitespace ought to be part of the raw_text span, rather than being matched by some additional whitespace rule. In general it doesn't matter, but I believe it ought to be kept in the case of a raw text element, or at least a script element.
@kopecs It may be an issue with the lexer rather than the grammar. I don't know, or I forgot, how the lexer decides between the whitespace extra and raw_text. It chooses the former but we want the latter. The documentation is here. I don't know what precedence is associated with an extra or how to override it. My guess is that's what we want to do here.
I think this is a legitimate question that should be documented or at least answered in a broader context. I suggest asking on tree-sitter Discussions where @maxbrunsfeld or another knowledgeable person could answer.
I was misleading myself somewhat by looking directly at scan_raw_text. Apparently external scanners get called as soon as possible (https://github.com/tree-sitter/tree-sitter/discussions/1771#discussioncomment-2926435).
Looking at the code for the external rules, it looks like https://github.com/tree-sitter/tree-sitter-html/blob/29f53d8f4f2335e61bf6418ab8958dac3282077a/src/scanner.cc#L231-L241 is the problem. I think the first while loop should just be moved after the raw text case.
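To illustrate the reordering suggested here, a minimal sketch with a mock lexer follows. It is not the code from scanner.cc: MockLexer, Span, Order, skip_whitespace, and the simplified scan_raw_text are hypothetical names introduced for this example, and the end delimiter is matched case-sensitively for brevity. It also assumes the sample element has three spaces after the start tag and five before the end tag, as the reported positions imply.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for the lexer handed to an external scanner.
// It tracks a cursor and the position where the current token begins.
struct MockLexer {
  std::string input;
  std::size_t pos;
  std::size_t token_start;

  MockLexer(std::string text, std::size_t start)
      : input(std::move(text)), pos(start), token_start(start) {}

  char lookahead() const { return pos < input.size() ? input[pos] : '\0'; }

  // skip == true excludes the character from the token (skipped whitespace).
  void advance(bool skip) {
    ++pos;
    if (skip) token_start = pos;
  }
};

struct Span {
  std::size_t start;
  std::size_t end;
};

enum class Order { SkipWhitespaceFirst, RawTextFirst };

void skip_whitespace(MockLexer &lexer) {
  while (std::isspace(static_cast<unsigned char>(lexer.lookahead()))) {
    lexer.advance(true);
  }
}

// Simplified: consume everything up to (not including) the end delimiter.
Span scan_raw_text(MockLexer &lexer, const std::string &end_delimiter) {
  assert(!end_delimiter.empty());
  while (lexer.lookahead() != '\0' &&
         lexer.input.compare(lexer.pos, end_delimiter.size(), end_delimiter) != 0) {
    lexer.advance(false);
  }
  return Span{lexer.token_start, lexer.pos};
}

// SkipWhitespaceFirst models the ordering described as the problem;
// RawTextFirst models moving the whitespace loop after the raw text case.
Span scan(MockLexer &lexer, bool raw_text_valid, Order order) {
  if (order == Order::SkipWhitespaceFirst) skip_whitespace(lexer);
  if (raw_text_valid) return scan_raw_text(lexer, "</script");
  if (order == Order::RawTextFirst) skip_whitespace(lexer);
  // Other token kinds are omitted from this sketch.
  return Span{lexer.token_start, lexer.pos};
}
```

Starting the mock lexer at column 8 of the sample element, Order::SkipWhitespaceFirst yields the span 11 to 44, matching the tree reported above, while Order::RawTextFirst yields 8 to 44, which keeps the leading whitespace inside the raw text.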