html-agility-pack
InnerStartIndex value is wrong in nested elements when sequence is escaped.
1. Description
InnerStartIndex value is wrong in nested elements when sequence is escaped.
3. Fiddle or Project
A Fiddle that reproduces the issue: https://dotnetfiddle.net/zi7dBK
If you change line 14 to this:
var html = """<p class="test"><span class="text">Hola, soy Carlos. <br> Encantado <a href="#popup">de ayudarte</a>.</span></p>""";
It works.
4. Any further technical details
- HAP version: 1.11.50.
- NET version: net6.0.
Well, it's entirely unclear what InnerStartIndex, and OuterStartIndex for that matter, are supposed to denote.
Note that the documentation comments for HtmlNode.InnerStartIndex and HtmlNode.OuterStartIndex
https://github.com/zzzprojects/html-agility-pack/blob/2e98e144d8a89c0c0a8a8482fb6c4ee7bfaf0ec8/src/HtmlAgilityPack.Shared/HtmlNode.cs#L561-L564
and
https://github.com/zzzprojects/html-agility-pack/blob/2e98e144d8a89c0c0a8a8482fb6c4ee7bfaf0ec8/src/HtmlAgilityPack.Shared/HtmlNode.cs#L569-L572
say "Gets the stream position". What precisely does that mean? The wording, especially "stream", does not fit "parsed HTML document" well; it fits the source of the parsed HTML document better. Then again, no streams are involved in your example, since the source itself is a string. But look at the HtmlDocument.LoadHtml(string) implementation and you will notice a StringReader being created for the source string, which gives me the impression that "stream" in this context is meant to be "source".
But it gets more complicated and a bit messy. Note the public HtmlDocument.Text field (without any meaningful documentation comments), which seems to provide the original unparsed source text and not a representation of the parsed document as HtmlNode.OuterHtml does. See here for an illustrative example: https://dotnetfiddle.net/N9pNLp
InnerStartIndex and OuterStartIndex seem to correspond with the string in HtmlDocument.Text.
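One way to check that impression is to slice HtmlDocument.Text at the two indexes and see where the substrings begin. The following is only a rough sketch: the markup is a made-up sample (not the string from the fiddle), and it assumes the indexes are plain character offsets into the source text, which is exactly the thing being tested here.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical sample markup, not the snippet from the fiddle.
var html = "<p class=\"test\"><span class=\"text\">Hello <a href=\"#popup\">world</a>.</span></p>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

var span = doc.DocumentNode.SelectSingleNode("//span");

// Assumption under test: both indexes are offsets into doc.Text.
// If so, the first substring should begin at the span's opening tag
// and the second at the span's first child.
Console.WriteLine(doc.Text.Substring(span.OuterStartIndex));
Console.WriteLine(doc.Text.Substring(span.InnerStartIndex));
```

If the second line does not start at the span's content for a given input, that would point at the index calculation rather than at the interpretation of the property.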
And then there is also the HtmlDocument.ParsedText property:
https://github.com/zzzprojects/html-agility-pack/blob/2e98e144d8a89c0c0a8a8482fb6c4ee7bfaf0ec8/src/HtmlAgilityPack.Shared/HtmlDocument.cs#L263-L268
which claims to provide the parsed text, but in reality is just a proxy property for HtmlDocument.Text, which seems to provide the unparsed source text. I have no idea what is up with the HtmlDocument.Text field and the HtmlDocument.ParsedText property or how they are supposed to behave, but something isn't right with these two.
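For anyone who wants to see the difference for themselves, here is a small sketch that compares the source string, HtmlDocument.Text, HtmlDocument.ParsedText and the serialized document. The input string is a hypothetical example with unclosed tags, chosen only because the parser is likely to serialize it differently from the source; I make no claim about the exact output.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical input with unclosed tags.
var source = "<p>unclosed <b>markup";

var doc = new HtmlDocument();
doc.LoadHtml(source);

// Does Text hold the source exactly as it was passed in?
Console.WriteLine(doc.Text == source);

// Does ParsedText differ from Text at all?
Console.WriteLine(doc.ParsedText == doc.Text);

// The parsed document as serialized by the node tree, for comparison.
Console.WriteLine(doc.DocumentNode.OuterHtml);
```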