InnerStartIndex value is wrong in nested elements when sequence is escaped.

Open • cmhernandezdel opened this issue 11 months ago • 1 comment

1. Description

InnerStartIndex value is wrong in nested elements when sequence is escaped.

3. Fiddle or Project

Fiddle that reproduces the issue: https://dotnetfiddle.net/zi7dBK

If you change line 14 of the Fiddle to this:

var html = """<p class="test"><span class="text">Hola, soy Carlos. <br> Encantado <a href="#popup">de ayudarte</a>.</span></p>""";

It works.

4. Any further technical details

  • HAP version: 1.11.50.
  • .NET version: net6.0.

cmhernandezdel avatar Jul 24 '23 11:07 cmhernandezdel

Well, it's entirely unclear what InnerStartIndex (and OuterStartIndex, for that matter) is supposed to denote.

Note that the documentation comments for HtmlNode.InnerStartIndex and HtmlNode.OuterStartIndex, at

https://github.com/zzzprojects/html-agility-pack/blob/2e98e144d8a89c0c0a8a8482fb6c4ee7bfaf0ec8/src/HtmlAgilityPack.Shared/HtmlNode.cs#L561-L564

and

https://github.com/zzzprojects/html-agility-pack/blob/2e98e144d8a89c0c0a8a8482fb6c4ee7bfaf0ec8/src/HtmlAgilityPack.Shared/HtmlNode.cs#L569-L572

both say "Gets the stream position". What does that mean, precisely? The word "stream" in particular fits the source of the parsed HTML document better than it fits the parsed document itself. Then again, there are no streams involved in your example, since the source is a string. But look at the HtmlDocument.LoadHtml(string) implementation and you will notice that it creates a StringReader for the source string, which gives me the impression that "stream" in this context is meant to be "source"...

But it gets more complicated and a bit messy. Note the public HtmlDocument.Text field (which has no meaningful documentation comments). It seems to provide the original, unparsed source text rather than a representation of the parsed document, which is what HtmlNode.OuterHtml provides. See here for an illustrative example: https://dotnetfiddle.net/N9pNLp

InnerStartIndex and OuterStartIndex seem to correspond with the string in HtmlDocument.Text.
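One way to check that correspondence is to print what HtmlDocument.Text contains at each index. This is only a rough sketch: the input string and the Peek helper are made up for illustration (this is not the markup from the Fiddle), and it assumes that both indices are plain character offsets into HtmlDocument.Text, which is exactly the guess being tested.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical input, not the markup from the issue's Fiddle.
        var html = "<p class='test'><span class='text'>Hola, soy Carlos.</span></p>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var span = doc.DocumentNode.SelectSingleNode("//span");

        Console.WriteLine($"OuterStartIndex = {span.OuterStartIndex}");
        Console.WriteLine($"InnerStartIndex = {span.InnerStartIndex}");

        // Assumption under test: both values are offsets into doc.Text (the source text).
        Console.WriteLine(Peek(doc.Text, span.OuterStartIndex));
        Console.WriteLine(Peek(doc.Text, span.InnerStartIndex));
    }

    // Hypothetical helper: returns up to 20 characters of text starting at index.
    static string Peek(string text, int index)
    {
        if (index < 0 || index >= text.Length)
        {
            return "<index out of range>";
        }
        return text.Substring(index, Math.Min(20, text.Length - index));
    }
}
```

If the guess holds, the first excerpt should begin at the opening tag of the span and the second at its content. Running the same check against the markup from the issue would show where the reported value drifts.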

And then there is also the HtmlDocument.ParsedText property:

https://github.com/zzzprojects/html-agility-pack/blob/2e98e144d8a89c0c0a8a8482fb6c4ee7bfaf0ec8/src/HtmlAgilityPack.Shared/HtmlDocument.cs#L263-L268

It claims to provide the parsed text, but in reality it is just a proxy property for HtmlDocument.Text, which seems to provide the unparsed source text. I have no idea what is going on with the HtmlDocument.Text field and the HtmlDocument.ParsedText property, or how they are supposed to behave, but something isn't right with these two.
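For anyone who wants to compare the three members side by side, here is a small sketch. The malformed input is a made-up example chosen so that the source text and the parsed representation are likely to differ; what it actually prints depends on the HAP version in use, so no expected output is given.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical malformed input: unclosed tags, so the parsed
        // representation may differ from the source text.
        var html = "<p>unclosed <b>bold";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        Console.WriteLine("Text:       " + doc.Text);
        Console.WriteLine("ParsedText: " + doc.ParsedText);
        Console.WriteLine("OuterHtml:  " + doc.DocumentNode.OuterHtml);

        // If ParsedText is merely a proxy for Text, both should be the same string instance.
        Console.WriteLine("Same instance: " + ReferenceEquals(doc.Text, doc.ParsedText));
    }
}
```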

elgonzo avatar Jul 24 '23 13:07 elgonzo