
Original HTML contains carriage return line feeds within attributes that are seemingly not preserved in HtmlNode.OuterHtml

Open rfreas opened this issue 7 years ago • 8 comments

It seems that carriage return line feeds (CRLFs) located at certain places within the original HTML document (such as within a style attribute) are not preserved after parsing, which causes the HtmlNode.OuterLength property to be different than expected. In this example it is off by two, the length of one CRLF.

Here's an HTML fragment that demonstrates the problem:

<p style= "margin-top:18pt; margin-bottom:0pt; font-size:9pt; font-family:Arial"> <b><a name="toc17062_1" id="toc17062_1"></a>PART I</b></p>

The CRLFs do not display in the snippet above, and it wraps the way it does because it is inside a code block, but the fragment does in fact contain two CRLFs. The first follows the <p style= text and is omitted from the HtmlNode.OuterHtml property. The second follows the Arial"> text and is preserved. You can see a single space after style= and after Arial"> in the fragment; each of those spaces stands in for a CRLF. Since only the first CRLF is dropped, I wonder whether that is because it occurs within an attribute rather than after the end of a tag.

Does anyone know what might be going on with this?

Thanks,

Rob

rfreas avatar Aug 31 '18 21:08 rfreas

@JonathanMagnan Any idea what might be going on with this?

rfreas avatar Sep 10 '18 14:09 rfreas

Hello @rfreas ,

Sorry, we missed this one.

We reproduced your scenario and everything looks fine to us.

string html = @"<p style=
""margin-top:18pt; margin-bottom:0pt; font-size:9pt; font-family: Arial 
""> <b><a name=""toc17062_1"" id=""toc17062_1""></a>PART I</b></p>";

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

var outerHtml = doc.DocumentNode.OuterHtml;

The first CR is omitted because it sits in the attribute-name part of the tag, outside the quoted value. The second CR is not omitted because it is inside the string: anything before the end of the string is included in the attribute value.

Are you getting an error?

Best Regards,

Jonathan

JonathanMagnan avatar Sep 10 '18 20:09 JonathanMagnan

Thanks for the reply @JonathanMagnan!

So the issue is that because the first CR (or CRLF) is omitted, the length of the OuterHtml is inaccurate at a byte level: it doesn't reflect the dropped CR/CRLF. In our application, start and end index positions are important for determining where content begins and ends within the HTML. We compute them with HtmlNode.OuterHtml.Length, so the omitted CR/CRLF throws our indexes off at a byte-count level.

How might we resolve this or work around it?

Rob

rfreas avatar Sep 11 '18 15:09 rfreas

Unfortunately, I'm afraid there is not much we can do.

Even if we succeed in fixing it, since you appear to use start and end index positions outside of our library, you will eventually run into other problems of the same kind, because the parser might add missing tags or characters.

JonathanMagnan avatar Sep 11 '18 17:09 JonathanMagnan

That is most unfortunate. We're walking the HTML DOM, sending text through NLP models, and scoring and categorizing that text. As we find elements of interest, we need to be able to store absolute position indexes to specific pieces of content. With this issue, we can't do that accurately because these bytes are being stripped out of the OuterHtml property, which in turn affects the Length property.

If you have suggestions, I would welcome them and in any case, I appreciate your assistance!

Rob

rfreas avatar Sep 13 '18 15:09 rfreas

I'm not sure exactly why you need to store absolute positions in HAP's output instead of in the original content, but that's certainly not a good long-term solution even without this issue. It forces you to never upgrade your version, since the parser gets fixes from time to time that may change the output.

JonathanMagnan avatar Sep 13 '18 17:09 JonathanMagnan

Consider this simple example:

<div>
    <p style=
"margin-top:0px;margin-bottom:0px"><b>Some text</b></p>
</div>

How do I calculate the StreamPosition of the first byte after the ">" on the closing DIV tag? We are trying to use HtmlNode.StreamPosition + HtmlNode.OuterHtml.Length to calculate it, but it is unreliable because the two-byte CRLF after the equals sign of the style attribute in the P tag is being stripped out by HAP. The CRLF is present in the original HTML content but is removed in HAP's internal objects, so the resulting length value is short by 2.

Thanks @JonathanMagnan.

Rob

rfreas avatar Sep 14 '18 14:09 rfreas
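
For what it's worth, the arithmetic in the example above can be sketched without HAP at all. The snippet below is only an illustration under stated assumptions: `serialized` is a hand-made stand-in for the output described in this thread (the same text with the CRLF after `style=` dropped), not something produced by the parser, and `EndOffsetInSource` is a hypothetical helper, not part of HtmlAgilityPack. It shows why start offset plus serialized length lands two characters short, and how an end offset might instead be measured against the original text, which is the direction suggested earlier in the thread.

```csharp
using System;

public static class OffsetSketch
{
    // Hypothetical helper (not a HAP API): index of the first character after
    // the given closing tag, searching the ORIGINAL source text from startOffset.
    // Naive: it takes the first match, so nested tags of the same name are not handled.
    public static int EndOffsetInSource(string source, int startOffset, string closingTag)
    {
        int close = source.IndexOf(closingTag, startOffset, StringComparison.OrdinalIgnoreCase);
        if (close < 0)
        {
            throw new ArgumentException("Closing tag not found in source.", nameof(closingTag));
        }
        return close + closingTag.Length;
    }

    public static void Main()
    {
        // Original document text, with a CRLF after "style=" as in the example above.
        string source =
            "<div>\r\n    <p style=\r\n\"margin-top:0px;margin-bottom:0px\"><b>Some text</b></p>\r\n</div>";

        // Assumption: stand-in for the serialized form described in this thread,
        // i.e. the same text with the CRLF after "style=" removed.
        string serialized = source.Replace("style=\r\n", "style=");

        // The serialized text is shorter by exactly one CRLF.
        Console.WriteLine(source.Length - serialized.Length); // prints 2

        int start = 0; // offset of "<div>" in the original text

        // start + serialized.Length falls 2 characters short of the true end.
        Console.WriteLine(start + serialized.Length == source.Length); // prints False

        // Measuring against the original text gives the true end offset.
        Console.WriteLine(EndOffsetInSource(source, start, "</div>") == source.Length); // prints True
    }
}
```

Note that `string.Length` counts UTF-16 code units, not bytes. The two only coincide when the file is ASCII-only in a single-byte-compatible encoding, so true byte positions would need an extra conversion step.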

I'm facing this same issue - unreliable OuterLength property makes HAP problematic because it cannot be used to map an element to a range of characters in the input.

gbpcor avatar Apr 26 '24 15:04 gbpcor