html-agility-pack icon indicating copy to clipboard operation
html-agility-pack copied to clipboard

Closing tag is missing within <![CDATA

Open bytecode77 opened this issue 1 year ago • 6 comments
trafficstars

1. Description

Closing tag is missing within <![CDATA object.

Background: This issue surfaced when parsing HTML <code> blocks. The HTML content is given by Confluence, where the HTML is parsed from.

3. Fiddle or Project

https://dotnetfiddle.net/XmcOB6

Expected Result

<![CDATA[<a href="foo">
bar
</a>]]>

Actual Result

<![CDATA[<a href="foo">
bar
]]>

4. Any further technical details

  • HAP version: 1.11.67
  • NET version 7.0

bytecode77 avatar Oct 11 '24 10:10 bytecode77

Hello @bytecode77 ,

~~Looking at Firefox and Chrome, I believe the current actual result from HAP is the right one:~~

FireFox:

image

~~The "comment" section end at the first occurrence of the > found. This means the end tag </a> is considered part of the HTML, but an end tag alone cannot really exist, so it gets removed.~~

~~I'm not exactly sure of all the rules that is currently applied, but the current actual result is indeed what I'm expecting.~~

~~Let me know if that explains the reason correctly.~~

Best Regards,

Jon

JonathanMagnan avatar Oct 11 '24 13:10 JonathanMagnan

Thanks for your quick response Jonathan!

For a comment, I think the end tag is not useful to the DOM. However, the code does not represent a comment:

<code><![CDATA[<a href=\"foo\">\r\nbar\r\n</a>]]></code>

Specifically, this code was retrieved by the Confluence API when reading Confluence pages. I'm not sure what the purpose of <![CDATA[ here is, though.

bytecode77 avatar Oct 11 '24 13:10 bytecode77

However, the code does not represent a comment:

Indeed, you are right. I just assumed this, looking at the result, but that's not the case.

I never had to really use the <![CDATA[ as far as I remember, but looking at what I see on internet, it looks more related to be used within a script tag but not exclusively to this.

At this moment, it still looks like the current behavior looks more like the normal behavior unless I'm proving wrong. Again, I'm not familiar with this tag, so I could definitely be wrong.

Best Regards,

Jon

JonathanMagnan avatar Oct 11 '24 16:10 JonathanMagnan

This is the original HTML that is a Confluence page export that I'm parsing.

It does contain a <![CDATA[ within the text-body of a <ac:structured-macro ac:name="code" and it represents a Confluence code box.

Yes, it was used within <script> tags way before javascript was common to not offend non-supporting browsers. But it remains valid HTML that is, indeed, used:

<ac:structured-macro ac:name="code" ac:schema-version="1" ac:macro-id="70aacf91-111a-4b25-8c3c-543aa6fd0af9">
	<ac:plain-text-body>
		<![CDATA[<a href="foo" target="_blank">
  bar
</a>]]>
	</ac:plain-text-body>

bytecode77 avatar Oct 14 '24 10:10 bytecode77

Hello @bytecode77 ,

Thank you for the additional info. I have looked at the HAP code, and the <!CDATA[ tag is not supported. The current behavior is more a combination of "Comment" and "Text" nodes that show the same result as Firefox.

I would not like to change the current default behavior, but I'm open to looking more at it to support it the way you want through an option that you will need to enable.

I should be able to look more at it later this week

Best Regards,

Jon

JonathanMagnan avatar Oct 15 '24 00:10 JonathanMagnan

Thanks for looking into it, Jon!

I'll stay tuned for your updates :)

bytecode77 avatar Oct 15 '24 05:10 bytecode77

Hello @bytecode77 ,

A new version has been released today: https://github.com/zzzprojects/html-agility-pack/releases/tag/v1.11.68

Could you try the new option and let us know if everything is working as expected.

Best Regards,

Jon

JonathanMagnan avatar Oct 23 '24 16:10 JonathanMagnan

Hi Jonathan!

thank's for providing a new version! However, unfortunately the result is the same as before, see the dotnetfiddle that I posted initially.

To provide a more concrete example: When importing HTML from Confluence, this is what the HTML string looks like:

...
<ac:plain-text-body>
<![CDATA[<iframe src="xxxxxx" width="600" height="1000" frameborder="0">
  <a href="xxxxx" target="_blank">
    xxxxx
  </a>

</iframe>]]>
</ac:plain-text-body>
...

It's a code box showing some HTML.

However, the InnerHtml property is missing the closing tag:

<![CDATA[<iframe src="https://www.terminland.de/hno/?mode=frame" width="600" height="1000" frameborder="0">
  <a href="https://www.terminland.de/hno" target="_blank">
    Online-Terminbuchung
  </a>

]]>

bytecode77 avatar Oct 24 '24 11:10 bytecode77

Hello @bytecode77 ,

Fiddle currently still uses the previous version of HAP and not the latest one (don't ask me why!).

You need to turn on the following options to make it works:

HtmlDocument document = new HtmlDocument();
document.OptionThreatCDataBlockAsComment = true;

I tested both your examples (the first one and the one you just posted), and both seem to work very fine with this option.

Best Regards,

Jon

JonathanMagnan avatar Oct 24 '24 13:10 JonathanMagnan

Ah, I see. However, unfortunately the option OptionThreatCDataBlockAsComment is not included in your latest build. I confirmed this by installing 1.11.68 in a completely blank project on another computer to make sure that this isn't a nuget cache issue of some sort.

bytecode77 avatar Oct 24 '24 13:10 bytecode77

Oh god my bad, It looks like I did a bad release!

The v1.11.69 is now available. I double-checked to ensure I had not made the same mistake twice 😆

Best Regards,

Jon

JonathanMagnan avatar Oct 24 '24 15:10 JonathanMagnan

Let me know if this version was working

JonathanMagnan avatar Oct 24 '24 15:10 JonathanMagnan

Works like a charm! Thanks for your support, Jonathan, and keep up the nice work. I really love HtmlAgilityPack and I've been using it for many years.

Bye!

bytecode77 avatar Oct 25 '24 06:10 bytecode77

@JonathanMagnan Regarding the spelling of OptionThreatCDataBlockAsComment: you probably meant treat and not threat ;-)

wo80 avatar Oct 25 '24 09:10 wo80

You're right, wo80, didn't notice :D

@JonathanMagnan I think if you change the typo and break upward compatibility, there is only one person (me) who would be affected, but that's no problem. When I update some time in the future I will just change the wording on the calling site, too.

bytecode77 avatar Oct 25 '24 09:10 bytecode77

Do'h! You are 100% right @wo80

~It will be changed the next time we release HAP, which will probably happen next month.~

JonathanMagnan avatar Oct 25 '24 13:10 JonathanMagnan

Hello,

The v1.11.70 is now available with the option renamed as proposed by @wo80 ;)

Best Regards,

Jon

JonathanMagnan avatar Oct 25 '24 13:10 JonathanMagnan