html-agility-pack

Page not parsed like in browsers

Open ivanlabsii opened this issue 6 years ago • 4 comments

This page http://www.openculture.com/2018/06/break-beats-bars-raps-greats.html contains an obvious error: the first html tag is not closed with ">". However, browsers interpret this properly and add the missing character, while Html Agility Pack closes the whole html tag at that point. It would be great if you could emulate the browser behavior on this and similar pages.

ivanlabsii, Jun 18 '18 12:06

Hello @ivanicin,

HAP already emulates some browser behavior but is still not fully HTML5 compliant.

When we develop v2.x, we will make sure it is HTML5 compliant so that it fully emulates browser behavior.

There is still no target date for v2.x.

Let me know if that answers your question.

Best Regards,

Jonathan

JonathanMagnan, Jun 18 '18 13:06

If this means that this will be fixed as a part of 2.x, then yes.

However, I am not sure that is the case. The problem is that the HTML code is not valid, and your workaround for repairing it is not as good as the ones applied in browsers. Obviously, I am aware that you have far fewer resources, but if you can make some improvements it would be great.

ivanlabsii, Jun 18 '18 17:06

Hi @ivanicin,

I can see the original HTML is not closed with ">", but what is the problem anyway? I tested it, and I can parse the title and the author name just like I normally do. And regarding the point about

emulate the browser behavior

Standardizing the input HTML can lead to unexpected outcomes; this will mess up your code a lot, trust me.

[Screenshot: capture]

ghost, Jun 18 '18 23:06

@raiytu4

This code is parsed into nothing but an empty document node without any child nodes.

Standardizing the input HTML can lead to unexpected outcomes; this will mess up your code a lot, trust me.

Yes, it is hard, but that's what browsers have been doing since the 90s. As a result, most websites aren't fully HTML compliant, so you have to do the same if you want to build an HTML parser; otherwise it simply won't work on most content.

ivanlabsii, Jun 19 '18 09:06
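
Until the parser itself handles this case, one possible stopgap is to repair the unterminated opening html tag before handing the markup to the parser. The following is only a minimal sketch of that idea in Python, not something proposed in the thread and not how browsers recover internally. The helper name `close_unterminated_html_tag` and the sample markup are hypothetical, since the thread does not quote the page's actual source, and the pattern assumes that no attribute value of the html tag contains "<" or ">".

```python
import re

# Hypothetical pattern: an opening <html ...> tag whose attributes run straight
# into the next "<" without a closing ">" in between.
# Assumption: attribute values of the html tag contain neither "<" nor ">".
_UNCLOSED_HTML_TAG = re.compile(r"<html\b[^<>]*(?=<)", re.IGNORECASE)


def close_unterminated_html_tag(markup: str) -> str:
    """Insert the missing ">" after an unterminated opening <html tag.

    Only the first occurrence is repaired; markup whose html tag is already
    closed is returned unchanged.
    """
    def repair(match: re.Match) -> str:
        # Keep the tag text, drop trailing whitespace, append the missing ">".
        return match.group(0).rstrip() + ">\n"

    return _UNCLOSED_HTML_TAG.sub(repair, markup, count=1)


if __name__ == "__main__":
    # Made-up sample that imitates the kind of error described in the issue.
    broken = '<html lang="en"\n<head><title>T</title></head>'
    print(close_unterminated_html_tag(broken))
```

The repaired string can then be passed to the parser as usual (with Html Agility Pack, for example, via `HtmlDocument.LoadHtml`). A repair this narrow only covers the single error reported here; other malformed pages would need their own handling, which is the trade-off debated above.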