html-agility-pack
Page not parsed like in browsers
This page, http://www.openculture.com/2018/06/break-beats-bars-raps-greats.html, contains an obvious error: the first html tag is not closed with ">". Browsers nevertheless interpret this properly and add the missing character, while Html Agility Pack closes the whole html tag at that point. It would be great if you could emulate the browser behavior on this and similar pages.
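To illustrate the kind of repair being asked for, here is a minimal Python sketch that inserts the missing ">" into a start tag that runs straight into the next "<". The helper name `close_unterminated_tags` and the sample markup are hypothetical. This is neither what Html Agility Pack does nor what browsers do internally, only a pre-processing idea under the assumption that no attribute value, script, or text legitimately contains "<" followed by a letter.

```python
import re

# Hypothetical helper for illustration only; not part of Html Agility Pack.
# Matches "<name ..." that reaches the next "<" without ever seeing ">".
_UNCLOSED_START_TAG = re.compile(r"<([A-Za-z][^<>]*?)(\s*)(?=<)")


def close_unterminated_tags(markup: str) -> str:
    """Insert '>' at the end of any start tag that is missing it.

    Trailing whitespace before the next '<' is kept after the inserted '>'.
    """
    return _UNCLOSED_START_TAG.sub(r"<\1>\2", markup)


# Made-up sample resembling the reported problem: the html tag lacks ">".
broken = '<html lang="en"\n<head><title>Demo</title></head><body></body></html>'
repaired = close_unterminated_tags(broken)
print(repaired)
```

The repaired string could then be handed to the parser as usual. A regex like this will misfire on inline scripts such as `a<b && c<d`, so it is best treated as a stopgap rather than a substitute for an HTML5-compliant tokenizer.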
Hello @ivanicin,
HAP already emulates some browser behavior but is still not fully HTML5 compliant.
When we develop v2.x, we will make sure it is HTML5 compliant so that it fully emulates browser behavior.
There is still no target date for v2.x.
Let me know if that answers your question.
Best Regards,
Jonathan
If this means that this will be fixed as a part of 2.x, then yes.
However, I am not sure that is the case. The problem is that the HTML is malformed, and your workaround for repairing it is not as good as the ones applied in browsers. Obviously I am aware that you have far fewer resources, but if you can make some improvements it would be great.
hi @ivanicin
I can see the original html tag is not closed with ">", but what is the problem anyway? I tested it, and I can parse the title and the author name just like I normally do. As for
emulate the browser behavior
Standardizing the input HTML can lead to unexpected outcomes; this will mess up your code a lot, trust me.
@raiytu4
This code is parsed into an empty document node without any child nodes.
Standardizing the input HTML can lead to unexpected outcomes; this will mess up your code a lot, trust me.
Yes, it is hard, but that's what browsers have been doing since the 90s. As a result, most web sites aren't fully HTML compliant, so you have to do it if you want to build an HTML parser; otherwise it simply won't work on most content.