parsel
Illegal characters (<, >, &) in HTML cause XPath-extracted values to be empty
Scrapy uses lxml as its parser. Because lxml is an XML parser, characters such as <, >, etc. are invalid and are stripped away by lxml. Nevertheless, many websites use < and > as less-than and greater-than symbols.
I propose to implement a fix that quotes specifically those characters.
Can you assign this issue to me? I will start working on it right away.
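To make the quoting idea concrete, here is a minimal sketch that only uses the standard library. It is not the fix that was eventually discussed for Scrapy or parsel; the name quote_stray_characters and the tag-detection regex are assumptions made for illustration. The heuristic treats anything that looks like a tag, comment, or doctype as markup and escapes <, > and & everywhere else.

```python
import html
import re

# Hypothetical heuristic: anything shaped like a comment, a tag or a doctype
# is kept as markup; everything else is treated as text.
TAG_RE = re.compile(
    r"(<!--.*?-->|</?[A-Za-z][^<>]*>|<![A-Za-z][^<>]*>)",
    re.DOTALL,
)


def quote_stray_characters(markup: str) -> str:
    """Escape <, > and & that appear outside of tag-like spans."""
    # With a single capturing group, re.split returns text segments at even
    # indexes and the matched tag-like spans at odd indexes.
    parts = TAG_RE.split(markup)
    for i in range(0, len(parts), 2):
        # Unescape first so that existing entities are not escaped twice.
        parts[i] = html.escape(html.unescape(parts[i]), quote=False)
    return "".join(parts)


print(quote_stray_characters("<p>1 < 2 & 3 > 2</p>"))
```

This is only a heuristic: text such as "x <b and y> z" would still be mistaken for a tag, and the contents of script or style elements would be escaped as well, so a real fix would need more care than this sketch shows.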
I propose to use html5lib as the HTML parser. This parser can use etree as its treebuilder, so it is completely compatible with the current implementation and handles HTML better than the standard etree HTMLParser. It would require some changes in the selector code, though.
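As a rough illustration of the html5lib suggestion, the sketch below parses a fragment containing a bare < with html5lib and the standard-library etree treebuilder. It assumes html5lib is installed and does not show how the selector code would actually be changed; the sample markup is made up.

```python
import html5lib

# Made-up sample markup with a bare "<" used as a less-than sign.
markup = "<html><body><p>a < b</p></body></html>"

# treebuilder="etree" builds xml.etree.ElementTree elements;
# namespaceHTMLElements=False keeps tag names free of the XHTML namespace,
# so plain paths such as ".//p" work.
root = html5lib.parse(markup, treebuilder="etree", namespaceHTMLElements=False)

paragraph = root.find(".//p")
print(paragraph.text)
```

html5lib follows the HTML5 parsing rules, under which a < that does not start a tag is kept as text, so the paragraph text should survive intact.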
Hey @markbaas, are you still interested in this issue? (I know, it's been quiet for 18 months...)
You can check @eliasdorneles's https://github.com/scrapy/parsel/pull/54 and maybe help get it merged in scrapy/parsel with some feedback.
In particular, we could provide a working html5lib-based parser for parsel, as the one shipped with lxml has issues (see https://github.com/scrapy/parsel/pull/54#discussion_r74402632 and https://mailman-mail5.webfaction.com/pipermail/lxml/2016-August/007758.html).
I also like your html5 type addition in https://github.com/scrapy/scrapy/pull/1043. It makes sense to me, as lxml's HTML parser is not really HTML5-ready.