Illegal characters (<, >, &) in HTML cause XPath-extracted values to be empty

Open markbaas opened this issue 9 years ago • 3 comments

Scrapy uses lxml as its XML parser. Because lxml is an XML parser, bare characters such as <, >, etc. are invalid, and lxml strips them away. Nevertheless, many websites use < and > as less-than and greater-than symbols.

I propose implementing a fix that quotes specifically those characters.
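
For illustration, here is a minimal sketch of what such a quoting step could look like if it ran on the raw HTML before it reaches the parser. The function name `escape_bare_chars` and both regular expressions are hypothetical, not part of Scrapy or lxml, and the heuristic is deliberately naive: it assumes a `<` starts a tag only when followed by a letter, `/`, `!` or `?`, so input like `a<b` is still treated as markup. It also leaves `>` alone, since a regex cannot reliably tell a bare `>` from the end of a tag.

```python
import re

# Assumption: a "<" that is NOT followed by a letter, "/", "!" or "?"
# cannot open a tag, closing tag, comment/doctype or processing instruction.
_BARE_LT = re.compile(r"<(?![A-Za-z/!?])")

# Assumption: an "&" is a real reference only when followed by a name or a
# decimal/hex number and a terminating ";".
_BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")


def escape_bare_chars(html: str) -> str:
    """Escape bare "&" and "<" so that a strict parser does not drop them."""
    # Escape "&" first; the "&lt;" added afterwards is never rescanned.
    html = _BARE_AMP.sub("&amp;", html)
    html = _BARE_LT.sub("&lt;", html)
    return html


print(escape_bare_chars("<p>1 < 2 & 3 &amp; 4</p>"))
```

Existing references such as `&amp;` pass through unchanged, so running the function twice should not double-escape anything.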

markbaas avatar Feb 06 '15 15:02 markbaas

Can you assign this issue to me? I will directly start working on it.

markbaas avatar Feb 06 '15 15:02 markbaas

I propose using html5lib as the HTML parser. It can use etree as its treebuilder, so it is fully compatible with the current implementation, and it handles HTML better than the standard etree HTMLParser. It would require some changes in the selector code, though.
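
As a rough sketch of the idea (not the actual selector change), html5lib can build a standard-library `xml.etree.ElementTree` tree directly. The helper name `parse_with_html5lib` is hypothetical, and the example assumes the third-party `html5lib` package is installed. Passing `treebuilder="lxml"` instead would produce an lxml tree, which is presumably what the selector code would need.

```python
import html5lib


def parse_with_html5lib(html: str):
    """Parse HTML with html5lib and return the root <html> element."""
    # namespaceHTMLElements=False keeps tag names plain ("p" rather than
    # "{http://www.w3.org/1999/xhtml}p"), so simple paths keep working.
    return html5lib.parse(html, treebuilder="etree", namespaceHTMLElements=False)


root = parse_with_html5lib("<p>1 < 2 & 3 > 2</p>")
paragraph = root.find(".//p")
# The bare "<", "&" and ">" are kept as text instead of being stripped.
print(paragraph.text)
```

html5lib follows the HTML5 parsing algorithm, which treats a `<` that cannot start a tag as ordinary character data, so no pre-escaping of the input is needed in this approach.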

markbaas avatar Feb 09 '15 15:02 markbaas

Hey @markbaas, are you still interested in this issue? (I know, it's been quiet for 18 months...) You can check @eliasdorneles's https://github.com/scrapy/parsel/pull/54 and maybe help get it merged into scrapy/parsel with some feedback. In particular, we could provide a working html5lib-based parser for parsel, as the one shipped with lxml has issues (see https://github.com/scrapy/parsel/pull/54#discussion_r74402632 and https://mailman-mail5.webfaction.com/pipermail/lxml/2016-August/007758.html). I also like your html5 type addition in https://github.com/scrapy/scrapy/pull/1043; it makes sense to me, as lxml's HTML parser is not really HTML5-ready.

redapple avatar Sep 14 '16 17:09 redapple