parsel
Selector mis-parses HTML when elements have very large bodies
This was discovered by a Reddit user investigating an Amazon page with an absurdly long <script> tag, but I was able to boil the bad outcome down to a reproducible test case.
What is expected
Selector(html).css('h1') should produce all h1 elements within the document.
What actually happens
Selector(html).css('h1') produces only the h1 elements that appear before the element containing the very large body. Neither xml.etree nor html5lib suffers from this defect.
pip install html5lib==1.0.1
pip install parsel==1.4.0
import html5lib
import parsel
import time
try:
    from xml.etree import cElementTree as ElementTree
except ImportError:
    from xml.etree import ElementTree
bad_len = 21683148  # well above libxml2's default 10 MB text node limit
bad = 'a' * bad_len
bad_html = '''
<html>
<body>
<h1>pre-div h1</h1>
<div>
<h1>pre-script h1</h1>
<p>var bogus = "{}"</p>
<h1>hello I am eclipsed</h1>
</div>
<h1>post-div h1</h1>
</body>
</html>
'''.format(bad)
t0 = time.time()
sel = parsel.Selector(text=bad_html)
t1 = time.time()
print('Selector.time={}'.format(t1 - t0))
for idx, h1 in enumerate(sel.xpath('//h1').extract()):
    print('h1[{}] = {}'.format(idx, h1))
print('ElementTree')
t0 = time.time()
doc = ElementTree.fromstring(bad_html)
t1 = time.time()
print('ElementTree.time={}'.format(t1 - t0))
for idx, h1 in enumerate(doc.findall('.//h1')):
    print('h1[{}].txt = <<{}>>'.format(idx, h1.text))
print('html5lib')
t0 = time.time()
#: :type: xml.etree.ElementTree.Element
doc = html5lib.parse(bad_html, namespaceHTMLElements=False)
t1 = time.time()
print('html5lib.time={}'.format(t1 - t0))
for idx, h1 in enumerate(doc.findall('.//h1')):
    print('h1[{}].txt = <<{}>>'.format(idx, h1.text))
produces the following output:
Selector.time=0.3661611080169678
h1[0] = <h1>pre-div h1</h1>
h1[1] = <h1>pre-script h1</h1>
ElementTree
ElementTree.time=0.1052100658416748
h1[0].txt = <<pre-div h1>>
h1[1].txt = <<pre-script h1>>
h1[2].txt = <<hello I am eclipsed>>
h1[3].txt = <<post-div h1>>
html5lib
html5lib.time=2.255831003189087
h1[0].txt = <<pre-div h1>>
h1[1].txt = <<pre-script h1>>
h1[2].txt = <<hello I am eclipsed>>
h1[3].txt = <<post-div h1>>
This was actually reported as an issue a while ago at https://github.com/scrapy/scrapy/issues/3077. I just came across it yesterday and started working on fixing it at the source (lxml).
https://github.com/lxml/lxml/pull/261 was just merged, adding the required functionality to lxml. I guess now we wait for a release to fix this bug?
Apologies, I didn't check the Scrapy issues, since I knew that if parsel was subject to this behavior then Scrapy was implicitly affected by it.
Thanks so very much for tracking the upstream fix. And given that this bug has likely existed for a while, I doubt it's high urgency.
In addition, I just confirmed that one can build lxml at commit 3fac8341442be711f967ae321d8189df86f81b14, but to close this issue, the parser_cls kwargs will need to be updated to take advantage of the new lxml kwarg; here is one such way:
--- a/parsel/selector.py~	2018-03-08 23:13:12.000000000 -0800
+++ b/parsel/selector.py	2018-03-08 23:14:41.000000000 -0800
@@ -11,13 +11,20 @@
 from .csstranslator import HTMLTranslator, GenericTranslator
 
 
+class HugeHTMLParser(html.HTMLParser):
+    def __init__(self, *args, **kwargs):
+        kwargs.setdefault('huge_tree', True)
+        super(HugeHTMLParser, self).__init__(*args, **kwargs)
+
+
 class SafeXMLParser(etree.XMLParser):
     def __init__(self, *args, **kwargs):
         kwargs.setdefault('resolve_entities', False)
         super(SafeXMLParser, self).__init__(*args, **kwargs)
 
+
 _ctgroup = {
-    'html': {'_parser': html.HTMLParser,
+    'html': {'_parser': HugeHTMLParser,
              '_csstranslator': HTMLTranslator(),
              '_tostring_method': 'html'},
     'xml': {'_parser': SafeXMLParser,
Edited to include a possible patch.
Actually, since XMLParser has the same behavior, and already supports the argument, I think patching create_root_node might be preferable:
--- a/parsel/selector.py
+++ b/parsel/selector.py
@@ -39,7 +39,7 @@ def create_root_node(text, parser_cls, base_url=None):
     """Create root node for text using given parser class.
     """
     body = text.strip().encode('utf8') or b'<html/>'
-    parser = parser_cls(recover=True, encoding='utf8')
+    parser = parser_cls(recover=True, encoding='utf8', huge_tree=True)
     root = etree.fromstring(body, parser=parser, base_url=base_url)
     if root is None:
         root = etree.fromstring(b'<html/>', parser=parser, base_url=base_url)
I thought adding this as an optional feature (e.g. a keyword argument) might be needed for performance reasons, but I've done some simple testing and can see no performance difference, so just having it enabled all the time should be fine:
>>> import timeit
>>> import lxml.html
>>> # 'data' here is a large HTML document along the lines of bad_html above
>>> vanilla_parser = lxml.html.HTMLParser()
>>> huge_parser = lxml.html.HTMLParser(huge_tree=True)
>>> def vanilla():
... lxml.html.fromstring(data, parser=vanilla_parser)
...
>>> def huge():
... lxml.html.fromstring(data, parser=huge_parser)
...
>>> timeit.repeat(vanilla, number=1000)
[9.251072943676263, 9.233368926215917, 9.403614949900657]
>>> timeit.repeat(huge, number=1000)
[9.30129877710715, 9.280261006206274, 9.6553337960504]
>>> timeit.repeat(vanilla, number=1000)
[9.392337845172733, 9.374456495046616, 9.257086860947311]
>>> timeit.repeat(huge, number=1000)
[9.3193543930538, 9.24144608201459, 9.242193738929927]
+1 to using huge_tree=True by default, though we need to add a check - parsel must still work with old lxml versions which don't support this parameter.
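For example, instead of comparing version numbers, support could be feature-detected by just trying the parameter; a minimal sketch (make_html_parser is a hypothetical helper, not parsel code):
from lxml import etree

def make_html_parser(**kwargs):
    # Old lxml builds raise TypeError when HTMLParser is handed the
    # unknown huge_tree keyword, so try it and fall back if needed.
    try:
        return etree.HTMLParser(recover=True, encoding='utf8',
                                huge_tree=True, **kwargs)
    except TypeError:
        # Fallback assumption: silently use the default, size-limited parser.
        return etree.HTMLParser(recover=True, encoding='utf8', **kwargs)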
It may also be good to expose this option somehow to parsel users; some users who do broad crawls may want to limit tree size as a security/robustness measure.
I dislike version checking and would prefer to just require new lxml with new parsel, but I'm not the one making the decision.
How would lower versions be handled? Do we return a selector that behaves differently depending on lxml version? Do we issue a warning?
I think that for having the option to disable it, a keyword argument on Selector.__init__ makes the most sense.
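For illustration, a rough sketch of how that keyword argument could be threaded down to the parser (the signature changes are hypothetical, not the actual patch):
from lxml import etree

def create_root_node(text, parser_cls, base_url=None, huge_tree=True):
    # Same as the existing helper, but huge_tree is caller-controlled
    # instead of hard-coded.
    body = text.strip().encode('utf8') or b'<html/>'
    parser = parser_cls(recover=True, encoding='utf8', huge_tree=huge_tree)
    return etree.fromstring(body, parser=parser, base_url=base_url)

Selector.__init__ would then accept the same keyword and forward it, so a broad-crawl user could opt out with, e.g., Selector(text=page, huge_tree=False).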
I opened a pull request for this issue, but the question @stranac raises still stands. I think the main problem is these two scenarios:
- lxml supports huge_tree, but it is disabled via the argument (see source code)
- lxml doesn't support huge_tree, but it is enabled (either passed explicitly or on by default)
How would you suggest handling these cases? Right now, I implemented it so that both scenarios fail and raise a ValueError.
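As a rough illustration of the behavior described above, taken literally (the helper name and the lxml version cutoff are assumptions, not the PR's actual code):
from lxml import etree

# Assumption: huge_tree support for HTMLParser landed in lxml 4.2.
LXML_HUGE_TREE = etree.LXML_VERSION >= (4, 2)

def validate_huge_tree(huge_tree):
    # Scenario 1: lxml supports huge_tree, but it was disabled via the argument.
    if LXML_HUGE_TREE and not huge_tree:
        raise ValueError('huge_tree was explicitly disabled')
    # Scenario 2: huge_tree is enabled, but this lxml does not support it.
    if not LXML_HUGE_TREE and huge_tree:
        raise ValueError('huge_tree requires a newer lxml')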
Fixed by #116.