
Selector mis-parses html when elements have very large bodies

Open · mdaniel opened this issue on Mar 3, 2018 · 7 comments

This was discovered by a Reddit user on an Amazon page with an absurdly long <script> tag, but I was able to boil the bad outcome down to a reproducible test case.

What is expected

Selector(html).css('h1') should produce all h1 elements within the document

What actually happens

Selector(html).css('h1') produces only the h1 elements that appear before the element containing the very large body. Neither xml.etree nor html5lib suffers from this defect.


# pip install html5lib==1.0.1
# pip install parsel==1.4.0
import html5lib
import parsel
import time

try:
    from xml.etree import cElementTree as ElementTree
except ImportError:
    from xml.etree import ElementTree

bad_len = 21683148  # comfortably over libxml2's 10 MB per-text-node limit
bad = 'a' * bad_len
bad_html = '''
<html>
    <body>
      <h1>pre-div h1</h1>
      <div>
        <h1>pre-script h1</h1>
        <p>var bogus = "{}"</p>
        <h1>hello I am eclipsed</h1>
      </div>
      <h1>post-div h1</h1>
    </body>
</html>
'''.format(bad)
t0 = time.time()
sel = parsel.Selector(text=bad_html)
t1 = time.time()
print('Selector.time={}'.format(t1 - t0))
for idx, h1 in enumerate(sel.xpath('//h1').extract()):
    print('h1[{}] = {}'.format(idx, h1))

print('ElementTree')
t0 = time.time()
doc = ElementTree.fromstring(bad_html)
t1 = time.time()
print('ElementTree.time={}'.format(t1 - t0))
for idx, h1 in enumerate(doc.findall('.//h1')):
    print('h1[{}].txt = <<{}>>'.format(idx, h1.text))

print('html5lib')
t0 = time.time()
#: :type: xml.etree.ElementTree.Element
doc = html5lib.parse(bad_html, namespaceHTMLElements=False)
t1 = time.time()
print('html5lib.time={}'.format(t1 - t0))
for idx, h1 in enumerate(doc.findall('.//h1')):
    print('h1[{}].txt = <<{}>>'.format(idx, h1.text))

produces the following output:

Selector.time=0.3661611080169678
h1[0] = <h1>pre-div h1</h1>
h1[1] = <h1>pre-script h1</h1>
ElementTree
ElementTree.time=0.1052100658416748
h1[0].txt = <<pre-div h1>>
h1[1].txt = <<pre-script h1>>
h1[2].txt = <<hello I am eclipsed>>
h1[3].txt = <<post-div h1>>
html5lib
html5lib.time=2.255831003189087
h1[0].txt = <<pre-div h1>>
h1[1].txt = <<pre-script h1>>
h1[2].txt = <<hello I am eclipsed>>
h1[3].txt = <<post-div h1>>

mdaniel · Mar 3, 2018
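For context, the magnitude here is not arbitrary: libxml2 enforces a hard-coded XML_MAX_TEXT_LENGTH limit of 10,000,000 bytes per text node unless the XML_PARSE_HUGE flag is set, and the 21,683,148-character string above is well past it. lxml exposes that flag as the huge_tree parser option. A minimal sketch (not from this thread) reproducing the same cutoff directly with lxml's XMLParser, which already supports the option:

from lxml import etree

# Just over libxml2's XML_MAX_TEXT_LENGTH (10,000,000 bytes per text node).
big_text = 'a' * 10000001
doc = '<root><p>{}</p><h1>after the big node</h1></root>'.format(big_text)

# Default parser: libxml2 gives up at the oversized text node, so the <h1>
# that follows it is lost -- the same truncation reported above.
default_root = etree.fromstring(doc, etree.XMLParser(recover=True))
print(default_root.findall('h1'))   # expected: []

# huge_tree=True maps to XML_PARSE_HUGE and lifts the cap.
huge_root = etree.fromstring(doc, etree.XMLParser(recover=True, huge_tree=True))
print(huge_root.findall('h1'))      # expected: [<Element h1 at 0x...>]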

This was actually reported a while ago at https://github.com/scrapy/scrapy/issues/3077. I just came across it yesterday and started working on fixing it at the source (lxml).

stranac · Mar 8, 2018

https://github.com/lxml/lxml/pull/261 was just merged, adding the required functionality to lxml. I guess now we wait for a release to fix this bug?

stranac · Mar 8, 2018

Apologies, I didn't check the Scrapy issues, since I knew that if parsel was subject to this behavior, Scrapy was implicitly affected by it.

Thanks so very much for tracking the upstream fix. Given that this bug has likely existed for a while, I doubt it's high urgency.

In addition, I just confirmed that one can build lxml#3fac8341442be711f967ae321d8189df86f81b14, but to close this issue, the parser_cls kwargs will need to be updated to take advantage of the new lxml kwarg; here is one such way:

--- a/parsel/selector.py~	2018-03-08 23:13:12.000000000 -0800
+++ b/parsel/selector.py	2018-03-08 23:14:41.000000000 -0800
@@ -11,13 +11,20 @@
 from .csstranslator import HTMLTranslator, GenericTranslator


+class HugeHTMLParser(html.HTMLParser):
+    def __init__(self, *args, **kwargs):
+        kwargs.setdefault('huge_tree', True)
+        super(HugeHTMLParser, self).__init__(*args, **kwargs)
+
+
 class SafeXMLParser(etree.XMLParser):
     def __init__(self, *args, **kwargs):
         kwargs.setdefault('resolve_entities', False)
         super(SafeXMLParser, self).__init__(*args, **kwargs)

+
 _ctgroup = {
-    'html': {'_parser': html.HTMLParser,
+    'html': {'_parser': HugeHTMLParser,
              '_csstranslator': HTMLTranslator(),
              '_tostring_method': 'html'},
     'xml': {'_parser': SafeXMLParser,

edited to include a possible patch

mdaniel · Mar 9, 2018
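One nicety of the setdefault approach in that patch is that the new default stays overridable. A tiny illustration (hypothetical usage; it assumes an lxml build whose HTMLParser already accepts huge_tree):

from lxml import html

class HugeHTMLParser(html.HTMLParser):
    def __init__(self, *args, **kwargs):
        # setdefault, not overwrite: callers may still pass huge_tree=False
        kwargs.setdefault('huge_tree', True)
        super(HugeHTMLParser, self).__init__(*args, **kwargs)

p_default = HugeHTMLParser()                # huge_tree on by default
p_capped = HugeHTMLParser(huge_tree=False)  # explicit opt-out still works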

Actually, since XMLParser has the same behavior, and already supports the argument, I think patching create_root_node might be preferable:

--- a/parsel/selector.py
+++ b/parsel/selector.py
@@ -39,7 +39,7 @@ def create_root_node(text, parser_cls, base_url=None):
     """Create root node for text using given parser class.
     """
     body = text.strip().encode('utf8') or b'<html/>'
-    parser = parser_cls(recover=True, encoding='utf8')
+    parser = parser_cls(recover=True, encoding='utf8', huge_tree=True)
     root = etree.fromstring(body, parser=parser, base_url=base_url)
     if root is None:
         root = etree.fromstring(b'<html/>', parser=parser, base_url=base_url)

I thought adding this as an optional feature (e.g. a keyword argument) might be needed for performance reasons, but I've done some simple testing and can see no performance difference, so having it enabled all the time should be fine:

>>> import timeit
>>> import lxml.html
>>> # `data` is a large HTML document, e.g. the bad_html from the repro above
>>> vanilla_parser = lxml.html.HTMLParser()
>>> huge_parser = lxml.html.HTMLParser(huge_tree=True)
>>> def vanilla():
...     lxml.html.fromstring(data, parser=vanilla_parser)
...
>>> def huge():
...     lxml.html.fromstring(data, parser=huge_parser)
...
>>> timeit.repeat(vanilla, number=1000)
[9.251072943676263, 9.233368926215917, 9.403614949900657]
>>> timeit.repeat(huge, number=1000)
[9.30129877710715, 9.280261006206274, 9.6553337960504]
>>> timeit.repeat(vanilla, number=1000)
[9.392337845172733, 9.374456495046616, 9.257086860947311]
>>> timeit.repeat(huge, number=1000)
[9.3193543930538, 9.24144608201459, 9.242193738929927]

stranac · Mar 9, 2018

+1 to using huge_tree=True by default, though we need to add a check: parsel must keep working with old lxml versions that don't support this parameter.

It may also be good to expose this option to parsel users somehow; some users who do broad crawls may want to limit tree size as a security/robustness measure.

kmike · Mar 12, 2018
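One way that check could look (a sketch; the constant name LXML_SUPPORTS_HUGE_TREE is invented here, and it assumes lxml rejects unknown parser keyword arguments with a TypeError):

from lxml import etree

try:
    # Probe once at import time: lxml versions without the feature
    # reject the unknown keyword argument.
    etree.HTMLParser(huge_tree=True)
    LXML_SUPPORTS_HUGE_TREE = True
except TypeError:
    LXML_SUPPORTS_HUGE_TREE = False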

I dislike version checking and would prefer to simply require a new lxml for the new parsel, but I'm not the one making the decision.

How would lower versions be handled? Do we return a selector that behaves differently depending on lxml version? Do we issue a warning?

For having the option to disable it, I think a keyword argument on Selector.__init__ makes the most sense.

stranac · Mar 13, 2018
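A rough sketch of that shape (illustrative only, not the merged code; the Selector here is reduced to the relevant path):

from lxml import etree, html

def create_root_node(text, parser_cls, base_url=None, huge_tree=True):
    """Create root node for text, forwarding huge_tree only when requested."""
    body = text.strip().encode('utf8') or b'<html/>'
    kwargs = {'recover': True, 'encoding': 'utf8'}
    if huge_tree:
        kwargs['huge_tree'] = True  # assumes an lxml that supports it
    parser = parser_cls(**kwargs)
    return etree.fromstring(body, parser=parser, base_url=base_url)

class Selector(object):
    def __init__(self, text, huge_tree=True):
        # huge_tree=False lets broad-crawl users keep libxml2's size cap,
        # per the robustness concern raised above.
        self.root = create_root_node(text, html.HTMLParser, huge_tree=huge_tree)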

I opened a pull request for this issue, but the question @stranac raises still stands. I think the main problems are these scenarios:

  1. lxml supports huge_tree but it is disabled via the argument (see source code)
  2. lxml doesn't support huge_tree but it is enabled (either passed or on via default).

How would you suggest handling these cases? Right now, I have implemented it so that both scenarios fail and raise a ValueError.

Langdi · Jul 10, 2018
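For scenario 2, the guard might look roughly like this (a sketch with invented names; the actual pull request's checks may differ, and scenario 1 would need a similar explicit failure):

def check_huge_tree(huge_tree, lxml_supports_huge_tree):
    # Fail loudly instead of silently returning a Selector that behaves
    # differently depending on the installed lxml version.
    if huge_tree and not lxml_supports_huge_tree:
        raise ValueError(
            'huge_tree=True requires an lxml build whose HTMLParser '
            'supports the huge_tree argument')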

Fixed by #116.

wRAR · Oct 28, 2023
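For completeness: assuming the released fix kept the keyword-argument design discussed above, usage would look roughly like this (check the parsel docs for the exact signature):

import parsel

html_text = '<html><body><h1>hi</h1></body></html>'  # or the huge repro document above

sel = parsel.Selector(text=html_text)                      # huge_tree on by default
capped = parsel.Selector(text=html_text, huge_tree=False)  # opt back in to libxml2's cap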