requests-html

Fix RecursionError when parsing HTML

Open 521xueweihan opened this issue 3 years ago • 2 comments

Fixes a RecursionError raised while parsing the HTML of this page:

https://db-engines.com/en/ranking

521xueweihan, Oct 20 '21 02:10

Reproduce:

Python 3.10.9 (main, Dec 19 2022, 17:35:49) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> p = session.get('https://db-engines.com/en/ranking')
>>> p.html.text
Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/lxml/html/soupparser.py", line 33, in fromstring
    return _parse(data, beautifulsoup, makeelement, **bsargs)
  File "/usr/lib/python3.10/site-packages/lxml/html/soupparser.py", line 79, in _parse
    root = _convert_tree(tree, makeelement)
  File "/usr/lib/python3.10/site-packages/lxml/html/soupparser.py", line 152, in _convert_tree
    res_root = convert_node(html_root)
  File "/usr/lib/python3.10/site-packages/lxml/html/soupparser.py", line 216, in convert_node
    return handler(bs_node, parent)
  File "/usr/lib/python3.10/site-packages/lxml/html/soupparser.py", line 255, in convert_tag
    handler(child, res)
  File "/usr/lib/python3.10/site-packages/lxml/html/soupparser.py", line 255, in convert_tag
    handler(child, res)
  File "/usr/lib/python3.10/site-packages/lxml/html/soupparser.py", line 255, in convert_tag
    handler(child, res)
  [Previous line repeated 985 more times]
  File "/usr/lib/python3.10/site-packages/lxml/html/soupparser.py", line 242, in convert_tag
    res = etree.SubElement(parent, bs_node.name, attrib=attribs)
  File "src/lxml/etree.pyx", line 3156, in lxml.etree.SubElement
  File "src/lxml/apihelpers.pxi", line 199, in lxml.etree._makeSubElement
  File "src/lxml/apihelpers.pxi", line 195, in lxml.etree._makeSubElement
  File "src/lxml/etree.pyx", line 1630, in lxml.etree._elementFactory
  File "src/lxml/classlookup.pxi", line 403, in lxml.etree._parser_class_lookup
  File "src/lxml/classlookup.pxi", line 456, in lxml.etree._custom_class_lookup
  File "/usr/lib/python3.10/site-packages/lxml/html/__init__.py", line 734, in lookup
    if node_type == 'element':
RecursionError: maximum recursion depth exceeded in comparison
>>>
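
The repeated `convert_tag` frames suggest that the soupparser fallback recurses once per level of element nesting, so a sufficiently deep (or badly nested) document can exhaust Python's default recursion limit. Below is a minimal, standard-library-only sketch of that failure mode. `Node`, `build_chain`, `convert`, and `try_convert` are hypothetical names used for illustration; this is not lxml's code.

```python
import sys
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a parsed HTML tag."""
    name: str
    children: list = field(default_factory=list)


def build_chain(depth):
    """Build a tree that is `depth` levels deep, with one child per level."""
    root = Node("html")
    current = root
    for _ in range(depth - 1):
        child = Node("div")
        current.children.append(child)
        current = child
    return root


def convert(node):
    """Count nodes recursively, using one Python frame per nesting level."""
    total = 1
    for child in node.children:
        total += convert(child)
    return total


def try_convert(node):
    """Return the node count, or None if the tree is too deep to recurse over."""
    try:
        return convert(node)
    except RecursionError:
        return None


shallow = build_chain(50)
# Twice the current limit is guaranteed to overflow the call stack.
deep = build_chain(sys.getrecursionlimit() * 2)
```

Here `try_convert(shallow)` completes normally, while `try_convert(deep)` hits the recursion limit and returns `None` instead of raising.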

surister, Feb 26 '23 13:02

@521xueweihan

I'd love to see a test for this. The proposed fix could also be refactored slightly, since several exception types can be caught in a single clause:

try:
    ...
except (Exception1, Exception2):
    pass
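
For reference, a runnable sketch of the tuple form could look like the following. `parse_with_fallback`, `strict_parser`, and `lenient_parser` are hypothetical names, and the choice of `RecursionError` and `ValueError` is an assumption, since the exceptions actually handled in the PR are not shown in this thread.

```python
def parse_with_fallback(markup, primary, fallback):
    """Try `primary`; if it raises either listed exception, use `fallback`."""
    try:
        return primary(markup)
    except (RecursionError, ValueError):
        # A single handler covers both exception types.
        return fallback(markup)


def strict_parser(markup):
    """Hypothetical parser that rejects markup without a closing tag."""
    if "</" not in markup:
        raise ValueError("no closing tag found")
    return "strict"


def lenient_parser(markup):
    """Hypothetical parser that accepts anything."""
    return "lenient"


well_formed = parse_with_fallback("<p>ok</p>", strict_parser, lenient_parser)
broken = parse_with_fallback("<p>oops", strict_parser, lenient_parser)
```

Unlike a bare `pass`, this variant returns a fallback result, so the caller can tell which path was taken.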

I realize it's been a couple of years, so I understand if you are no longer interested or active in this repo. In a few days I will do it myself, and I will reference this PR to give you credit.

surister, Feb 26 '23 13:02