ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

Open advance512 opened this issue 5 years ago • 5 comments

When scraping the URL: http://www.reakes.com/

Error:

Traceback (most recent call last):
  File ".../website_scraping.py", line 161, in getWebsiteScrapedDataForURL
    data = extruct.extract(r.text, base_url=base_url)
  File "/usr/local/lib/python3.8/site-packages/extruct/_extruct.py", line 58, in extract
    tree = parse_xmldom_html(htmlstring, encoding=encoding)
  File "/usr/local/lib/python3.8/site-packages/extruct/utils.py", line 16, in parse_xmldom_html
    return lxml.html.fromstring(html, parser=parser)
  File "/usr/local/lib/python3.8/site-packages/lxml/html/__init__.py", line 875, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/usr/local/lib/python3.8/site-packages/lxml/html/__init__.py", line 761, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "src/lxml/etree.pyx", line 3237, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1871, in lxml.etree._parseMemoryDocument
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

I would expect a scraper to fail gracefully and not let a problem with lxml propagate out, but not sure what the philosophy of extruct is specifically.

This might be helpful: https://stackoverflow.com/questions/15830421/xml-unicode-strings-with-encoding-declaration-are-not-supported

advance512 avatar Jun 09 '20 17:06 advance512

Thanks for raising the issue :+1:

> I would expect a scraper to fail gracefully and not let a problem with lxml propagate out, but not sure what the philosophy of extruct is specifically.

The philosophy is to propagate errors by default, but the "errors" argument of the extract function lets you ignore or log them instead: https://github.com/scrapinghub/extruct/blob/a64ce58210a2151d0d760fb3ef7b28d0691e539a/extruct/_extruct.py#L30-L31

Let's keep the issue open to tackle the underlying problem.

lopuhin avatar Jun 10 '20 07:06 lopuhin

@lopuhin Got it. Is there a way to log the failures, but as a warning/info instead of an error?

advance512 avatar Jun 16 '20 15:06 advance512

@advance512 not really. I'm afraid that if you want to adjust the logging level, you'd have to catch the exception yourself.

lopuhin avatar Jun 16 '20 15:06 lopuhin

@lopuhin but then I won't get the actual valid data that is extracted..

advance512 avatar Jun 16 '20 15:06 advance512
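
One possible way around this trade-off, sketched under assumptions: call the extractor once per syntax and catch failures yourself, so a failure in one syntax is logged as a warning while the results from the other syntaxes are kept. `extract_per_syntax` and its `extract_fn` parameter are hypothetical names, not part of extruct; the sketch assumes `extract_fn` accepts a `syntaxes` list and returns a dict keyed by syntax name, as `extruct.extract` does. It will not help when the HTML cannot be parsed at all, as in the traceback above, because then every call fails.

```python
import logging

logger = logging.getLogger(__name__)

# Syntax names as used by extruct; adjust to the ones you need.
SYNTAXES = ("json-ld", "microdata", "opengraph", "microformat", "rdfa")


def extract_per_syntax(extract_fn, html, syntaxes=SYNTAXES, **kwargs):
    """Hypothetical helper: run extract_fn once per syntax.

    A failure in one syntax is logged at WARNING level and recorded as an
    empty list, while the data from the remaining syntaxes is kept.
    """
    results = {}
    for syntax in syntaxes:
        try:
            # Assumption: extract_fn returns a dict keyed by syntax name.
            results.update(extract_fn(html, syntaxes=[syntax], **kwargs))
        except Exception as exc:
            logger.warning("extraction failed for %s: %s", syntax, exc)
            results[syntax] = []
    return results
```

With extruct this could be called as `extract_per_syntax(extruct.extract, r.content, base_url=base_url)`. The document is parsed again for each syntax, so the sketch trades speed for isolation.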

Btw, a note for anyone experiencing the original issue: even though this is something we should fix, as a workaround you can pass the HTML as bytes (as sent by the server) instead of a string.

lopuhin avatar Dec 06 '21 10:12 lopuhin
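
A minimal sketch of that workaround, for illustration only. `html_as_bytes` is a hypothetical helper, not part of extruct. When the page is fetched with requests, `r.content` already holds the bytes sent by the server and is the better choice; the helper only covers the case where just a decoded string is available. It assumes that re-encoding as UTF-8 matches the `encoding` value passed on to the parser (the `encoding=encoding` argument visible in the traceback).

```python
def html_as_bytes(html, encoding="utf-8"):
    """Hypothetical helper: return the HTML as bytes.

    lxml rejects str input that starts with an XML encoding declaration,
    but accepts the same document as bytes.
    """
    if isinstance(html, bytes):
        # Already raw bytes (for example requests' r.content): pass through.
        return html
    # Assumption: the chosen encoding matches what the parser is told to use.
    return html.encode(encoding)
```

In the failing call from the traceback, this would mean passing `r.content` (or `html_as_bytes(r.text)`) to `extruct.extract` instead of `r.text`.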