ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
When scraping the URL: http://www.reakes.com/
Error:
Traceback (most recent call last):
  File ".../website_scraping.py", line 161, in getWebsiteScrapedDataForURL
    data = extruct.extract(r.text, base_url=base_url)
  File "/usr/local/lib/python3.8/site-packages/extruct/_extruct.py", line 58, in extract
    tree = parse_xmldom_html(htmlstring, encoding=encoding)
  File "/usr/local/lib/python3.8/site-packages/extruct/utils.py", line 16, in parse_xmldom_html
    return lxml.html.fromstring(html, parser=parser)
  File "/usr/local/lib/python3.8/site-packages/lxml/html/__init__.py", line 875, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/usr/local/lib/python3.8/site-packages/lxml/html/__init__.py", line 761, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "src/lxml/etree.pyx", line 3237, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1871, in lxml.etree._parseMemoryDocument
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
I would expect a scraper to fail gracefully and not let a problem with lxml propagate out, but not sure what the philosophy of extruct is specifically.
This might be helpful: https://stackoverflow.com/questions/15830421/xml-unicode-strings-with-encoding-declaration-are-not-supported
Thanks for raising the issue :+1:
> I would expect a scraper to fail gracefully and not let a problem with lxml propagate out, but not sure what the philosophy of extruct is specifically.
The philosophy is to propagate errors by default, but the "errors" argument of the extract function lets you ignore or log them instead: https://github.com/scrapinghub/extruct/blob/a64ce58210a2151d0d760fb3ef7b28d0691e539a/extruct/_extruct.py#L30-L31
Let's keep the issue open to tackle the underlying problem.
@lopuhin Got it. Is there a way to log the failures, but as a warning/info instead of an error?
@advance512 not really, I'm afraid. If you wish to adjust the logging level, you'd have to catch the exception yourself.
@lopuhin but then I won't get the actual valid data that is extracted..
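One possible middle ground, as a rough sketch only: call the extractor once per syntax, catch whatever it raises, and log it at warning level, so that syntaxes which do parse are still returned. `extract_leniently` and its `extract_fn` parameter are hypothetical names; `extract_fn` is assumed to behave like `extruct.extract` and to accept `base_url` and `syntaxes` keyword arguments, and the syntax names are assumed to match the ones extruct accepts.

```python
import logging

logger = logging.getLogger(__name__)

# Assumption: these names match the values accepted by the `syntaxes` argument.
SYNTAXES = ("microdata", "opengraph", "json-ld", "microformat", "rdfa")


def extract_leniently(extract_fn, html, base_url, syntaxes=SYNTAXES):
    """Run extract_fn once per syntax, logging failures as warnings.

    A syntax that raises is reported with an empty list; the others keep
    whatever extract_fn returned for them.
    """
    results = {}
    for syntax in syntaxes:
        try:
            results.update(extract_fn(html, base_url=base_url, syntaxes=[syntax]))
        except Exception as exc:  # deliberately broad: any failure becomes a warning
            logger.warning("Extraction of %s failed for %s: %s", syntax, base_url, exc)
            results[syntax] = []
    return results
```

Usage would be along the lines of `extract_leniently(extruct.extract, html, base_url)`. Note that this re-parses the HTML for every syntax, a single bad item still drops that whole syntax, and it would not help with the original error, which is raised while parsing the document before any syntax is extracted.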
Btw, a note for anyone hitting the original issue: even though this is something we should fix, as a workaround you can pass the HTML as bytes (as sent by the server) instead of a string.
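With a `requests`-style response, the bytes workaround might look roughly like this. `extract_from_response` and `extract_fn` are hypothetical names used for illustration; `extract_fn` stands in for `extruct.extract`, and the response is assumed to expose the raw body as `.content`.

```python
def extract_from_response(response, extract_fn, base_url, **kwargs):
    """Feed the raw response body (bytes) to the extractor.

    `response.content` holds the undecoded bytes sent by the server, whereas
    `response.text` is a decoded str. A str that still carries an encoding
    declaration is what lxml rejects with the ValueError in the traceback.
    """
    return extract_fn(response.content, base_url=base_url, **kwargs)
```

In the scraper from the traceback this would be something like `extract_from_response(r, extruct.extract, base_url)`, optionally with `errors="log"` to log extraction errors instead of raising them.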