html-text Handle non-breaking spaces and other special unicode characters

See discussion in https://github.com/TeamHG-Memex/html-text/pull/2#issuecomment-304737274

May 30 '17 07:05 lopuhin

not sure if this is the same issue, but I'm getting:

ERROR:scrapy.core.scraper:Spider error processing <GET http://www.magnetoinvestigators.com/contact-us> (referer: http://www.magnetoinvestigators.com)
Traceback (most recent call last):
  File "/home/johannes/.virtualenvs/broadcrawl/lib/python3.5/site-packages/html_text/html_text.py", line 77, in cleaned_selector
    tree = _cleaned_html_tree(html)
  File "/home/johannes/.virtualenvs/broadcrawl/lib/python3.5/site-packages/html_text/html_text.py", line 33, in _cleaned_html_tree
    tree = lxml.html.fromstring(html.encode('utf8'), parser=parser)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udb9e' in position 785: surrogates not allowed

Apparently this is stricter behavior introduced in Python 3. Possibly passing the surrogateescape error handler to encode could help...?
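
To make the question concrete, here is a minimal sketch of how the standard codec error handlers treat the character from the traceback. It uses only plain Python 3 (no html-text, Scrapy or lxml), and `try_encode` is a hypothetical helper written for this illustration. One caveat: surrogateescape only round-trips code points in U+DC80..U+DCFF, so it may not help for '\udb9e', which is a high surrogate.

```python
# The lone surrogate reported in the traceback above.
bad = '\udb9e'


def try_encode(text, errors):
    """Hypothetical helper: return the UTF-8 bytes, or None if encoding fails."""
    try:
        return text.encode('utf-8', errors)
    except UnicodeEncodeError:
        return None


# Default handler: lone surrogates are rejected.
strict = try_encode(bad, 'strict')

# surrogateescape only maps U+DC80..U+DCFF back to bytes 0x80..0xFF,
# so a high surrogate such as U+DB9E is still rejected.
escaped = try_encode(bad, 'surrogateescape')

# surrogatepass encodes the surrogate as if it were a normal code point;
# the result is not valid UTF-8 for strict decoders.
passed = try_encode(bad, 'surrogatepass')

# replace substitutes '?' for anything that cannot be encoded.
replaced = try_encode(bad, 'replace')
```

Of the handlers shown, only surrogatepass and replace get past the encode step, and both change what lxml would receive, so this is an illustration of the behavior rather than a recommended fix.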

Also see:

https://stackoverflow.com/questions/31898353/python-cant-encode-with-surrogateescape
https://stackoverflow.com/a/38147966

Oct 02 '17 15:10 codinguncut

Thanks for the report, @codinguncut! For now you can work around this issue by parsing the document yourself and passing the resulting lxml.html.HtmlElement to html_text.extract_text.

Oct 02 '17 16:10 lopuhin

The issue is that Scrapy used the Content-Type header to get the encoding ('utf-7'), while the site in fact seems to return utf-8. Scrapy then decodes the body using 'errors=replace' (w3lib_replace to be precise, see https://github.com/scrapy/w3lib/blob/34435d085c6adb14c94cd0188c23f6dc7d4da0f7/w3lib/encoding.py#L174), and this produces output which can't be encoded back to utf-8 for some reason.

I think the right place to fix it is probably w3lib. html-text can provide extra robustness by using surrogateescape, but it would be better to get a proper unicode body before passing it to html_text.
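
As a rough sketch of what "a proper unicode body" could mean on the caller's side, the snippet below replaces any lone surrogates with U+FFFD before the text is encoded. The function `replace_surrogates` and the sample `body` string are assumptions made up for this illustration; they are not part of html-text, w3lib or Scrapy, and the replacement discards whatever the original bytes were.

```python
import re

# Matches any code point in the surrogate range U+D800..U+DFFF.
_SURROGATES = re.compile('[\ud800-\udfff]')


def replace_surrogates(text, replacement='\ufffd'):
    """Hypothetical helper: swap lone surrogates for a replacement character."""
    return _SURROGATES.sub(replacement, text)


# Made-up body containing the surrogate from the traceback.
body = '<p>contact\udb9e us</p>'

clean = replace_surrogates(body)

# The cleaned text can now be encoded the way html_text does internally.
encoded = clean.encode('utf8')
```

This only papers over the symptom; decoding the response with the correct encoding in the first place would avoid producing surrogates at all.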

Oct 02 '17 16:10 kmike

FTR, response.css / response.xpath also don't work for this website.

Oct 02 '17 16:10 kmike