Decoding error on websites with faulty encoding
Parsing a website with faulty encoding, such as https://www.societe.com/societe/ankaboot-832320170.html, raises a decoding error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 2: invalid continuation byte
Exception ignored in: 'selectolax.lexbor.text_callback'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 2: invalid continuation byte
This could be fixed by changing the decoding error policy from "strict" (the default) to "replace" at https://github.com/rushter/selectolax/blob/19ee5e054a33415c66d617a9d9a473348c16cbd0/selectolax/lexbor/node.pxi#L863:
py_str = text.decode(_ENCODING, "replace")
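For reference, here is a minimal sketch of how the three standard error policies of Python's bytes.decode behave on a byte string that is not valid UTF-8. It uses only the standard library, not selectolax; the sample bytes ("voilà tout" encoded as Latin-1) and the helper name decode_with_policy are made up for illustration.

```python
# Hypothetical sample: "voilà tout" encoded as Latin-1, so the byte 0xe0
# is followed by a space, which is not a valid UTF-8 continuation byte.
RAW = b"voil\xe0 tout"


def decode_with_policy(raw: bytes, errors: str) -> str:
    """Decode raw as UTF-8 using the given error policy."""
    return raw.decode("utf-8", errors)


# "strict" (the default policy of bytes.decode) raises on the first bad byte.
try:
    decode_with_policy(RAW, "strict")
except UnicodeDecodeError as exc:
    print("strict :", exc.reason)

# "replace" substitutes U+FFFD (the replacement character) for the bad byte.
print("replace:", decode_with_policy(RAW, "replace"))

# "ignore" silently drops the bad byte.
print("ignore :", decode_with_policy(RAW, "ignore"))
```

With "replace" the damaged character stays visible in the output as U+FFFD, whereas "ignore" removes it without a trace, which may be worth considering when choosing a default.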
Hi, I just had a look, and it turns out that the Modest HTMLParser already handles this by accepting a decode_errors keyword argument:
HTMLParser(html, decode_errors="ignore")
Also, for Modest the default is "ignore".
I updated the code to have the same behavior for Lexbor, but still need to add tests / document, so I'll finish that over the weekend :)
Hi @JuroOravec, thanks for the awesome package! Is there any ETA on a fix for this issue?
The "replace" policy is now used. It will be available in the next release.