Decoding error on websites with faulty encoding

Open baderdean opened this issue 1 year ago • 2 comments

Decoding a website with faulty encoding, such as https://www.societe.com/societe/ankaboot-832320170.html, fails with:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 2: invalid continuation byte
Exception ignored in: 'selectolax.lexbor.text_callback'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 2: invalid continuation byte

This could be fixed by changing the error policy of the decode call from "strict" (the default) to "replace": https://github.com/rushter/selectolax/blob/19ee5e054a33415c66d617a9d9a473348c16cbd0/selectolax/lexbor/node.pxi#L863

py_str = text.decode(_ENCODING, "replace")
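To show what the policy change means in practice, here is a minimal sketch using only Python's built-in bytes.decode. The sample bytes and the helper names (decode_with_policy, strict_fails) are made up for illustration; this is not selectolax's internal code, only the standard decoding behavior it relies on.

```python
# Illustration only: the sample bytes and helper names are hypothetical,
# and this is plain Python, not selectolax's internal code.
RAW = b"<p>d\xe9j\xe0 vu</p>"  # Latin-1 encoded text, invalid as UTF-8


def decode_with_policy(raw: bytes, errors: str) -> str:
    """Decode bytes as UTF-8 using the given error policy."""
    return raw.decode("utf-8", errors)


def strict_fails(raw: bytes) -> bool:
    """Return True if strict decoding raises UnicodeDecodeError."""
    try:
        decode_with_policy(raw, "strict")
    except UnicodeDecodeError:
        return True
    return False


if __name__ == "__main__":
    print(strict_fails(RAW))                   # strict raises on the bad bytes
    print(decode_with_policy(RAW, "replace"))  # bad bytes become U+FFFD
    print(decode_with_policy(RAW, "ignore"))   # bad bytes are dropped
```

With "replace", each undecodable byte turns into the U+FFFD replacement character, so the rest of the text survives and the damage stays visible. With "ignore", the bytes are silently dropped.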

baderdean avatar Oct 23 '24 17:10 baderdean

Hi, I just had a look, and it turns out that the Modest HTMLParser already handles this: you can pass it a decode_errors kwarg:

HTMLParser(html, decode_errors="ignore")

Also, for Modest the default is "ignore".

I've updated the code so that Lexbor behaves the same way, but I still need to add tests and documentation, so I'll finish that over the weekend :)

JuroOravec avatar Oct 25 '24 07:10 JuroOravec

Hi @JuroOravec, thanks for the awesome package! Is there any ETA on a fix for this issue?

pineapple-pokopo avatar Dec 30 '24 10:12 pineapple-pokopo

"replace" is now used as the decode error policy. It will be available in the next release.

rushter avatar Sep 27 '25 20:09 rushter