python-readability

.text may guess the encoding incorrectly

Open 097115 opened this issue 4 years ago • 4 comments

Steps to reproduce:

    import requests
    from readability import Document

    response = requests.get('https://polit.ru/article/2021/09/14/ps_dennet/')
    # response.text is decoded with the encoding that requests guessed, which may be wrong
    print(Document(response.text).summary())

However, if we use .content:

    print(Document(response.content).summary())

everything will be just fine.
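
For illustration, the effect can be reproduced offline with the standard library alone. This is only a sketch: the sample string is made up, and ISO-8859-1 is an assumed wrong guess, not necessarily what happened with the page above.

```python
# Hypothetical page body: UTF-8 bytes, as a server might send them.
raw = "Привет, мир".encode("utf-8")

# Roughly what .text yields when the guessed encoding is wrong
# (ISO-8859-1 is an assumption made for this example).
wrong = raw.decode("iso-8859-1")

# What you get when the bytes are decoded with their real encoding,
# which is what a parser given the raw bytes has a chance to detect.
right = raw.decode("utf-8")

print(wrong)
print(right)
```

Passing `.content` keeps the original bytes intact, so the decision about the encoding is left to the HTML parser instead of being made too early.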

Maybe updating README.rst is worth a shot :)

097115 avatar Sep 15 '21 05:09 097115

So, do you think that the encoding guessing in requests is reliable? I think it is not: https://stackoverflow.com/questions/44203397/python-requests-get-returns-improperly-decoded-text-instead-of-utf-8

buriy avatar Sep 15 '21 10:09 buriy

My point is exactly that guessing is unreliable (and therefore using .content is a better approach)

:)

097115 avatar Sep 15 '21 10:09 097115

Oh, thanks. That's a good point. I'll update the README. Actually, both ways are unreliable, so I think it is better if developers can choose the best option. Technically, the requests library can sometimes guess better, because it also has access to the Content-Type header. But that header can carry wrong information, and I know it happens sometimes.
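
One possible way to "choose", as a rough sketch rather than anything the library does itself: trust the header only when it explicitly declares a charset that actually decodes the body, and otherwise hand over the raw bytes. The helper names `declared_charset` and `choose_html_input` are hypothetical; the sketch uses only the standard library so it runs without a network.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def choose_html_input(body, content_type):
    """Return decoded text if the header's charset works, else the raw bytes."""
    charset = declared_charset(content_type)
    if charset is None:
        # No charset declared: leave detection to the HTML parser.
        return body
    try:
        return body.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or clearly wrong charset: fall back to the raw bytes.
        return body


print(type(choose_html_input(b"<p>hi</p>", "text/html")).__name__)
print(type(choose_html_input(b"<p>hi</p>", "text/html; charset=utf-8")).__name__)
```

With requests, the inputs would presumably be `response.content` and `response.headers.get("Content-Type", "")`. Note the limitation: a header that names a wrong but permissive charset (one that decodes any byte sequence) still slips through, so this does not make the header trustworthy.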

buriy avatar Sep 15 '21 11:09 buriy

Updated the README. Thanks to everyone involved!

buriy avatar Dec 09 '22 07:12 buriy