python-readability
.text may guess the encoding incorrectly
Steps to reproduce:
import requests
from readability import Document
response = requests.get('https://polit.ru/article/2021/09/14/ps_dennet/')
print(Document(response.text).summary())
However, if we use .content:
print(Document(response.content).summary())
everything will be just fine.
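A small offline sketch of what probably goes wrong here. It assumes the server sends a text/html Content-Type without a charset, in which case requests falls back to ISO-8859-1 when building .text; I have not checked the actual headers of the page above, and the sample string is made up:

```python
# Illustration only: no network access, the sample string is hypothetical.
original = "Привет, мир"
raw = original.encode("utf-8")  # stands in for response.content

# Assumption: Content-Type is text/* with no charset, so .text is
# effectively the bytes decoded as ISO-8859-1.
wrong = raw.decode("iso-8859-1")
# Decoding with the real encoding recovers the text.
right = raw.decode("utf-8")

print(wrong == original)  # False
print(right == original)  # True
```

Passing the raw bytes to Document avoids this step entirely and leaves encoding detection to the parser.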
Maybe updating README.rst is worth a shot :)
So, do you think that the encoding guessing in requests is reliable?
I think it is not: https://stackoverflow.com/questions/44203397/python-requests-get-returns-improperly-decoded-text-instead-of-utf-8
My point is exactly that guessing is unreliable (and therefore using .content is a better approach)
:)
Oh, thanks. That's a good point.
I'll update README.
Actually, both ways are unreliable, so I think it is better if developers can choose the best option.
Technically, the requests library can sometimes guess better, because it can also read the Content-Type header. But that header can carry wrong information, and I know that happens sometimes.
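One way a developer could make that choice, as a rough sketch rather than anything the library does itself: trust the Content-Type header only when it explicitly names a charset, and otherwise hand the raw bytes over. The helper names `declared_charset` and `html_for_readability` are hypothetical; only the standard library is used, so the header string has to be passed in by hand (for example from `response.headers.get("Content-Type")`).

```python
from email.message import Message
from typing import Optional, Union


def declared_charset(content_type: Optional[str]) -> Optional[str]:
    """Return the charset named in a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter exists.
    return msg.get_content_charset()


def html_for_readability(body: bytes,
                         content_type: Optional[str]) -> Union[str, bytes]:
    """Decode only when the server explicitly declared a usable charset."""
    charset = declared_charset(content_type)
    if charset is None:
        return body  # no declaration: pass the bytes through
    try:
        return body.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or a header that does not match the bytes.
        return body


if __name__ == "__main__":
    sample = "Привет".encode("utf-8")
    print(type(html_for_readability(sample, "text/html; charset=UTF-8")))
    print(type(html_for_readability(sample, "text/html")))
```

The result could then be passed as `Document(html_for_readability(response.content, response.headers.get("Content-Type")))`. This still trusts a header that declares the wrong charset whenever the bytes happen to decode under it, so it does not remove the unreliability, it only makes the choice explicit.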
Updated readme. Thanks to everyone involved!