feedparser
encodings: decode utf-8 with errors='replace' when confident
"Confident" means "metadata of the document explicitly indicates that the encoding is UTF-8".
Background of the patch
When a UTF-8 feed contains a few invalid byte sequences but is otherwise fine, feedparser ends up parsing it as iso-8859-2 (or another encoding detected by chardet, if installed), even though both the HTTP and XML headers explicitly declare its encoding as utf-8.
To handle this better, we should decode the feed as UTF-8 with errors='replace', so that each invalid sequence becomes U+FFFD and the rest of the text is preserved.
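The idea can be sketched roughly as follows. This is only an illustration, not the code in the patch: `decode_if_confident`, `is_utf8` and `XML_DECL_RE` are hypothetical names, and the sketch assumes "confident" means that both the HTTP charset and the XML declaration name UTF-8 (the actual rule in the patch may differ).

```python
import codecs
import re

# Hypothetical helper names; this is not feedparser's internal API.
# Matches the encoding pseudo-attribute of an XML declaration at the start of the data.
XML_DECL_RE = re.compile(rb'^<\?xml[^>]*?encoding=["\']([\w.-]+)["\']')


def is_utf8(name):
    """Return True if `name` is a known alias of UTF-8 (e.g. 'UTF-8', 'utf8')."""
    if not name:
        return False
    try:
        return codecs.lookup(name).name == "utf-8"
    except LookupError:
        return False


def decode_if_confident(data, http_charset):
    """Decode `data` as UTF-8 with errors='replace' when the metadata says UTF-8.

    Returns None when the metadata is missing or names another encoding,
    so the caller can fall back to its usual detection.
    """
    match = XML_DECL_RE.match(data.lstrip())
    xml_encoding = match.group(1).decode("ascii") if match else None
    # Assumption: require both declarations to agree on UTF-8.
    if is_utf8(http_charset) and is_utf8(xml_encoding):
        # Invalid byte sequences become U+FFFD; valid text is kept as-is.
        return data.decode("utf-8", errors="replace")
    return None
```

With a feed whose body contains a stray `\xe9` byte, a call such as `decode_if_confident(raw_bytes, "utf-8")` would return the text with that byte replaced by U+FFFD instead of switching the whole document to another encoding.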
- I ran into the problem in https://github.com/Rongronggg9/RSS-to-Telegram-Bot/issues/391
- Feed URL: http://iptvin.ru/component/jcomments/?task=rss&object_id=1000707&object_group=com_content&tmpl=component
- Snapshot of the feed: iptvin.xml.gz
- Snapshot of HTTP headers:
Date: Sun, 24 Dec 2023 16:23:48 GMT
Server: Apache/2.0.59 (Win32) PHP/5.1.6
X-Powered-By: PHP/5.1.6
Cache-Control: no-store, no-cache, must-revalidate
Expires: Sun, 24 Dec 2023 16:38:48 GMT
Set-Cookie: REDACTED
P3P: REDACTED
Access-Control-Allow-Origin: *
Transfer-Encoding: chunked
Content-Type: application/rss+xml; charset=utf-8
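For reference, the charset declared in a `Content-Type` header like the one above can be read with the standard library alone. This is a small sketch; `charset_from_content_type` is a hypothetical helper name, not something feedparser provides.

```python
from email.message import Message


def charset_from_content_type(value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


# The header from the snapshot above explicitly declares UTF-8.
print(charset_from_content_type("application/rss+xml; charset=utf-8"))
```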
Please accept this pull request. Everything works as it should with it!