feedparser

encodings: decode utf-8 with errors='replace' when confident

Open Rongronggg9 opened this issue 2 years ago • 2 comments

"Confident" means "metadata of the document explicitly indicates that the encoding is UTF-8".

Background of the patch

When a UTF-8 feed contains a few invalid byte sequences but is otherwise fine, feedparser ends up parsing it as iso-8859-2 (or whichever encoding chardet detects, if installed), even if both the HTTP and XML headers explicitly indicate that its encoding is utf-8.

To handle this better, we should decode such a feed as UTF-8 with errors='replace', so that only the invalid bytes are replaced and the rest of the text is preserved.
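To illustrate the idea, here is a minimal sketch of how errors='replace' behaves on a mostly valid UTF-8 payload. The helper name `decode_declared_utf8` and the sample bytes are made up for this example; this is not feedparser's actual code.

```python
def decode_declared_utf8(data: bytes) -> str:
    """Hypothetical helper: decode as UTF-8, replacing invalid bytes with U+FFFD."""
    return data.decode("utf-8", errors="replace")


# A mostly valid UTF-8 payload with one stray byte (0xFF never occurs in UTF-8).
raw = "<title>Привет</title>".encode("utf-8") + b"\xff" + b"<p>ok</p>"

# Strict decoding rejects the whole payload because of that single byte.
strict_ok = True
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    strict_ok = False

# Lenient decoding keeps the valid text and marks only the bad byte.
text = decode_declared_utf8(raw)
```

With strict decoding the single bad byte makes the whole document fail, which is what pushes the parser toward a different encoding. With errors='replace' the Cyrillic text survives and only the invalid byte becomes a replacement character.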

  • I ran into the problem at https://github.com/Rongronggg9/RSS-to-Telegram-Bot/issues/391
    • Feed URL: http://iptvin.ru/component/jcomments/?task=rss&object_id=1000707&object_group=com_content&tmpl=component
    • Snapshot of the feed: iptvin.xml.gz
    • Snapshot of HTTP headers:
Date: Sun, 24 Dec 2023 16:23:48 GMT
Server: Apache/2.0.59 (Win32) PHP/5.1.6
X-Powered-By: PHP/5.1.6
Cache-Control: no-store, no-cache, must-revalidate
Expires: Sun, 24 Dec 2023 16:38:48 GMT
Set-Cookie: REDACTED
P3P: REDACTED
Access-Control-Allow-Origin: *
Transfer-Encoding: chunked
Content-Type: application/rss+xml; charset=utf-8
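
For reference, a rough sketch of what a "confident" check could look like, using only the standard library. The function names are hypothetical, and treating either the HTTP charset or the XML declaration as sufficient is an assumption of this sketch, not necessarily what the patch does.

```python
import codecs
import re
from email.message import Message

# Matches the encoding attribute of an XML declaration at the start of the data.
XML_DECL_ENCODING = re.compile(
    rb"<\?xml[^>]*?encoding=[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)


def http_charset(content_type: str):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def xml_declared_encoding(data: bytes):
    """Return the encoding named in the XML declaration, or None."""
    match = XML_DECL_ENCODING.match(data[:200])
    return match.group(1).decode("ascii").lower() if match else None


def _is_utf8(name) -> bool:
    """True if name is a known alias of UTF-8 (e.g. 'utf8', 'UTF-8')."""
    if not name:
        return False
    try:
        return codecs.lookup(name).name == "utf-8"
    except LookupError:
        return False


def is_confident_utf8(content_type: str, data: bytes) -> bool:
    """Assumption: either explicit declaration is enough to be 'confident'."""
    return _is_utf8(http_charset(content_type)) or _is_utf8(
        xml_declared_encoding(data)
    )
```

For the headers above, `is_confident_utf8("application/rss+xml; charset=utf-8", data)` would be true regardless of the body, so the feed would be decoded as UTF-8 with errors='replace' instead of being handed to a fallback encoding.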

Rongronggg9, Dec 24 '23 20:12

Please accept this pull request. With it, everything works as it should!

butaford, Jan 23 '24 08:01