warc2zim
Zimit2: do not crash when a rewritten resource (HTML/CSS/JS) has multiple encodings inside
This issue covers the "second part" of #177 where it has already been discussed.
The issue concerns resources which contain multiple encodings (this shouldn't exist ... but we are quite sure it does, even if the Zimit2 test websites never had this issue).
A "handcrafted" sample is a page like https://tmp.kiwix.org/ci/test-website/bad-encoding.html where:
- the declared encoding is wrong (it is not saved in UTF-8 as declared in HTML meta tag but Windows 1252)
- most of the file is in Windows 1252
- two characters (four bytes) are not in Windows 1252 but Chinese GB2312
- browsers display well most of the page content
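To make the sample concrete, here is a minimal sketch that builds a similar byte string and shows how two decoders react. The content (`café`, `中文`) and the helper `try_decode` are illustrative assumptions, not the actual content of bad-encoding.html and not warc2zim code.

```python
# Hypothetical reconstruction of a "bad encoding" page: declared UTF-8,
# mostly Windows-1252, with two GB2312 characters (four bytes) inside.
body = "café ".encode("cp1252") + "中文".encode("gb2312")
html = b'<meta charset="utf-8"><p>' + body + b"</p>"


def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if strict decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# The declared encoding is wrong: the Windows-1252 byte for "é" is not valid UTF-8.
strict_utf8 = try_decode(html, "utf-8")

# Windows-1252 accepts these bytes, but the GB2312 characters come out as mojibake.
as_cp1252 = try_decode(html, "cp1252")

print(strict_utf8)
print(as_cp1252)
```

Neither encoding gives a fully correct result for the whole file, which is why the question is what to do rather than which encoding to pick.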
We could decide to:
- stop/crash the scraper (this is what will happen after #183 is merged)
- transfer the raw content as-is to the ZIM (without any rewriting)
- do our best to decode / rewrite as much as possible
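The three options (numbered 1 to 3 in list order) can be sketched as a decoding policy. This is only an illustration under assumptions: `Policy`, `process` and the placeholder `rewrite` are hypothetical names, the output is assumed to be written back as UTF-8, and none of this is the actual warc2zim implementation.

```python
from enum import Enum


class Policy(Enum):
    CRASH = 1        # option 1: stop the scraper
    RAW = 2          # option 2: store the bytes as-is, no rewriting
    BEST_EFFORT = 3  # option 3: decode/rewrite as much as possible


def rewrite(text: str) -> str:
    """Hypothetical stand-in for the real HTML/CSS/JS rewriting."""
    return text.replace("old.example", "new.example")


def process(data: bytes, encoding: str, policy: Policy) -> bytes:
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        if policy is Policy.CRASH:
            raise
        if policy is Policy.RAW:
            return data
        # Best effort: undecodable bytes become U+FFFD, the rest is kept.
        text = data.decode(encoding, errors="replace")
    # Assumption: rewritten content is always stored as UTF-8.
    return rewrite(text).encode("utf-8")


sample = b'<a href="old.example">caf\xe9</a>'  # 0xE9 is invalid as UTF-8 here
raw_result = process(sample, "utf-8", Policy.RAW)
best_effort_result = process(sample, "utf-8", Policy.BEST_EFFORT).decode("utf-8")
print(best_effort_result)
```

With option 2 the link is left untouched, so the page may still point outside the ZIM; with option 3 the link is rewritten but the undecodable byte is lost.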
If I'm not mistaken, @kelson42 has clearly indicated that only option 3 is acceptable from his PoV while @mgautierfr is more in favor of option 1.
I tend to prefer option 3 but consider this is not the highest priority issue we have on Zimit2, especially since we have not encountered the problem in test recipes.
To be exact, I am in favor of option 3, but only if it produces something usable[*]. If it cannot be done, then option 1.
[*] Defining "usable" is not an easy task:
- A page with Cyrillic content wrongly decoded and written back with a bunch of "garbage" Chinese characters is not usable.
- The same page inside a ZIM that is otherwise correctly encoded/decoded still leaves the ZIM archive itself usable.
- If all pages in the ZIM file are wrongly decoded, the ZIM archive is not usable.
- A js script wrongly encoded/decoded sending stats to the server, we don't care
- A js script wrongly encoded/decoded fetching content and setting up the html, we care.
Given the insight from #221, I seriously doubt anything more is possible.
So far the scraper is not crashing anymore when there are multiple encodings in a single file, especially since https://github.com/openzim/warc2zim/pull/314
We are already close to option 3: only bad characters (in another encoding than the rest of the document) are "replaced" by "something".
I will hence close the issue: we have no lead on how to handle this situation better than today, and there is nothing really annoying in the current behavior. Current experience with warc2zim on https://tmp.kiwix.org/ci/test-website/bad-encoding.html is identical to the one on most browsers.