warc2zim Zimit2: do not crash when a rewritten resource (HTML/CSS/JS) has multiple encodings inside

Open benoit74 opened this issue 1 year ago • 1 comments

This issue covers the "second part" of #177 where it has already been discussed.

The issue concerns resources that contain multiple encodings (this shouldn't exist, but we are quite sure it does, even if the Zimit2 test websites never had this issue).

A "handcrafted" sample is a page like https://tmp.kiwix.org/ci/test-website/bad-encoding.html where:

the declared encoding is wrong (the file is not saved in UTF-8, as declared in the HTML meta tag, but in Windows-1252)
most of the file is in Windows-1252
two characters (four bytes) are not in Windows-1252 but in Chinese GB2312
browsers display most of the page content correctly
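To make the situation above concrete, here is a minimal sketch that builds a comparable byte string in Python. The content is a hypothetical reconstruction, not the actual test file, and `try_decode` is a helper name introduced only for this illustration.

```python
from typing import Optional

# Hypothetical stand-in for the test page: mostly Windows-1252 bytes,
# a meta tag that wrongly declares UTF-8, and two GB2312 characters.
head = '<meta charset="utf-8"><p>caf\u00e9 d\u00e9j\u00e0 vu</p>'.encode("cp1252")
stray = "\u4e2d\u6587".encode("gb2312")  # two Chinese characters, four bytes
page = head + b"<p>" + stray + b"</p>"


def try_decode(data: bytes, encoding: str) -> Optional[str]:
    """Return the strictly decoded text, or None if the bytes are invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Strict decoding with the declared encoding fails outright.
as_utf8 = try_decode(page, "utf-8")
# Windows-1252 accepts every byte here, but the GB2312 bytes become mojibake.
as_cp1252 = try_decode(page, "cp1252")
```

In this sketch no single encoding decodes the whole document faithfully: the declared one rejects it, and the one used for most of the file silently garbles the four stray bytes.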

We could decide to:

1. stop/crash the scraper (this is what will happen after #183 is merged)
2. transfer the raw content as-is to the ZIM (without any rewriting)
3. do our best to decode / rewrite as much as possible
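The three options can be sketched as decoding policies. This is only an illustration under stated assumptions, not warc2zim's implementation: `Policy`, `process`, and the placeholder `rewrite` are hypothetical names, and re-encoding the rewritten output as UTF-8 is an assumption made for simplicity.

```python
from enum import Enum


class Policy(Enum):
    CRASH = 1        # option 1: let the decoding error propagate
    PASSTHROUGH = 2  # option 2: store the raw bytes, no rewriting
    BEST_EFFORT = 3  # option 3: decode what we can, then rewrite


def rewrite(text: str) -> str:
    """Hypothetical stand-in for the real HTML/CSS/JS rewriting step."""
    return text.replace("/old/", "/new/")


def process(data: bytes, declared: str, policy: Policy) -> bytes:
    if policy is Policy.CRASH:
        # Strict decoding raises UnicodeDecodeError on any invalid byte.
        return rewrite(data.decode(declared)).encode("utf-8")
    if policy is Policy.PASSTHROUGH:
        try:
            text = data.decode(declared)
        except UnicodeDecodeError:
            return data  # undecodable: keep the original bytes untouched
        return rewrite(text).encode("utf-8")
    # BEST_EFFORT: undecodable bytes become U+FFFD, the rest is rewritten.
    text = data.decode(declared, errors="replace")
    return rewrite(text).encode("utf-8")


# A link plus one Windows-1252 byte (0xE9) in a document declared as UTF-8.
sample = b'<a href="/old/x">caf\xe9</a>'
best_effort = process(sample, "utf-8", Policy.BEST_EFFORT)
```

With the best-effort policy the link is still rewritten and only the invalid byte is lost to a replacement character, while the pass-through policy keeps every byte but leaves the link unrewritten.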

If I'm not mistaken, @kelson42 has clearly indicated that only option 3 is acceptable from his PoV while @mgautierfr is more in favor of option 1.

I tend to prefer option 3 but consider that this is not the highest-priority issue we have on Zimit2, especially since we have not encountered the problem in test recipes.

Feb 14 '24 12:02 benoit74

To be exact, I am in favor of option 3, but only if it produces something usable[*]. If that cannot be done, then option 1.

[*] Defining "usable" is not an easy task:

A page with Cyrillic content wrongly decoded and written back with a bunch of "garbage" Chinese characters is not usable.
The same page inside a whole ZIM that is otherwise correctly encoded/decoded still leaves the ZIM archive itself usable.
If all pages in the ZIM file are wrongly decoded, the ZIM archive is not usable.
A JS script wrongly encoded/decoded that sends stats to the server: we don't care.
A JS script wrongly encoded/decoded that fetches content and sets up the HTML: we care.

Feb 14 '24 14:02 mgautierfr

Given the insight from #221, I seriously doubt that anything more is possible.

May 17 '24 15:05 benoit74

So far the scraper no longer crashes when there are multiple encodings in a single file, especially since https://github.com/openzim/warc2zim/pull/314

We are already close to option 3: only bad characters (those in a different encoding from the rest of the document) are "replaced" by "something".

I will therefore close the issue: we have no lead on how to handle this situation better than we do today, and nothing is really annoying at the moment. The current experience with warc2zim on https://tmp.kiwix.org/ci/test-website/bad-encoding.html is identical to that in most browsers.

Jul 26 '24 13:07 benoit74
