warc2zim

Zimit2: Allow deduplication of entries

Open benoit74 opened this issue 11 months ago • 3 comments

It looks like Zimcheck is complaining about quality issues in most (all?) Zimit2 files.

It already did so for Zimit1, but maybe it is time to address the problems.

The first obvious problem is that a lot of content is duplicated inside the ZIM because different URLs lead to the same content. I think this could be addressed fairly easily, even though it clearly means additional processing to deduplicate.

{
    "check": "redundant",
    "level": "WARNING",
    "message": "solar.lowtechmagazine.com/2011/01/aerial-ropeways-automatic-cargo-transport-for-a-bargain/images/dithers/aerial-ropeway-colour-2_dithered.png and solar.lowtechmagazine.com/fr/2011/01/aerial-ropeways-automatic-cargo-transport-for-a-bargain/images/dithers/aerial-ropeway-colour-2_dithered.png",
    "path1": "solar.lowtechmagazine.com/2011/01/aerial-ropeways-automatic-cargo-transport-for-a-bargain/images/dithers/aerial-ropeway-colour-2_dithered.png",
    "path2": "solar.lowtechmagazine.com/fr/2011/01/aerial-ropeways-automatic-cargo-transport-for-a-bargain/images/dithers/aerial-ropeway-colour-2_dithered.png"
},
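To make the idea concrete, here is a minimal sketch of content-based deduplication, not warc2zim's or scraperlib's actual code. It hashes each payload and records every later path with identical bytes as an alias of the first one. The function name `plan_entries`, the `(path, content)` input shape, and the `example.org` paths are all hypothetical; how the aliases would then be written into the ZIM is left out.

```python
import hashlib


def plan_entries(items):
    """Split (path, content) pairs into unique entries and aliases.

    Returns (entries, aliases):
    - entries maps path -> content for the first occurrence of each payload
    - aliases maps a duplicate path -> the path that already holds its bytes
    """
    first_path_by_digest = {}
    entries = {}
    aliases = {}
    for path, content in items:
        # Identical bytes give the same digest, whatever the URL was.
        digest = hashlib.sha256(content).hexdigest()
        original = first_path_by_digest.get(digest)
        if original is None:
            first_path_by_digest[digest] = path
            entries[path] = content
        else:
            aliases[path] = original
    return entries, aliases


# Hypothetical input: the same image reachable under two language prefixes.
items = [
    ("example.org/images/photo.png", b"PNG-bytes"),
    ("example.org/fr/images/photo.png", b"PNG-bytes"),
    ("example.org/index.html", b"<html></html>"),
]
entries, aliases = plan_entries(items)
```

The cost is one hash per entry plus a digest table kept in memory for the whole run, which is the "additional processing" mentioned above.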

For a website like solar.lowtechmagazine.com, which is available in multiple languages, it could even make a significant difference in final file size. I am not sure compression manages to cancel out duplicated content like this; at least some people say it is not possible, e.g. https://superuser.com/a/479083.
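As a rough illustration of why compression alone may not help, the sketch below uses Python's `zlib` purely as a stand-in: DEFLATE can only refer back about 32 KiB, so a second copy of a payload that sits further away than that is compressed as if it were new data. As far as I understand, ZIM files use other codecs and compress content cluster by cluster, so this shows the bounded-window principle only, not actual ZIM behaviour. The payload sizes are arbitrary assumptions.

```python
import os
import zlib


def doubling_ratio(size):
    """Compressed size of payload+payload divided by that of payload alone."""
    payload = os.urandom(size)  # random bytes: an incompressible stand-in for an image
    single = len(zlib.compress(payload, 9))
    doubled = len(zlib.compress(payload + payload, 9))
    return doubled / single


# Second copy starts 100,000 bytes after the first: outside the 32 KiB window.
far_ratio = doubling_ratio(100_000)

# Second copy starts 10,000 bytes after the first: inside the window.
near_ratio = doubling_ratio(10_000)
```

`far_ratio` should come out close to 2 (the duplicate costs as much as the original), while `near_ratio` should stay close to 1 (the duplicate is encoded as back-references).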

benoit74, Mar 01 '24 08:03

The new alias feature might help here.

rgaudin, Mar 01 '24 08:03

To me, treating Zimcheck "warnings" is not a priority, particularly at the moment. This should be a feature request IMO, and descoped from the "Zimit2" project.

A solution for this deduplication feature was proposed years ago at the scraperlib level.

kelson42, Mar 06 '24 09:03

Treating all Zimcheck "warnings" is maybe not a priority, but avoiding artificially big ZIMs is worth considering from my PoV. I do not mind if we de-scope this.

I don't know why someone proposed a PR to fix https://github.com/openzim/python-scraperlib/issues/33 but never finished the job!

I'm joking of course, I was probably very tired or angry at someone else that day. I intend to finish this PR to fix this zimit2 issue; it was not that far from being OK.

benoit74, Mar 07 '24 09:03