warc2zim
Zimit2: Allow deduplication of entries
It looks like Zimcheck is complaining about quality issues in most (all?) Zimit2 files.
It already did so for Zimit1, but maybe it is time to address the problems.
The first obvious problem is that lots of content is duplicated inside the ZIM due to different URLs leading to the same content. I think this could be pretty easily addressed (even if it clearly means additional processing to deduplicate).
{
"check": "redundant",
"level": "WARNING",
"message": "solar.lowtechmagazine.com/2011/01/aerial-ropeways-automatic-cargo-transport-for-a-bargain/images/dithers/aerial-ropeway-colour-2_dithered.png and solar.lowtechmagazine.com/fr/2011/01/aerial-ropeways-automatic-cargo-transport-for-a-bargain/images/dithers/aerial-ropeway-colour-2_dithered.png",
"path1": "solar.lowtechmagazine.com/2011/01/aerial-ropeways-automatic-cargo-transport-for-a-bargain/images/dithers/aerial-ropeway-colour-2_dithered.png",
"path2": "solar.lowtechmagazine.com/fr/2011/01/aerial-ropeways-automatic-cargo-transport-for-a-bargain/images/dithers/aerial-ropeway-colour-2_dithered.png"
},
For a website like solar.lowtechmagazine.com, which is available in multiple languages, deduplication could even make a significant difference in final file size. I am not sure whether compression manages to cancel out duplicated content like this; at least some people say it cannot, e.g. https://superuser.com/a/479083.
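To make the compression doubt concrete, here is a minimal sketch using DEFLATE (Python's `zlib`), whose match window is only 32 KiB. It is not how ZIM clusters are compressed, so it only illustrates the general limitation, not actual ZIM behaviour. The helper name `compressed_sizes` and the payload size are assumptions chosen for the demo.

```python
import os
import zlib
from typing import Tuple


def compressed_sizes(payload: bytes) -> Tuple[int, int]:
    """Return the compressed size of payload and of payload repeated twice."""
    single = len(zlib.compress(payload, 9))
    doubled = len(zlib.compress(payload + payload, 9))
    return single, doubled


# Random bytes stand in for an already-compressed image such as a PNG.
payload = os.urandom(256 * 1024)
single, doubled = compressed_sizes(payload)
print(f"one copy: {single} bytes, two copies: {doubled} bytes")
```

The second copy starts 256 KiB after the first, far outside the 32 KiB window, so the compressor cannot refer back to it and the output roughly doubles. Compressors with larger windows may do better, but only when both copies happen to be compressed together.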
The new alias might be of help.
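A rough sketch of what the deduplication could look like, assuming entries are available as `(path, content)` pairs before being written. The function name `plan_deduplication` is hypothetical and this is not the scraperlib or libzim API; it only decides which paths carry real content and which could become aliases of an earlier path.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def plan_deduplication(
    entries: Iterable[Tuple[str, bytes]],
) -> Tuple[Dict[str, bytes], List[Tuple[str, str]]]:
    """Split entries into unique items and (alias_path, target_path) pairs."""
    first_path_by_digest: Dict[str, str] = {}
    unique: Dict[str, bytes] = {}
    aliases: List[Tuple[str, str]] = []
    for path, content in entries:
        # Identical bytes give the same digest, whatever the URL was.
        digest = hashlib.sha256(content).hexdigest()
        target = first_path_by_digest.get(digest)
        if target is None:
            # First time this content is seen: keep it as a real item.
            first_path_by_digest[digest] = path
            unique[path] = content
        else:
            # Same content already stored: record an alias instead.
            aliases.append((path, target))
    return unique, aliases
```

The additional processing mentioned above is essentially one hash per entry plus an in-memory digest table. How the `(alias, target)` pairs are then written into the ZIM is left open here.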
To me, Zimcheck "warnings" are not a priority to address, particularly right now. This should be a feature request IMO, and descoped from the "Zimit2" project.
A solution for this deduplication feature was proposed years ago at the scraperlib level.
Addressing all Zimcheck "warnings" is maybe not a priority, but from my PoV avoiding the creation of artificially big ZIMs is worth considering. I do not mind if we de-scope this.
I don't know why someone proposed a PR to fix https://github.com/openzim/python-scraperlib/issues/33 but never finished the job!
I'm joking of course; I was probably very tired or angry at someone else that day. I intend to finish this PR to fix this Zimit2 issue; it was not that far from being OK.