python-scraperlib Automatically redirect to articles with same checksum

As discussed in https://github.com/openzim/sotoki/pull/162#issuecomment-660452579, it actually seems a bit odd to handle duplicate files in the scrapers. We could instead have a system that keeps a single copy of a resource and creates redirects to it when that resource is duplicated (or fails intelligently so we can handle it).

Jul 18 '20 15:07 satyamtg

To me, for the moment, such a feature belongs in python-scraperlib (or any higher-level library) because:

This is too smart to be done in the libzim
I believe basically the scraper (not the libzim) should be able to do things in a clean manner
I understand that, under certain special conditions, a high-level scraper might be better off relying on a lower-level smart feature like this for a certain range of articles

Jul 18 '20 18:07 kelson42

Also, as discussed with @kelson42, articles have no checksum in the ZIM. I was led to think they did by zimcheck's duplicates output, but it's zimcheck that calculates those checksums.

What we could do is have a helper in scraperlib that calculates checksums, stores them and compares them to adjust behavior (create redirects?). That would be extra and should be enabled only on a subset of articles via some filtering pattern. The main use case would be zimit, where the scraper has no control over the content. In this case, if zimcheck reports duplicates, we could enable this mechanism in the recipe by specifying the filtering patterns.
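A minimal sketch of what such a helper might look like, using only the Python standard library. Everything here is hypothetical and not part of scraperlib: the `DuplicateTracker` class, the `plan_entries` function, the use of glob patterns as the "filtering pattern", and the choice of SHA-256 are all assumptions made for illustration.

```python
import fnmatch
import hashlib
from typing import Dict, Iterable, Iterator, Optional, Tuple


class DuplicateTracker:
    """Hypothetical helper: remembers content digests for filtered paths."""

    def __init__(self, patterns: Iterable[str]):
        # Only paths matching one of these glob patterns are checksummed,
        # so the extra CPU/RAM cost is limited to the selected subset.
        self.patterns = list(patterns)
        self.seen: Dict[str, str] = {}  # digest -> path of the first copy

    def is_tracked(self, path: str) -> bool:
        return any(fnmatch.fnmatchcase(path, pat) for pat in self.patterns)

    def first_copy(self, path: str, content: bytes) -> Optional[str]:
        """Return the path of an earlier identical entry, or None."""
        if not self.is_tracked(path):
            return None
        digest = hashlib.sha256(content).hexdigest()
        # setdefault records `path` when the digest is new,
        # otherwise it returns the path stored for the first copy.
        original = self.seen.setdefault(digest, path)
        return None if original == path else original


def plan_entries(
    entries: Iterable[Tuple[str, bytes]], tracker: DuplicateTracker
) -> Iterator[tuple]:
    """Yield ("add", path, content) or ("redirect", path, target)."""
    for path, content in entries:
        target = tracker.first_copy(path, content)
        if target is None:
            yield ("add", path, content)
        else:
            yield ("redirect", path, target)
```

In this sketch the tracker only reports the first copy; the caller decides what to do with that information (here, emit a redirect). Memory grows by one digest per tracked entry, which is why restricting the patterns matters.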

This feature could have a HUGE impact on resources (CPU, RAM, potentially IO), so its goal will be to clear duplicates only in cases where that cannot be done in the scraper. Non-generic scrapers should take care of duplicates themselves.

Jul 20 '20 10:07 rgaudin

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

Oct 10 '20 03:10 stale[bot]

I will start working on an implementation of this issue and will open a PR once I have something ready to review. I will try to follow the advice given above.

Apr 20 '22 19:04 benoit74

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

Aug 13 '22 10:08 stale[bot]

Maybe we should better use aliases?

Dec 16 '23 17:12 kelson42

> Maybe we should better use aliases?

Doesn't solve anything. We still don't know, before adding the entry, that it's a duplicate. If we did, we'd probably do things differently depending on the scraper: not include the resource, use an alias, or use a redirect.

Dec 16 '23 17:12 rgaudin