
Add failure thresholds for missing links

benoit74 opened this issue 9 months ago · 1 comment

Currently, warc2zim is very permissive regarding issues that may arise while rewriting documents.

This is mostly mandatory due to

  • the nature of websites encountered in the wild, which are not always well written
  • the fact that many URLs have been blocked by the ad-blocker during the crawl (for good reasons, obviously)

However, warc2zim would probably benefit from a threshold mechanism to fail the scraper should issues be too numerous.

For instance, if more than xx% (10? 20?) of the links present on the homepage have failed to be rewritten, then the ZIM is most probably not usable at all. It should probably be feasible to distinguish between missing resources (typically images or JS), which might come from an ad, and missing hyperlink targets (which are rarely from an ad, especially on the home page).
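A minimal sketch of what such a check could look like; the counters, function name, and threshold values below are purely illustrative assumptions, not part of warc2zim's actual code or CLI:

```python
# Hypothetical per-page failure-threshold check (illustrative only).
from dataclasses import dataclass


@dataclass
class RewriteStats:
    """Counters collected while rewriting a single HTML document."""
    hyperlinks_total: int = 0
    hyperlinks_missing: int = 0   # <a href> targets not found in the WARC
    resources_total: int = 0
    resources_missing: int = 0    # images / JS / CSS not found (often ad-blocked)


def should_fail(
    stats: RewriteStats,
    is_homepage: bool,
    link_threshold: float = 0.10,      # 10%? 20%? to be tuned experimentally
    resource_threshold: float = 0.50,  # more lenient: often ad-blocker noise
) -> bool:
    """Return True when the page looks too broken for the ZIM to be usable."""
    if stats.hyperlinks_total:
        # Missing hyperlink targets are rarely ads, so treat them strictly,
        # and even more strictly on the home page than on other HTML pages.
        effective = link_threshold if is_homepage else 2 * link_threshold
        if stats.hyperlinks_missing / stats.hyperlinks_total > effective:
            return True
    if stats.resources_total:
        if stats.resources_missing / stats.resources_total > resource_threshold:
            return True
    return False
```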

The same threshold (or another value, still not clear) can probably be applied to other HTML pages.

Some experimentation is most probably needed to decide on the right thresholds to put in place, but warc2zim would benefit from this mostly naive QA feature, either because it makes no sense to publish a home page with many missing links, or because it is hard to detect that the crawler has been blocked at some point and many subpages are missing (e.g. 80% of the site is missing, but all links on the home page are present because they were crawled first).

benoit74 commented on May 21 '24 20:05