warc2zim

Add support for real fuzzy matching

Open benoit74 opened this issue 9 months ago • 2 comments

This issue is a placeholder for what looks like a potential enhancement warc2zim might need to implement at some point in the future (typically in a 3.x version). It is meant to summarize the current understanding of the situation and to document issues actually encountered in the wild.

Current situation (as of warc2zim 2.0)

Currently, when statically and dynamically rewriting a URL (including when computing the ZIM path of a given WARC record), the scraper applies what are called fuzzy rules. The term "fuzzy rule" comes from the wabac vocabulary.

However, the scraper currently does not really fuzzy match; it only simplifies/transforms the ZIM path (the URL has already been transformed into a ZIM path when fuzzy rules are applied).

Sample rule (in Python):

  {
    "pattern": r".*googlevideo.com/(videoplayback(?=\?)).*[?&](id=[^&]+).*",
    "replace": r"youtube.fuzzy.replayweb.page/\1?\2",
  }

The scraper checks all fuzzy rules in the configured list. The first rule whose pattern matches the ZIM path currently being rewritten is used, and the ZIM path is replaced by the replace expression.
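A minimal sketch of this first-match behaviour, assuming rules are plain dicts with `pattern` and `replace` keys as in the sample rule. The function name `apply_fuzzy_rules` and the use of `re.match` / `re.sub` are illustrative assumptions, not necessarily how warc2zim implements it.

```python
import re

# Hypothetical rule list; its single entry is the sample rule shown above.
FUZZY_RULES = [
    {
        "pattern": r".*googlevideo.com/(videoplayback(?=\?)).*[?&](id=[^&]+).*",
        "replace": r"youtube.fuzzy.replayweb.page/\1?\2",
    },
]


def apply_fuzzy_rules(zim_path: str, rules=FUZZY_RULES) -> str:
    """Return the path rewritten by the first matching rule, or unchanged."""
    for rule in rules:
        if re.match(rule["pattern"], zim_path):
            # Stop at the first matching rule; later rules are never tried.
            return re.sub(rule["pattern"], rule["replace"], zim_path, count=1)
    # No rule matched: the ZIM path is kept as-is.
    return zim_path
```

With this sketch, a path such as `rr1---sn-abc.googlevideo.com/videoplayback?expire=123&id=o-XYZ&itag=18` is reduced to `youtube.fuzzy.replayweb.page/videoplayback?id=o-XYZ`, while a path matching no rule is returned untouched.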

In static URL rewriting (Python), the rewritten ZIM path is then checked for existence within the list of expected ZIM entries; if it is missing, the URL is not rewritten.

In dynamic URL rewriting (JavaScript), we do not have the list of existing ZIM entries and hence always apply the rewriting.
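The difference between the two modes can be sketched as follows. This is written in Python for both cases purely to illustrate the logic (the dynamic side actually runs in JavaScript); the names `fuzzy_path`, `rewrite_static`, `rewrite_dynamic` and `expected_zim_entries` are hypothetical, and returning `None` to signal "do not rewrite" is an assumption of this sketch.

```python
import re
from typing import Optional

# The sample rule from above; everything else here is hypothetical.
RULES = [
    {
        "pattern": r".*googlevideo.com/(videoplayback(?=\?)).*[?&](id=[^&]+).*",
        "replace": r"youtube.fuzzy.replayweb.page/\1?\2",
    },
]


def fuzzy_path(zim_path: str) -> str:
    """Apply the first matching rule, or return the path unchanged."""
    for rule in RULES:
        if re.match(rule["pattern"], zim_path):
            return re.sub(rule["pattern"], rule["replace"], zim_path, count=1)
    return zim_path


def rewrite_static(zim_path: str, expected_zim_entries: set) -> Optional[str]:
    """Static mode: keep the rewrite only if the target entry is expected."""
    candidate = fuzzy_path(zim_path)
    # None means "leave the URL unrewritten" in this sketch.
    return candidate if candidate in expected_zim_entries else None


def rewrite_dynamic(zim_path: str) -> str:
    """Dynamic mode: no entry list is available, so always rewrite."""
    return fuzzy_path(zim_path)
```

The consequence is that a dynamic rewrite can point at a ZIM path that does not exist, which is exactly where the limitations discussed in this issue become visible.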

Limitations

The problem with this approach appears in situations like graceful loading of image resolutions. E.g. the server has 4 image resolutions available: image_thumb.png, image_low.png, image_med.png, image_high.png (this is a simplification / illustration of what is present on YouTube and Vimeo video placeholders, as well as article images on ir.voanews.com). The HTML document contains image_thumb.png. Once the DOM is loaded (typically), JS fires and replaces the image src attribute with the proper image depending on viewport resolution (e.g. image_low.png on mobiles, image_med.png on tablets, image_high.png on desktop).

Since we usually scrape with a single device in Browsertrix crawler, the WARC contains only two images (usually image_thumb.png and image_low.png since we use a mobile device simulation).

But the problem is that when the user reads the ZIM file on a desktop, the JS detects a big viewport and requests image_high.png, which is not in the ZIM.

Making this work is possible only with a hack: introduce two fuzzy rules:

  • one matching exactly (and only) the image always requested (image_thumb.png in our example), so that the scraper does not really rewrite this URL
  • one matching every other image (image_.*.png), to rewrite it to image_full.png for instance

Since the scraper stops at the first matching fuzzy rule, image_thumb.png stays as image_thumb.png, and any other image is rewritten to image_full.png since it did not match the first (thumbnail) rule.
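The hack can be sketched with two ordered rules. The patterns, the `image_full.png` target and the helper `apply_first_match` are hypothetical and written for the simplified image example only; real rules would have to be tailored to each site.

```python
import re

# Hypothetical rules for the simplified example; order matters.
IMAGE_RULES = [
    # 1. Match exactly the always-requested thumbnail and map it to itself,
    #    so that the broader rule below never sees it.
    {"pattern": r"(.*/)image_thumb\.png$", "replace": r"\1image_thumb.png"},
    # 2. Map every other resolution to one canonical ZIM path.
    {"pattern": r"(.*/)image_.*\.png$", "replace": r"\1image_full.png"},
]


def apply_first_match(zim_path: str, rules=IMAGE_RULES) -> str:
    """Apply the first matching rule only, as the scraper does."""
    for rule in rules:
        if re.match(rule["pattern"], zim_path):
            return re.sub(rule["pattern"], rule["replace"], zim_path, count=1)
    return zim_path
```

Because the same rules are applied when computing the ZIM path of a WARC record, the crawled image_low.png is stored under the image_full.png path, and a later request for image_high.png on a desktop resolves to that same entry.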

While quite simple to implement, it comes with some limitations:

  • we cannot apply the hack to situations where the thumb suffix is also dynamic, based on viewport resolution (seen on Vimeo placeholders at least), or at least it would make the fuzzy rule very fragile, since it would depend on which mobile device is used for crawling
  • we still store two images in the ZIM, where probably one would have been sufficient (in an offline setup, there is no big benefit of graceful loading since ZIM is usually served on a local network where bandwidth / latency is not as important as over the Internet)

Conclusion so far

It is now quite clear that the scraper could benefit from "real" fuzzy matching with more advanced matching rules, as expected at the very beginning of warc2zim2. It is also clear that this is not a small feature request.

As mentioned in the introduction, I do not expect anything to be implemented soon on this issue; the plan is rather to continue documenting issues encountered in the wild and hacks implemented to cope with the situation.

benoit74, May 21 '24 19:05