Approaches for implementing Zimit ZIM reader

Open Jaifroid opened this issue 2 years ago • 8 comments

(This should probably be a discussion, but I don't think the Discussions tab is enabled on this Repo.)

This is a followup of discussion elsewhere, to leave a record of what is entailed in a regex-based approach that does not use the Zimit-provided Service Worker. The approach is heuristic, and has been implemented experimentally in Kiwix JS Windows/Linux. This implementation is done entirely with regular expressions, and some supporting logic in the backend.

To see the regex transformations I implemented, look at https://github.com/kiwix/kiwix-js-windows/blob/master/www/js/lib/transformZimit.js .

Some of those transformations set flags that are acted on elsewhere in the app. They extend the dirEntry object with a dirEntry.inspect property (an instruction to the app to extract redirect information from the page loaded by the dirEntry), a dirEntry.nullify property (to prevent this dirEntry from loading), and a dirEntry.zimitRedirect property that instructs the app to redirect to another dirEntry (by path). This is different from a classic ZIM redirection dirEntry (which redirects by pointer). I'm sure there are other ways to do this, but as we don't use libzim yet in KJS or KJSW, this was an easy way for me to send instructions to the backend.
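
A minimal sketch of how a backend might act on those three flags. The property names come from the description above; the function name `resolveDirEntryAction`, the returned object shape, and the order in which the flags are checked are assumptions for illustration, not the code in Kiwix JS Windows/Linux.

```javascript
// Hypothetical dispatcher: decides what to do with a dirEntry based on the
// flags set by the transformation step. The check order is an assumption.
function resolveDirEntryAction(dirEntry) {
  if (dirEntry.nullify) {
    // Flagged so that this entry is never loaded
    return { action: 'skip' };
  }
  if (dirEntry.zimitRedirect) {
    // Redirect by path (unlike a classic ZIM redirect, which uses a pointer)
    return { action: 'redirect', path: dirEntry.zimitRedirect };
  }
  if (dirEntry.inspect) {
    // Load the page and extract redirect information from its content
    return { action: 'inspect' };
  }
  return { action: 'load' };
}

console.log(resolveDirEntryAction({ zimitRedirect: 'A/example.com/doc.html' }));
```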

The grunt work is done by the transformReplayUrls() function. This is intended as a filter for the data from textual documents (HTML, CSS, JS) that contain URLs to ZIM assets. The data are filtered before being sent to the browser. The function uses the dirEntry.mimetype to select the transformation types required. Some, like transformation of links in CSS and JS, need to be run also on data with the html mimetype because they can contain embedded JS and embedded CSS.
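
To illustrate the mimetype-based selection, here is a rough sketch. The helper name `selectTransforms` and the labels it returns are hypothetical, and the real transformReplayUrls() may be organized quite differently. The point is only that HTML gets the CSS and JS transformations as well, because it can embed both.

```javascript
// Hypothetical helper: map a dirEntry mimetype to the list of transformation
// families that should be run on its data.
function selectTransforms(mimetype) {
  // Drop parameters such as "; charset=utf-8" and normalize case
  const type = (mimetype || '').split(';')[0].trim().toLowerCase();
  if (type === 'text/html') {
    // HTML can contain embedded <style> and <script>, so run all three
    return ['html', 'css', 'js'];
  }
  if (type === 'text/css') {
    return ['css'];
  }
  if (type.includes('javascript')) {
    return ['js'];
  }
  // Binary or unknown types pass through untouched
  return [];
}

console.log(selectTransforms('text/html; charset=utf-8'));
```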

The basic transformation principle is actually really simple:

  • https://example.com/doc.html should be transformed to A/example.com/doc.html (this may need to be prefixed by an absolute URL to the app's base directory) - NB don't hardcode the A namespace, as it depends on ZIM type (a hypothetical future Zimit ZIM will use C instead);
  • links like //youtube.com/videoid (i.e., links beginning with //), are treated in the same way, as if they were https://... and transformed;
  • root-relative links like /local-directory/doc.html may need massaging to add a ZIM prefix;
  • other relative links should look after themselves.
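
The rules in the list above can be sketched with two small regex helpers. These are illustrative only and are not the expressions used in transformZimit.js. The function names, the `basePrefix` parameter, and the assumption that root-relative links belong to the host of the page being transformed are all mine. Being heuristic, the patterns only fire when the URL directly follows a quote, an opening parenthesis, or an equals sign.

```javascript
// Hypothetical helpers, not the code in transformZimit.js.
// `namespace` is passed in rather than hardcoded ('A' or 'C' depending on ZIM type).
// `basePrefix` is an optional absolute URL to the app's base directory.

// Rewrites https://host/path, http://host/path and //host/path
function rewriteAbsoluteUrls(text, namespace, basePrefix = '') {
  const absolute = /(["'(=])(?:https?:)?\/\/([^\s"'()<>]+)/g;
  return text.replace(absolute, (match, lead, rest) =>
    lead + basePrefix + namespace + '/' + rest);
}

// Rewrites /path (but not //host/path), assuming the link belongs to `host`
function rewriteRootRelativeUrls(text, namespace, host, basePrefix = '') {
  const rootRelative = /(["'(=])\/(?!\/)([^\s"'()<>]+)/g;
  return text.replace(rootRelative, (match, lead, rest) =>
    lead + basePrefix + namespace + '/' + host + '/' + rest);
}

const sample = '<a href="https://example.com/doc.html">doc</a>' +
  '<img src="img/logo.png">';
// The absolute link is rewritten; the relative image link is left alone
console.log(rewriteAbsoluteUrls(sample, 'A'));
```

Root-relative rewriting has to run on the original text or after the absolute pass, never on a prefix that itself starts with a slash, otherwise already-transformed links would be matched again.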

There are of course edge cases, like the one mentioned in previous discussion of finding links in JavaScript like https:\/\/example.com\/doc.html.
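
For the escaped-slash case, a sketch along these lines might work. The helper name `rewriteEscapedUrls` is hypothetical, and it assumes the URL sits inside a quoted JavaScript string and that the escaped form should be preserved in the output.

```javascript
// Hypothetical helper: rewrite URLs written as https:\/\/host\/path inside
// quoted strings, keeping the slashes escaped in the result.
function rewriteEscapedUrls(text, namespace) {
  // In the regex literal, \\\/ matches a backslash followed by a slash
  const escaped = /(["'])(?:https?:)?\\\/\\\/([^\s"']+)/g;
  return text.replace(escaped, (match, quote, rest) =>
    quote + namespace + '\\/' + rest);
}

// The source text here is: var u = "https:\/\/example.com\/doc.html";
const js = 'var u = "https:\\/\\/example.com\\/doc.html";';
console.log(rewriteEscapedUrls(js, 'A'));
```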

There are some ZIM-specific transformations you may see there. These can be ignored, and were part of early experiments. Most can be removed.

In some cases I have to remove analytics scripts that block page loading for 10-20 seconds until they fail. They shouldn't really be in the ZIM IMHO, but they are, and they try to phone home 😮. Having a CSP or other method to block external requests is essential with these ZIMs.
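
As a rough illustration of both ideas, the sketch below strips external script tags whose src matches a blocklist and injects a Content-Security-Policy meta tag into the document head. The function names, the blocklist entries, and the policy string are assumptions chosen for the example; a real app would need to tune the policy to whatever scheme it serves content from.

```javascript
// Illustrative blocklist only; not the list used in any Kiwix app
const BLOCKED_HOSTS = ['google-analytics.com', 'googletagmanager.com'];

// Hypothetical helper: remove <script src="..."> tags pointing at blocked hosts
function stripBlockedScripts(html, blockedHosts = BLOCKED_HOSTS) {
  const scriptTag = /<script\b[^>]*\bsrc\s*=\s*["']([^"']+)["'][^>]*>[\s\S]*?<\/script>/gi;
  return html.replace(scriptTag, (tag, src) =>
    blockedHosts.some((host) => src.includes(host)) ? '' : tag);
}

// Hypothetical helper: add a restrictive CSP as the first child of <head>.
// The policy below is an example that disallows requests to other origins.
function addContentSecurityPolicy(html) {
  const policy = "default-src 'self' data: blob: 'unsafe-inline' 'unsafe-eval'";
  const meta = '<meta http-equiv="Content-Security-Policy" content="' + policy + '">';
  return html.replace(/<head\b[^>]*>/i, (head) => head + meta);
}

const page = '<html><head><title>t</title></head><body>' +
  '<script src="https://www.google-analytics.com/analytics.js"></script>' +
  '<script src="app.js"></script></body></html>';
console.log(addContentSecurityPolicy(stripBlockedScripts(page)));
```

The blocklist avoids the 10-20 second stall for known offenders, while the CSP acts as a backstop for anything the blocklist misses.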

There are cases in which we should, in principle, check the request header for a URL in the H/ namespace, but I have yet to find a case where this yields genuinely useful information that is not otherwise obvious (fuzzy URLs will probably be one of those cases). There are cases of requests for image URLs returning a "Not found" HTML page, that need to be recognized, and not accidentally treated as image data. There are cases of "Moved permanently" responses to all kinds of assets. Here the header could be useful, but the info is in the response anyway.
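
One possible way to recognize an HTML error page that arrives in place of image data is to sniff the first bytes of the response before treating it as an image. This is a sketch under assumptions: the helper name `looksLikeHtml` is hypothetical, the data is assumed to be available as a Uint8Array, and only the two most common HTML openings are checked.

```javascript
// Hypothetical helper: returns true if the payload starts like an HTML
// document rather than binary image data.
function looksLikeHtml(bytes) {
  // Decode only the first 256 bytes; invalid sequences become U+FFFD
  const head = new TextDecoder('utf-8')
    .decode(bytes.slice(0, 256))
    .trimStart()
    .toLowerCase();
  return head.startsWith('<!doctype html') || head.startsWith('<html');
}

const notFoundPage = new TextEncoder().encode(
  '<!DOCTYPE html><html><body>Not found</body></html>');
const pngSignature = new Uint8Array([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
console.log(looksLikeHtml(notFoundPage), looksLikeHtml(pngSignature));
```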

The main advantage of this approach is that we can do away with the need for Service Workers. Also, regexes are insanely fast, low-level things, much faster than trawling the DOM with JavaScript.

The main disadvantage is that using heuristic pattern-matching will never catch absolutely everything out there on the Web. Most cases are predictable, but how could we ever transform a link constructed entirely in code, e.g., one that gets the protocol from the current page (though that would be poor coding)? On the other hand, I cannot see how a Service Worker implementation could ever catch such things either, not without help from a Web View that can trap those requests and redirect them or prefix them.

Reminder: Service Workers can only be installed per-domain and with a specific scope of that domain, and they have to be installed over https: served from the domain. A Zimit ZIM can contain absolute links to more than one domain. We don't have the luxury of faking these in KJS (no Web View), so we have to transform them before the browser gets to interpret them.

Jaifroid avatar May 31 '22 15:05 Jaifroid

Thank you @Jaifroid for this summary. Very useful. I am curious yet doubtful about the result we can achieve with this approach.

I see you are relying on the content of zimit webpages to extract some information. This is not an API obviously so make sure to subscribe to warc2zim as this may break without notice.

rgaudin avatar Jun 01 '22 09:06 rgaudin

I'm aware that in some cases I should be using the H/ header instead of relying on information in the response. I intend to implement that if the regex approach isn't superseded by a better one.

Jaifroid avatar Jun 01 '22 10:06 Jaifroid

@rgaudin wrote:

I am curious yet doubtful about the result we can achieve with this approach.

I'd call it pretty useable, but not perfect. Here are several of the ZIMs in https://download.kiwix.org/zim/zimit/ all running fine, if not necessarily with all features of the UI, but no blockers.

image

I would very much like to try to find a way to run the Replay system. It would be cleaner, and maintenance would be someone else's problem. 😉

Jaifroid avatar Jun 01 '22 23:06 Jaifroid

@ikreymer Are you able to answer or give pointers for a few questions about how the wabac.js project manages URLs? It would be hugely helpful to me and others (@mgautierfr, @mossroy) given that documentation of the wabac.js API is not yet available. I have tried looking at the source code, but it is split into many different modules and I don't know where to begin with it (I did find where the Service Worker handles Fetch).

My questions are:

  • How do you trap and transform (in JavaScript) absolute URLs to different scraped domains (https://example.com/, https://youtu.be) that are in a page extracted from the archive (where the domain is one of those included in the archive)?
  • Since a Service Worker can only trap requests from the domain in which it is running, do you transform absolute links on a page (and in assets?) before you present a page to the browser (insert it in iframe)? I see reference to a prefix in the Service Worker.
  • Assuming you do transform links in code before a page is presented to the browser, is there an API for this that we could use in our apps? If not, where can I find the code that does this (i.e., where/how is the prefix added)?

I'm sorry if the answers are easily available in the source code. I did try, but my experience is limited. As you know, we have difficulties with running a Service Worker when one is already installed in the scope of the iframe that displays pages, so we're looking to break down the process into different parts (transformations necessary so that a Service Worker can trap the correct Fetch requests, and then transformations that can be done after a Service Worker has intercepted a request).

Thanks in advance for any hints you can kindly provide. The Replay project is very inspiring, and is a natural fit for Kiwix.

Jaifroid avatar Jun 02 '22 11:06 Jaifroid

https://github.com/webrecorder/wabac.js/blob/c099c9f62bd168c85f9cf618a0f7602177bfcb29/src/collection.js#L364 might be of interest

rgaudin avatar Jun 02 '22 13:06 rgaudin

  • How do you trap and transform (in JavaScript) absolute URLs to different scraped domains (https://example.com/, https://youtu.be) that are in a page extracted from the archive (where the domain is one of those included in the archive)?

Yes, all URLs are rewritten to point to the prefix. In addition, JS is injected to emulate the original environment of the page.

  • Since a Service Worker can only trap requests from the domain in which it is running, do you transform absolute links on a page (and in assets?) before you present a page to the browser (insert it in iframe)? I see reference to a prefix in the Service Worker.

Yes, URLs found in standard HTML elements are rewritten before the page is served to the browser.

  • Assuming you do transform links in code before a page is presented to the browser, is there an API for this that we could use in our apps? If not, where can I find the code that does this (i.e., where/how is the prefix added)?

The Rewriter object can be used directly, https://github.com/webrecorder/wabac.js/blob/main/src/rewrite/index.js#L135, though unfortunately we don't have enough documentation yet on how to use it. It can produce a rewritten response for HTML, JS, or CSS that is ready to be served to the browser.

The wombat.js library is also injected into the page to emulate the JS environment.

While you're right that simple regex rewriting might work for simple static pages, there will be many edge-cases where this doesn't work, and it'll need to be maintained and updated (constantly).

It seems like the main issue is that you have an existing service worker and that conflicts with the built-in service worker added to the ZIMs, right? It should actually be possible to load a service worker in another service worker through importScripts() and call its functions directly, though wabac.js doesn't yet support this use case. Ideally, the SW would not be part of the ZIM, but part of the viewer. However, the current approach was taken to support existing Kiwix viewers with minimal modifications.

This probably isn't what you want to hear, but to me this is sort of going in circles. We have a replay system that's designed to load web archives (WACZ) directly in a browser using a browser-based viewer (wabac.js / replayweb.page). (We've been able to load WACZ files as big as 1.3TB thus far.) To support a non-browser-based viewer, these are converted to ZIM with the special warc2zim process. Now, you're trying to replay these again in the browser, using the original approach (service worker-based replay), using parts of the original system (wabac.js). IMO this conversion back and forth will lead to loss of fidelity and even more maintenance difficulty, and beyond simple static pages, there will certainly be issues.

If you want to run a crawl with zimit, and then replay it in the browser, we can probably just add an option to generate the equivalent WACZ file and it should just work -- maybe that should be an option in zimit since it's already available. We haven't figured out the best interface between ZIM and WACZ yet; there's probably more that can be done there to make it as simple as possible.

ikreymer avatar Jun 02 '22 14:06 ikreymer

Thank you very much, @ikreymer. I'll digest what you've written!

I understand about going in circles, but the other technologies are not ready yet, and I guess it's a bigger decision to make whether to adopt the WACZ format inside a ZIM. There are lots of considerations, as it would impact on many assumptions we currently make about ZIM format in readers. A big example here is searching for articles or assets. We would have to have a clear API that would allow us to leverage Replay search from the ZIM reader's UI.

I'm more and more convinced we can get the existing Replay system running. Currently sw.js loads as a Web Worker when we try to access a Zimit ZIM, but it won't load as a Service Worker. It should be easy to hand off requests for assets in the Zimit ZIM to the sw.js Web Worker if we can access all needed transformations via its API.

Jaifroid avatar Jun 02 '22 15:06 Jaifroid

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stale[bot] avatar Aug 13 '22 10:08 stale[bot]