warc2zim
Dynamic URL rewriting: rewrite only when ZIM path exists
In dynamic URL rewriting (in JS with wombat), all URLs are rewritten.
This is a fair assumption because in most cases the associated resources have also been fetched automatically at crawl time.
It also makes things much simpler, since we do not need to pass the list of existing ZIM entries to JS.
The question is: do we want to rewrite only URLs whose ZIM entry exists?
While this would mean fewer 404s client-side, it comes with a big drawback: the browser would silently start fetching resources online whenever a URL is left unrewritten because its ZIM entry is missing. The user would not even realize it and might, for instance, incur data costs. It would also make testing a warc2zim ZIM even harder, because the ZIM might look like it is working while in fact some resources have been fetched online and will hence not be available to all our users.
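To make the trade-off concrete, here is a minimal sketch of the two strategies. It is not wombat's or warc2zim's actual code: `rewriteUrl`, `toZimPath`, the `RewriteMode` values, the `/content/` prefix and the `host + path + query` ZIM path scheme are all hypothetical names and assumptions chosen for illustration.

```typescript
// Hypothetical sketch: none of these names come from wombat or warc2zim.
type RewriteMode = "always" | "only-existing";

// Assumed ZIM path scheme, for illustration only: "<host><pathname><search>".
function toZimPath(url: URL): string {
  return url.host + url.pathname + url.search;
}

function rewriteUrl(
  original: string,
  mode: RewriteMode,
  zimEntries: ReadonlySet<string>, // would have to be shipped to the JS side
  zimPrefix = "/content/", // assumed prefix under which ZIM entries are served
): string {
  const zimPath = toZimPath(new URL(original));
  if (mode === "only-existing" && !zimEntries.has(zimPath)) {
    // Left untouched: the browser will request this URL from the live web.
    return original;
  }
  // Rewritten: resolves inside the ZIM, or gives a 404 if the entry is missing.
  return zimPrefix + zimPath;
}

const entries = new Set(["example.com/app.js"]);
const present = "https://example.com/app.js";
const missing = "https://example.com/late-loaded.png";

// Current behaviour: every URL is rewritten, present in the ZIM or not.
const alwaysPresent = rewriteUrl(present, "always", entries);
const alwaysMissing = rewriteUrl(missing, "always", entries);

// Proposed behaviour: a missing entry keeps its original, online URL.
const existingPresent = rewriteUrl(present, "only-existing", entries);
const existingMissing = rewriteUrl(missing, "only-existing", entries);
```

In this sketch, the `"always"` mode turns a missing resource into a visible 404 inside the ZIM, while the `"only-existing"` mode returns the original URL unchanged, which is exactly the silent online fetch described above. The `"only-existing"` mode also needs the `zimEntries` set on the client, which is the extra complexity the current approach avoids.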