warc2zim
Websites manipulating already rewritten URLs that need fuzzy rules are not working
The scenario has been encountered on https://ir.voanews.com, see https://github.com/openzim/zim-requests/issues/833#issuecomment-2203635680
The scenario is as follows:
- we want to rewrite image URLs with fuzzy rules so that they can adapt to various screen sizes (the image resolution is embedded inside the URL)
- the original page HTML contains a "default" image URL in the <img src=...>
- this image URL is hence statically rewritten (in Python) and fuzzified before the HTML is pushed to the ZIM
- when the HTML is loaded, some JS code manipulates the "default" image URL (now a fuzzified relative path to a ZIM entry) to select the proper resolution
- the modified URL then goes through dynamic rewriting (in JS); the JS code detects that the URL has already been rewritten and does not rewrite it again; the fuzzy rule is hence not applied to this modified URL and the item is not found in the ZIM
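The failure described in the steps above can be reproduced with a minimal Python sketch. Everything in it is an assumption made for illustration: the fuzzy rule (`FUZZY_RULE`, `FUZZY_TARGET`) is guessed from the example URLs in this issue and is not the actual rule, the "already rewritten" check is reduced to "the URL is relative", and `dynamic_rewrite` is a hypothetical stand-in for the JS rewriter, not warc2zim code.

```python
import posixpath
import re

# ZIM path of the HTML page (hostname + path of the page URL).
PAGE_PATH = (
    "ir.voanews.com/a/"
    "iran-elections-opposition-dissidents-figures-boycott-call/7681344.html"
)
# Entry assumed to be stored in the ZIM (taken from the desired URL in this issue).
ZIM_ENTRIES = {
    "gdb.voanews.fuzzy.replayweb.page/"
    "01000000-0aff-0242-ce72-08dc9778f46b_high.jpg"
}

# Hypothetical fuzzy rule, guessed from the example; not the real rule.
FUZZY_RULE = re.compile(r"^https?://gdb\.voanews\.com/(.+)_w\d+_r\d+_s\.jpg$")
FUZZY_TARGET = r"gdb.voanews.fuzzy.replayweb.page/\1_high.jpg"


def dynamic_rewrite(url: str) -> str:
    """Return the ZIM path that would be requested for `url` from PAGE_PATH."""
    if "://" not in url:
        # Assumed detection: a relative URL has already been rewritten,
        # so it is only resolved against the page and never fuzzified.
        base = posixpath.dirname(PAGE_PATH)
        return posixpath.normpath(posixpath.join(base, url))
    # Absolute URL: apply the fuzzy rule, then drop the scheme.
    return re.sub(r"^https?://", "", FUZZY_RULE.sub(FUZZY_TARGET, url))


original = (
    "https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w1023_r1_s.jpg"
)
manipulated = (
    "../../../gdb.voanews.fuzzy.replayweb.page/"
    "01000000-0aff-0242-ce72-08dc9778f46b_w1023_r1_s.jpg"
)
print(dynamic_rewrite(original) in ZIM_ENTRIES)     # True
print(dynamic_rewrite(manipulated) in ZIM_ENTRIES)  # False
```

Under these assumptions, the same image is found when requested through its original absolute URL, but not when the JS builds the request from the already rewritten path.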
Not dynamically rewriting a URL which has already been rewritten statically is intentional: rewriting it a second time would need at least some special handling to avoid problems, and a second rewrite is usually not needed.
Developing special handling for already rewritten URLs is not possible (yet) because it requires reversing the whole rewriting logic. Reversing the path and querystring manipulation is probably feasible (but complex), but we might also need to reverse the fuzzy rule, and this is not possible yet because fuzzification is a one-way "reduction" operation in most cases.
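To illustrate why a fuzzy rule is generally a one-way reduction, here is a small Python sketch. The rule (`RESOLUTION_SUFFIX`, replaced by `_high.jpg`) is hypothetical, guessed from the example URLs in this issue rather than copied from the real rule set.

```python
import re

# Hypothetical fuzzy rule: collapse every resolution variant into one name.
RESOLUTION_SUFFIX = re.compile(r"_w\d+_r\d+_s\.jpg$")


def fuzzify(path: str) -> str:
    """Replace the resolution-specific suffix with a single fixed suffix."""
    return RESOLUTION_SUFFIX.sub("_high.jpg", path)


small = fuzzify("01000000-0aff-0242-ce72-08dc9778f46b_w250_r1_s.jpg")
large = fuzzify("01000000-0aff-0242-ce72-08dc9778f46b_w1023_r1_s.jpg")
print(small == large)  # True
```

Both inputs reduce to the same fuzzified path, so the original width cannot be recovered from the output alone.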
Since the URL has been manipulated by the JS anyway, just reversing the hostname change induced by the fuzzification may be enough in most cases; at least it would be enough in the current situation.
Example:
- HTML page URL:
https://ir.voanews.com/a/iran-elections-opposition-dissidents-figures-boycott-call/7681344.html
- Original image URL:
https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w250_r1_s.jpg
- Rewritten (and hence fuzzified) URL:
../../../gdb.voanews.fuzzy.replayweb.page/01000000-0aff-0242-ce72-08dc9778f46b_w250_r1_s.jp
- URL after JS manipulation to fetch proper resolution:
../../../gdb.voanews.fuzzy.replayweb.page/01000000-0aff-0242-ce72-08dc9778f46b_w1023_r1_s.jpg
- URL as rewritten today: unchanged, since we correctly detect that the URL has already been rewritten
- URL we would like to get:
./../../gdb.voanews.fuzzy.replayweb.page/01000000-0aff-0242-ce72-08dc9778f46b_high.jpg
- Reversed original image URL we need to build for this to work:
https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w1023_r1_s.jpg
(here we see that, in this specific case, just reversing the path manipulation and the hostname change would be enough; this is definitely not true for all fuzzy rules / website manipulations, however)
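The idea in this example (reverse the path manipulation and the hostname change, then rewrite again with fuzzy rules) could look roughly like the Python sketch below. It is not an implementation proposal: the `FUZZY_HOSTS` table, the fuzzy rule, and the `unrewrite` / `rewrite` helpers are all hypothetical names, the rule is guessed from the URLs above, and the scheme is assumed to be `https` since it is lost during rewriting.

```python
import posixpath
import re

PAGE_PATH = (
    "ir.voanews.com/a/"
    "iran-elections-opposition-dissidents-figures-boycott-call/7681344.html"
)
# Hypothetical table mapping fuzzified hostnames back to original ones.
FUZZY_HOSTS = {"gdb.voanews.fuzzy.replayweb.page": "gdb.voanews.com"}
# Hypothetical fuzzy rule, guessed from the example; not the real rule.
FUZZY_RULE = re.compile(r"^https?://gdb\.voanews\.com/(.+)_w\d+_r\d+_s\.jpg$")
FUZZY_TARGET = r"gdb.voanews.fuzzy.replayweb.page/\1_high.jpg"


def unrewrite(relative_url: str, page_path: str) -> str:
    """Rebuild an absolute URL from an already rewritten relative URL."""
    base = posixpath.dirname(page_path)
    # Reverse the path manipulation: resolve against the page's ZIM path.
    zim_path = posixpath.normpath(posixpath.join(base, relative_url))
    host, _, rest = zim_path.partition("/")
    # Reverse the hostname change; the scheme is an assumption.
    return f"https://{FUZZY_HOSTS.get(host, host)}/{rest}"


def rewrite(url: str, page_path: str) -> str:
    """Apply the fuzzy rule, then make the result relative to the page."""
    zim_path = re.sub(r"^https?://", "", FUZZY_RULE.sub(FUZZY_TARGET, url))
    page_dir = posixpath.dirname(page_path)
    return posixpath.relpath("/" + zim_path, "/" + page_dir)


manipulated = (
    "../../../gdb.voanews.fuzzy.replayweb.page/"
    "01000000-0aff-0242-ce72-08dc9778f46b_w1023_r1_s.jpg"
)
reversed_url = unrewrite(manipulated, PAGE_PATH)
print(reversed_url)
print(rewrite(reversed_url, PAGE_PATH))
```

With these assumptions, the first line printed is the reversed original image URL and the second is a relative path ending in `_high.jpg`. A rule whose fuzzification also alters the path or query in a lossy way before the JS manipulation would not be recoverable this way.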