Incorrect relative URLs on top-level landing pages

Open Jaifroid opened this issue 1 year ago • 5 comments

At least in the Type 1 internet-encyclopedia-philosophy_en_all_2022-07.zim (currently in dev directory), the image on the landing page has a URL of ../wp-content/media/mill.jpg. The landing page's URL is C/A/iep.utm.edu/. Logically, therefore, the relative URL points to the image being at C/A/wp-content/media/mill.jpg. However this is of course wrong, as the image is in fact located at C/A/iep.utm.edu/wp-content/media/mill.jpg.
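To make the mismatch concrete, here is a minimal sketch of standard relative-URL resolution applied to the paths quoted above. The `http://zim.invalid/` origin and the `resolve_in_zim` helper are placeholders invented for this illustration; they are not part of Zimit or any Kiwix reader.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical placeholder origin, only needed so urljoin has an absolute base.
BASE = "http://zim.invalid/"


def resolve_in_zim(page_path: str, relative_url: str) -> str:
    """Resolve relative_url against page_path the way a browser would."""
    absolute = urljoin(BASE + page_path, relative_url)
    # Drop the placeholder origin again and return a ZIM-style path.
    return urlsplit(absolute).path.lstrip("/")


landing = "C/A/iep.utm.edu/"
resolved = resolve_in_zim(landing, "../wp-content/media/mill.jpg")
actual = "C/A/iep.utm.edu/wp-content/media/mill.jpg"

print(resolved)            # where the stored relative URL points
print(resolved == actual)  # whether that matches the real image location
```

The `../` walks up out of the `iep.utm.edu/` directory, so the resolved path loses the domain segment and no longer matches where the image is stored.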

I have also seen this referencing error on the landing page of the Type 0 military medicine ZIM. So I think the issue is in fact with landing pages, or with the root directory of a Zimit stored domain, and not with Type 0 and Type 1.

Kiwix Desktop is able to locate the image on the landing page of both the IEP ZIM and the military medicine ZIM. Perhaps it has a feature that prevents a relative link from walking higher up the tree than the domain part of the WARC-style URL? It's still strange that the URLs are technically incorrect, given that (in my experience) this does not occur on pages other than the landing page. I guess I need to work around this?

Jaifroid avatar Aug 14 '22 08:08 Jaifroid

OK, I had forgotten that I had in fact already worked around this issue for Type 0 ZIMs, but never reported it. I have code that checks whether a page whose links begin with `..` is located at the top level of the domain represented in the WARC-style ZIM, and removes the `../` if so. Because of the extra `/A/` that Type 1 ZIMs have as a prefix to the URL, this detection failed: the code believed that `A/iep.utm.edu/` does not represent a top-level URL. This is easily fixed.
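For anyone else hitting this in a reader, the kind of check described above could be sketched roughly as follows. This is not the actual workaround code: the function names, the regular expression, and the assumption that the only prefixes to skip are an optional `C/` followed by an optional `A/` are all hypothetical.

```python
import re

# Assumption: a path may start with an optional "C/" namespace and an
# optional "A/" prefix before the domain segment (hypothetical pattern).
_PREFIX = re.compile(r"^(?:C/)?(?:A/)?")


def is_domain_root(page_path: str) -> bool:
    """Return True if page_path is only 'domain/' after the assumed prefixes."""
    rest = _PREFIX.sub("", page_path, count=1)
    # A top-level page is a single non-empty segment with an optional slash.
    return re.fullmatch(r"[^/]+/?", rest) is not None


def fix_relative_url(page_path: str, relative_url: str) -> str:
    """Strip leading '../' from links found on a domain's top-level page."""
    if is_domain_root(page_path):
        while relative_url.startswith("../"):
            relative_url = relative_url[3:]
    return relative_url


print(fix_relative_url("A/iep.utm.edu/", "../wp-content/media/mill.jpg"))
print(fix_relative_url("A/iep.utm.edu/articles/", "../wp-content/media/mill.jpg"))
```

Links on deeper pages are left untouched, since there a leading `../` is presumably legitimate; only the domain root gets the stripped form, which then resolves inside `iep.utm.edu/`.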

Nevertheless, I do think there is an issue in the way these URLs have been stored in the ZIM -- even though the Replay system is a bit of a black box (albeit with source code), URLs should still be stored in an internally consistent way in the ZIM, IMHO!

Jaifroid avatar Aug 14 '22 09:08 Jaifroid

@Jaifroid Indeed, the most important thing is to report the bug!

kelson42 avatar Aug 14 '22 09:08 kelson42

@Jaifroid I don't understand why kiwix-desktop is part of the ticket, considering that kiwix-desktop has no support for SW-based ZIM files!

kelson42 avatar Aug 14 '22 12:08 kelson42

@kelson42 I was using the indirect support, i.e. when you allow Kiwix Desktop to serve the decoded ZIM files directly to the browser. I was a bit imprecise: I guess it's the server feature of Kiwix Desktop, rather than Kiwix Desktop itself.

Jaifroid avatar Aug 14 '22 14:08 Jaifroid

I believe this can be handled either in the replay/reader (what you did, but then that needs to be properly documented) or at the crawling level. Would you have a URL that we could test with the WARC toolchain (so small or easily scopable), and maybe discuss having it handled in the crawler?

rgaudin avatar Aug 15 '22 08:08 rgaudin