zimit
Incorrect relative URLs on top-level landing pages
At least in the Type 1 internet-encyclopedia-philosophy_en_all_2022-07.zim (currently in dev directory), the image on the landing page has a URL of ../wp-content/media/mill.jpg. The landing page's URL is C/A/iep.utm.edu/. Logically, therefore, the relative URL points to the image being at C/A/wp-content/media/mill.jpg. However, this is of course wrong, as the image is in fact located at C/A/iep.utm.edu/wp-content/media/mill.jpg.
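To make the resolution concrete, here is a minimal sketch in Python that resolves the relative URL against the landing page's URL using plain POSIX path rules. It is only an illustration of the arithmetic, not code from Zimit or any Kiwix reader, and the helper name `resolve_relative` is hypothetical.

```python
import posixpath

def resolve_relative(page_url: str, rel_url: str) -> str:
    """Resolve rel_url against the directory that contains page_url."""
    # A URL ending in "/" is itself the directory; otherwise use its parent.
    base_dir = page_url if page_url.endswith("/") else posixpath.dirname(page_url)
    # normpath collapses the ".." segment against the preceding directory.
    return posixpath.normpath(posixpath.join(base_dir, rel_url))

landing_page = "C/A/iep.utm.edu/"
image_ref = "../wp-content/media/mill.jpg"

resolved = resolve_relative(landing_page, image_ref)
print(resolved)  # C/A/wp-content/media/mill.jpg
```

The result drops the iep.utm.edu segment, which is exactly the mismatch with the image's real location at C/A/iep.utm.edu/wp-content/media/mill.jpg.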
I have also seen this referencing error on the landing page of the Type 0 military medicine ZIM. So I think the issue is in fact with landing pages, or with the root directory of a Zimit stored domain, and not with Type 0 and Type 1.
Kiwix Desktop is able to locate the image on the landing page of the IEP ZIM and the military medicine ZIM. Perhaps it has a feature that prevents a relative link from walking higher up the tree than the domain part of the WARC-style URL? It's still strange that the URLs are technically incorrect, given that this does not occur on pages other than the landing page (in my experience). I guess I need to work around this?
OK, I had in fact already worked around this issue for Type 0 ZIMs, but had forgotten to report it. I have code that checks whether a URL beginning with .. is located at the top level of the domain represented in the WARC-style ZIM, and removes the ../ if so. Because of the extra /A/ that Type 1 ZIMs have as a prefix to the URL, this detection failed: the code believes that A/iep.utm.edu/ does not represent a top-level URL. This is easily fixed.
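A rough sketch of that kind of workaround, written in Python for illustration only (it is not the actual reader code). The function names and the namespace pattern are assumptions: the pattern strips an optional C/ and an optional A/ prefix, which are the only prefixes mentioned in this thread, before testing whether the page sits at the domain root.

```python
import re

# Assumption: only an optional "C/" namespace and the extra "A/" prefix of
# Type 1 ZIMs need stripping before the domain part of the URL begins.
_NAMESPACE_PREFIX = re.compile(r"^(?:C/)?(?:A/)?")

def is_domain_root(page_url: str) -> bool:
    """Return True if page_url is at the top level of the archived domain."""
    remainder = _NAMESPACE_PREFIX.sub("", page_url, count=1)
    # "iep.utm.edu/" or "iep.utm.edu/index.html" has at most one "/" after
    # the domain; anything deeper is not a top-level page.
    return len(remainder.split("/")) <= 2

def fix_relative_url(page_url: str, rel_url: str) -> str:
    """Drop leading "../" segments that would climb above the domain root."""
    if is_domain_root(page_url):
        while rel_url.startswith("../"):
            rel_url = rel_url[3:]
    return rel_url

print(fix_relative_url("C/A/iep.utm.edu/", "../wp-content/media/mill.jpg"))
```

With the prefix stripped first, both the Type 0 and the Type 1 form of the landing page URL are recognised as top-level, so the image reference becomes wp-content/media/mill.jpg relative to the domain directory. Links on deeper pages are left untouched.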
Nevertheless, I do think there is an issue in the way these URLs have been stored in the ZIM -- even though the Replay system is a bit of a black box (albeit with source code), URLs should still be stored in an internally consistent way in the ZIM, IMHO!
@Jaifroid Indeed, the most important thing is to report the bug!
@Jaifroid I don't understand why kiwix-desktop is part of the ticket, considering that kiwix-desktop has no support for SW-based ZIM files!
@kelson42 I was using the indirect support, i.e. when you allow Kiwix Desktop to serve the decoded ZIM files directly to the browser. I was a bit imprecise, I guess it's the server feature of Kiwix Desktop, rather than Kiwix Desktop itself.
I believe this can be handled either in the replay/reader (what you did, but then that needs to be properly documented) or at the crawling level. Would you have a URL that we could test with the WARC toolchain (so small or easily scopable), and maybe discuss having it handled in the crawler?