Badly rewritten URLs in case of "/" in the article title
The forward slash in the URL is causing problems in Kiwix JS (see https://github.com/kiwix/kiwix-js/issues/494). I wonder if it might be causing problems for you too. For example, the article Singapore references its assets like this:
<link href="../-/s/css_modules/ext.kartographer.link.css" …>
Whereas the article Singapore/Riverside references its assets like this:
<link href="../../-/s/css_modules/ext.kartographer.link.css" …>
Note the extra ../../, because the browser thinks it's accessing an article Riverside in the subdirectory Singapore. This would all be fine, except that the hrefs in the Singapore/Riverside article are written like this:
<p>The <b>Singapore River</b> forms a central artery in <a href="Singapore"
title="Singapore">Singapore</a>'s densely packed Central Business District.
Note that to access the hyperlink on this page from a browser, we would need to write that link as <a href="../Singapore" ...>.
So there seems to be some inconsistency. Are we inside a subdirectory from the browser's perspective, or not? The subroutine that writes the location of the assets seems to think so, while the one that writes the location of the hyperlinks doesn't! Do you think this is an mwoffliner issue?
I reopen the ticket as this is still not done properly. Same example with Wikivoyage Singapore/Riverside article. The link to Suntec City and the Esplanade is wrong. It is ../../A/../Singapore%2FMarina_Bay and it should be ../../A/Singapore/Marina_Bay (So two errors). @ISNIT0 Please create a proper unit test to test URL writing completely so we can be sure that we don't do any regression on that anymore in the future. Please also put me in reviewer of the PR, so I can have a look in details on the solution.
Yea, this is some code I've tried not to touch so far - it's very complicated and old :(
I'd like to refactor once we have good tests for it
Just tried again and CSS is not loaded properly in Wikivoyage "EN Paris/2nd arrondissement" article
Seems to work now... not sure what happend here.
No, it's not, I have created a new ZIM file with git master head of wikinews FR, and here is what zimcheck reports (we have broken internal links):
$ zimcheck out/wikinews_fr_all_2019-07.zim
[INFO] Checking zim file out/wikinews_fr_all_2019-07.zim
[INFO] Verifying Internal Checksum..
[INFO] Internal checksum found correct
[INFO] Searching for metadata entries..
[INFO] Searching for Favicon..
[INFO] Searching for main page..
[INFO] Verifying Articles' content..
[INFO] Searching for redundant articles..
Verifying Similar Articles for redundancies..
[ERROR] Invalid internal links found :
A/Belgique_:_Belgacom/Évènements_du_10_août_2014 (%C3%89v%C3%A8nements_du_10_ao%C3%BBt_2014) was not found in article A/Belgique_:_Belgacom/swing.be,_le_médiateur_des_télécoms_s'interroge_et_déplore_l'attitude_de_Belgacom.
A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_1_:_Les_résultats_de_la_vingt-septième_journée (Championnat_de_France_2007%2F2008_de_Ligue_1_%3A_Les_r%C3%A9sultats_de_la_vingt-septi%C3%A8me_journ%C3%A9e#cite_ref-1) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_1_:_Les_résultats_de_la_vingt-septième_journée
A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_1_:_Lyon_vainqueur_à_Toulouse_et_Lille_à_Marseille (Championnat_de_France_2007%2F2008_de_Ligue_1_%3A_Lyon_vainqueur_%C3%A0_Toulouse_et_Lille_%C3%A0_Marseille#cite_ref-2) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_1_:_Lyon_vainqueur_à_Toulouse_et_Lille_à_Marseille
A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_1_:_le_suspense_continue (Championnat_de_France_2007%2F2008_de_Ligue_1_%3A_le_suspense_continue#cite_ref-euro_2-0) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_1_:_le_suspense_continue
A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_1_:_les_résultats_de_la_12ème_journée (Championnat_de_France_2007%2F2008_de_Ligue_1_%3A_les_r%C3%A9sultats_de_la_12%C3%A8me_journ%C3%A9e#cite_ref-1) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_1_:_les_résultats_de_la_12ème_journée
A/Championnat_de_France_2007/Football_:_décès_d'Antonio_Puerta (Football_%3A_d%C3%A9c%C3%A8s_d'Antonio_Puerta) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_1_:_les_résultats_de_la_6ème_journée_(suite)
A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_1_:_les_résultats_de_la_trente-cinquième_journée (Championnat_de_France_2007%2F2008_de_Ligue_1_%3A_les_r%C3%A9sultats_de_la_trente-cinqui%C3%A8me_journ%C3%A9e#cite_ref-2) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_1_:_les_résultats_de_la_trente-cinquième_journée
A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_1_:_les_résultats_de_la_trente-quatrième_journée (Championnat_de_France_2007%2F2008_de_Ligue_1_%3A_les_r%C3%A9sultats_de_la_trente-quatri%C3%A8me_journ%C3%A9e#cite_ref-2) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_1_:_les_résultats_de_la_trente-quatrième_journée
A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_1_:_les_résultats_de_la_trente-troisième_journée (Championnat_de_France_2007%2F2008_de_Ligue_1_%3A_les_r%C3%A9sultats_de_la_trente-troisi%C3%A8me_journ%C3%A9e#cite_ref-2) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_1_:_les_résultats_de_la_trente-troisième_journée
A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_2_:_Le_Havre_AC_en_Ligue_1_la_saison_prochaine (Championnat_de_France_2007%2F2008_de_Ligue_2_%3A_Le_Havre_AC_en_Ligue_1_la_saison_prochaine#cite_ref-l1_1-0) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_2_:_Le_Havre_AC_en_Ligue_1_la_saison_prochaine
A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_2_:_Match_nul_entre_Sedan_et_Bastia (Championnat_de_France_2007%2F2008_de_Ligue_2_%3A_Match_nul_entre_Sedan_et_Bastia#cite_ref-nat_2-0) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_2_:_Match_nul_entre_Sedan_et_Bastia
A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_2_:_Une_saison_se_termine (Championnat_de_France_2007%2F2008_de_Ligue_2_%3A_Une_saison_se_termine#cite_ref-1) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_2_:_Une_saison_se_termine
A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_2_:_le_FC_Nantes_en_Ligue_1_la_saison_prochaine (Championnat_de_France_2007%2F2008_de_Ligue_2_%3A_le_FC_Nantes_en_Ligue_1_la_saison_prochaine#cite_ref-nat_2-0) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_2_:_le_FC_Nantes_en_Ligue_1_la_saison_prochaine
A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_2_:_les_résultats_de_la_trente-quatrième_journée (Championnat_de_France_2007%2F2008_de_Ligue_2_%3A_les_r%C3%A9sultats_de_la_trente-quatri%C3%A8me_journ%C3%A9e#cite_ref-nat_2-0) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_2_:_les_résultats_de_la_trente-quatrième_journée
A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_2_:_match_nul_entre_Angers_et_Nantes (Championnat_de_France_2007%2F2008_de_Ligue_2_%3A_match_nul_entre_Angers_et_Nantes#cite_ref-nat_2-0) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_2_:_match_nul_entre_Angers_et_Nantes
A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_2_:_score_vierge_entre_Grenoble_et_Ajaccio (Championnat_de_France_2007%2F2008_de_Ligue_2_%3A_score_vierge_entre_Grenoble_et_Ajaccio#cite_ref-1) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_2_:_score_vierge_entre_Grenoble_et_Ajaccio
I/m/Flag_of_Scotland_(traditional).svg.png (../I/m/Flag_of_Scotland_(traditional).svg.png) was not found in article A/Championnat_de_Ligue_celtique_2014-2015_:_les_résultats_de_la_dix-neuvième_journée
A/Congo/Évènements_du_23_avril_2019 (%C3%89v%C3%A8nements_du_23_avril_2019) was not found in article A/Congo/Rwanda_:_une_quarantaine_de_corps_retrouvés_après_naufrage_sur_lac_Kivu
A/De_l'alcool_et_du_sucre_dans_la_comète_C/Évènements_du_1er_novembre_2015 (%C3%89v%C3%A8nements_du_1er_novembre_2015) was not found in article A/De_l'alcool_et_du_sucre_dans_la_comète_C/2014_Q2_(Lovejoy)
A/Découverte_d'oxygène_sur_la_comète_67P/Évènements_du_31_octobre_2015 (%C3%89v%C3%A8nements_du_31_octobre_2015) was not found in article A/Découverte_d'oxygène_sur_la_comète_67P/Tchourioumov-Guérassimenko
A/Europe_de_l'Ouest_:_vigilance_orange_neige/Évènements_du_9_février_2013 (%C3%89v%C3%A8nements_du_9_f%C3%A9vrier_2013) was not found in article A/Europe_de_l'Ouest_:_vigilance_orange_neige/verglas_en_France,_en_Belgique_et_au_Luxembourg
A/France/Évènements_du_15_avril_2017 (%C3%89v%C3%A8nements_du_15_avril_2017) was not found in article A/France/Suisse_:_mort_de_l'auteur_des_tirs_de_1982_contre_la_centrale_de_Creys-Malville
A/Les_États-Unis_font_usage_d'une_GBU-43/Évènements_du_12_avril_2017 (%C3%89v%C3%A8nements_du_12_avril_2017) was not found in article A/Les_États-Unis_font_usage_d'une_GBU-43/B_pour_la_première_fois_en_Afghanistan
A/Passage_de_la_comète_C/Évènements_du_12_mars_2013 (%C3%89v%C3%A8nements_du_12_mars_2013) was not found in article A/Passage_de_la_comète_C/2011_L4_(PANSTARRS)
[ERROR] Invalid external links found :
https://commons.wikimedia.org/w/api.php?action=timedtext&title=File%3APresident_Obama_on_Death_of_Osama_bin_Laden.ogv&lang=cs&trackformat=srt&origin=%2Ais an external dependence in article A/2011_sur_Wikinews
https://commons.wikimedia.org/w/api.php?action=timedtext&title=File%3ABangladesh_building_collapse_-_WN.ogv&lang=bn&trackformat=srt&origin=%2Ais an external dependence in article A/Bangladesh_:_effondrement_mortel_d'un_immeuble
https://commons.wikimedia.org/w/api.php?action=timedtext&title=File%3ABangladesh_building_collapse_-_WN.ogv&lang=bn&trackformat=srt&origin=%2Ais an external dependence in article A/Bangladesh_:_émeutes_et_revendications_après_l'effondrement_mortel_d'un_immeuble
https://commons.wikimedia.org/w/api.php?action=timedtext&title=File%3AWhite_House_Spokesman_Spicer_Holds_News_Conference.webm&lang=en&trackformat=srt&origin=%2Ais an external dependence in article A/Le_porte-parole_de_la_Maison_Blanche_attaque_les_médias_lors_de_sa_première_allocution
https://commons.wikimedia.org/w/api.php?action=timedtext&title=File%3APresident_Obama_on_Death_of_Osama_bin_Laden.ogv&lang=cs&trackformat=srt&origin=%2Ais an external dependence in article A/Oussama_ben_Laden_tué_lors_d'un_raid_près_d'Islamabad
https://commons.wikimedia.org/w/api.php?action=timedtext&title=File%3AWalesCalltoAction.ogv&lang=ca&trackformat=srt&origin=%2Ais an external dependence in article A/Wikipédia_célèbre_son_dixième_anniversaire_en_ligne
https://commons.wikimedia.org/w/api.php?action=timedtext&title=File%3APresident_Obama_speaks_on_attacks_in_Boston_%282013-04-16%29.ogv&lang=en&trackformat=srt&origin=%2Ais an external dependence in article A/États-Unis_:_3_morts_et_plus_de_170_blessés_dans_l'attentat_de_Boston
https://commons.wikimedia.org/w/api.php?action=timedtext&title=File%3APresident_Obama_on_Death_of_Osama_bin_Laden.ogv&lang=cs&trackformat=srt&origin=%2Ais an external dependence in article A/États-Unis_:_Obama_demande_une_enquête_sur_Ben_Laden
https://commons.wikimedia.org/w/api.php?action=timedtext&title=File%3APresident_Obama_speaks_on_explosions_in_Boston_%282013-04-15%29.ogv&lang=en&trackformat=srt&origin=%2Ais an external dependence in article A/États-Unis_:_attentats_à_Boston
[INFO] Overall Test Status: Fail
[INFO] Total time taken by zimcheck: 272 seconds.
If I looks the HTML, I see this kind of link in
href="../../A/Championnat_de_France_2007%2F2008_de_Ligue_2_%3A_Une_saison_se_termine#cite_note-1"
``
The URL should not be encoded... and in fact it should be only `href="#cite_note-1"`.
@kelson42 where can I get wikinews_fr_all_2019-07.zim for testing? It's not on the Kiwix Download server.
@Jaifroid I have built it with mwoffliner.
Maybe it would be helpful if I try to document clearly the problem, and what policy decisions need to be taken to solve it. I will use wikivoyage_en_all_novid_2019-06.zim and its landing page as the example here, because it is available on the server, and because the landing page is clearly affected by this issue. TLDR: skip to the Conclusions section below, and consult Evidence if you need clarification, though I do think the evidence is important.
Evidence
- The online URL of the Wikivoyage landing page we use is: https://en.wikivoyage.org/wiki/Wikivoyage:Offline_reader_Expedition/Home_page . To the browser, this looks like the path is
wiki/Wikivoyage:Offline_reader_Expedition/and the requested document isHome_page. However, the title of this page isWikivoyage:Offline reader Expedition/Home page – Travel guide at Wikivoyage, i.e. the page title has a forward slash in it. - The directory entry corresponding to this landing page in
wikivoyage_en_all_novid_2019-06.zimlooks like the image below. Please note that both the title and url fields are the same, and that they both contain a non-percent-encoded forward slash beforeHome_page:
- Links on this page (in our ZIM) are written like this:
<a title="Europe" href="Europe">Europe</a>. From the browser's perspective, this is incorrect. The href should be../Europe. - According to https://github.com/kiwix/apple/issues/157, the iOS app is crashing when clicking on one of these links. This is undoubtedly because it cannot find the dirEntry due to the missing
../; - The Kiwix Desktop Beta app, however, does follow these links BUT this app does not deal correctly with ZIMs that actually have subdirectories: see https://github.com/kiwix/kiwix-desktop/issues/206 .
- The fix submitted previously in this PR was supposed to correct this issue, though we don't yet have a ZIM produced by this fix to test.
- @kelson42 has produced a ZIM with this fix, but zimcheck finds errors in it. We need to know how zimcheck deals with forward slashes in URLs, however.
Conclusions
- A policy decision needs to be taken as to whether pages with forward slashes in their URLs are genuinely to be read by the browser as a path, or whether the last forward slash e.g.
Expedition/Home_pageshould be percent-encoded. Note that on Wikivoyage (online) it is not percent-encoded. There are lots of other examples: e.g.Singapore/Riverside,Singapore/Marina_Bay. - If the last slash should be percent-encoded, it needs to be percent-encoded by mwoffliner, not by the reader. The reader has no way of knowing whether it is in a subdirectory or not, and shouldn't need extra logic to try and guess.
- The iOS and Desktop apps are using different strategies for working out the path from the dirEntry.url. IMHO, the iOS app is correct (though it shouldn't crash), and the Desktop app is wrong, given that the last
/is not percent-encoded in dirEntry.url. What strategy is zimcheck using?. Is it reading the files like the iOS app, or like the Desktop app? - Once a clear policy decision has been taken on this, it needs to be implemented here and the respective devs of the different clients need to be informed.
It was decided a while ago by @kelson42 (not sure where), that we should not encode the /. If the / is encoded, we have a bug.
The links within articles should be aware of the /, and use ../ if necessary.
Zim files have no notion of a 'directory', and the / is in-fact part of the article ID.
Directories only exist when doing a zimdump of the contents, and the / ids are turned into directory structures.
There are url tests here: https://github.com/openzim/mwoffliner/blob/master/test/unit/urlRewriting.test.ts
If you believe we're missing a test, please let me know :)
It seems from @kelson42's example that some things are breaking (perhaps it's that there are many /s in one article id)?
My gut feeling is that for every link from an article with a / in the id, we should ../ back to the root of the zim file:
A link from this article
A/this/article/has/many/slashes.html
to
A/this/article/has/article.html
should become:
../../../../A/this/article/has/article.html
@ISNIT0 , thank you for the explanation. This is what I thought, but discussion in another thread regarding percent-encoding the forward slash had caused me to doubt if this was the agreed position. I do agree with the consistent approach you outline here. It means that as soon as a new Wikivoyage is made with a version of mwoffliner that includes the fix in this PR, @mossroy and I should be able to merge our PR that implements the correct reader behaviour in Kiwix JS. It should be noted that the iOS app is in fact correct in not being able to find the links on the landing page of wikivoyage_en_all_novid_2019-06.zim (made prior to the fix in this PR), and that the Desktop Beta app will need to be fixed to follow this consistent approach with forward slashes.
@kelson42 : shouldn't wikivoyage_en_all_novid_2019-06.zim be removed from https://download.kiwix.org/zim/wikivoyage/ , as the main page has buggy links? (or more generally the ZIM files affected by this issue in their main page)
Although most of the content is available, it's very frustrating that the first click on a link of the main page fails.
@kelson42 : shouldn't
wikivoyage_en_all_novid_2019-06.zimbe removed from https://download.kiwix.org/zim/wikivoyage/ , as the main page has buggy links? (or more generally the ZIM files affected by this issue in their main page) Although most of the content is available, it's very frustrating that the first click on a link of the main page fails.
@mossroy I have uploaded new versions of these ZIM files. Please let me know if this is better.
The links in the main page of wikivoyage_en_all_novid_2019-07.zim are working well on the version of kiwix-js we're about to release, where they were failing in the main page of wikivoyage_en_all_novid_2019-06.zim
So yes, it's really better!
Thank you @kelson42 . I can also confirm that the relative links correctly lead to the appropriate articles, both from the landing page (which has a forward slash in its URL) and from the page Singapore/Riverside. These pages work fine in our code at Kiwix JS.
While these relative liniks are perfectly correct, they are not quite as economical as they could be. It's not a problem, but maybe worth noting. Relative links on the landing page look like this:
<a title="Europe" href="../../A/Europe">Europe</a>
It would be sufficient for them to be coded like this:
<a title="Europe" href="../Europe">Europe</a>
In other words, the relative links take us two levels up and then back down into the current namespace, which isn't necessary. However, if it's a "safe" way to produce correct relative links, this is great. Thank you!
While these relative liniks are perfectly correct, they are not quite as economical as they could be. It's not a problem, but maybe worth noting.
@Jaifroid Thank you very much about this comment. Surprisingly I had overseen that. Agree with your comment, and in fact I think this is maybe worse than what you think. I have open a ticket which re-interpret your comment a bit here https://github.com/openzim/mwoffliner/issues/904. Lets see what we can do about that.
@kelson42 All of the broken links/articles in your example above seem to actually be links to the Spécial:Recherche namespace. I'm not really sure what the expected behaviour here is.
@ISNIT0 I don't really understand how this error A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_2_:_Une_saison_se_termine (Championnat_de_France_2007%2F2008_de_Ligue_2_%3A_Une_saison_se_termine#cite_ref-1) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_2_:_Une_saison_se_termine is related to Spécial:Recherche. Could please explain it to me?
The only links on this page: https://fr.wikipedia.org/wiki/Championnat_de_France_2007/2008_de_Ligue_2_:_Une_saison_se_termine Are special in some way
@kelson42 I think we'll need to talk about this one
@ISNIT0 anytime :)
I am unable to reproduce the issue with article list scrapes, so I'm doing a full scrape of fr.wikinews. Here's hoping it wont take too long :)
@ISNIT0 That won't help at all.
@kelson42 I was able to reproduce this with master by doing a full fr.wikinews scrape, but now I'm unable to reproduce. I believe this is fixed. Will close for now.
I'm afraid this issue has recurred in wikipedia_en_medicine_maxi_2019-08.zim. This is a very recent ZIM, so I assume it should have fixes from this issue? If that's not the case, then please ignore, and I'll check again next month.
Issue: Hyperlinks on the landing page of this wikimed ZIM are missing a ../. Links to stylesheets and other assets are correct, but not the hyperlinks.
The URL of the page is Wikipedia:WikiProject_Medicine/Open_Textbook_of_Medicine2 (note the forward slash). Hyperlinks to articles on this page are written like this:
<a href="Book%3AChildren's_health" title="Book:Children's health">...</a>
As written, this hyperlink should lead to a page Wikipedia%3AWikiProject_Medicine/Book%3AChildren's_health, which is clearly wrong. Our client (Kiwix JS) throws an error (correctly).
Links to assets on this page are correctly written:
<img src="../../I/m/MedLogoNoWiFi.png" …>
@Jaifroid I confirm :(((((
@ISNIT0 Please re-implement this URL rewritting properly like suggested in #904 and extend the automated tests that we never have to reopen again this ticket.
@Jaifroid As far as I can tell, this is a different instance of a very general ticket. I've proposed a fix (#954 ) for this specific problem, which is that the slash re-writing wasn't being applied properly for desktop scrapes.
@kelson42 #904 is not related to @Jaifroid 's bug (and the fix in #954)
Thanks @ISNIT0. When you say "desktop scrapes", do you mean ZIMs with "desktop style" or something else? FWIW, this ZIM appears to have a mobile style. It has the "minerva" mobile style.
@jaifroid sorry, I was unclear - the main page is always scraped as a desktop version, so it gets treated differently to the individual articles
Still not working fine and obviously not automated tested properly http://library.kiwix.org/wikivoyage_en_all_maxi_2019-10/A/Osaka%2FNorth (link comes from http://library.kiwix.org/wikivoyage_en_all_maxi_2019-10/A/Osaka/Kita) :(