mwoffliner
Wiktionary/Wikivoyage zim databases lag website by five months
I have a Wiktionary Zim file from December 2020, which I downloaded using the kiwix-desktop GUI (2020-12-10; "Pictures, Fulltext index"; 5.65 GB).
This works great for me but I'm not sure how to figure out which Wiktionary it is based on.
It lacks changes to Wiktionary made in August 2020, although it contains changes from May 2020.
Where can I find out which Wiktionary dump a Zim file is based on, and how do I find a Zim file which is based on a current version of Wiktionary?
(And where should I submit this issue?)
Are you talking about Wiktionary in English? Which content exactly is missing (two screenshots would be helpful)?
Yes English.
Here is an example of a diff from August which is missing from the December 2020 Kiwix Wiktionary Zim file. I just picked it at random, so far the December Zim file seems to be missing everything since around June or so.
https://en.wiktionary.org/w/index.php?title=rocker&diff=prev&oldid=60027083
Someone added a sense to "rocker", number 4 here:
Here's the Kiwix screenshot where you can see that it's missing:
I guess the answer to my other question is that there is no reason for the Zim file to be out of date then? Certainly as a software developer I would expect the Zim file to have embedded in it a date corresponding to when it was compiled, so that this kind of ad-hoc testing would not be necessary. Or does it get updated one word at a time, so different dictionary entries are out of date by different amounts? But in that case I would expect each entry to come with a timestamp...
@archenemies I will have a look (and move the ticket), but it looks like a problem with a root cause in the Wikimedia infrastructure.
@archenemies BTW, the revision ID, like the revision date, is available in the upstream link in the footer of each article.
That's interesting about the upstream link in the footer. Well, "rocker" has the wrong link:
https://en.wiktionary.org/wiki/?title=rocker&oldid=61038509
because it points to a revision from 4 November 2020 with the "breve below" sense #4 filled in, but the page that Kiwix serves me lacks that sense.
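For anyone who wants to repeat this check on other articles, here is a minimal sketch of pulling the revision ID out of the footer's upstream link. It assumes the link always carries the revision as an `oldid` query parameter, as in the "rocker" link above; the function name `revision_id_from_footer_link` is made up for illustration and is not part of Kiwix or mwoffliner.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def revision_id_from_footer_link(url: str) -> Optional[int]:
    """Return the revision ID found in an upstream footer link, or None.

    Assumption: the link stores the revision in an `oldid` query parameter,
    e.g. https://en.wiktionary.org/wiki/?title=rocker&oldid=61038509
    """
    query = parse_qs(urlparse(url).query)
    values = query.get("oldid")
    # Reject a missing or non-numeric value instead of raising.
    if not values or not values[0].isdigit():
        return None
    return int(values[0])


if __name__ == "__main__":
    link = "https://en.wiktionary.org/wiki/?title=rocker&oldid=61038509"
    print(revision_id_from_footer_link(link))
```

The resulting ID can then be looked up in the article history on the live wiki to see which revision the footer claims the scraped page came from.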
It looks like a bug in the Wikimedia REST API, because it simply does not deliver the latest version (as you reported). See: https://en.wiktionary.org/api/rest_v1/page/mobile-sections/rocker. This is the root of the bug.
On the mwoffliner side, there is a weakness: we don't request a specific revision ID, but just take the latest. If we retrieved https://en.wiktionary.org/api/rest_v1/page/mobile-sections/rocker/61774146, then we would get the proper content.
I will do what is necessary on both sides to improve the situation.
A bug ticket has been opened upstream at https://phabricator.wikimedia.org/T274359
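To illustrate the revision-pinning idea described above, here is a minimal sketch that builds either the "latest" URL or the revision-specific URL. It is written in Python for brevity and is not mwoffliner's actual code; the function name `mobile_sections_url` and its parameters are hypothetical, and the URL layout is assumed from the two example URLs quoted in this comment.

```python
from typing import Optional
from urllib.parse import quote


def mobile_sections_url(host: str, title: str, revision: Optional[int] = None) -> str:
    """Build a REST API mobile-sections URL, optionally pinned to a revision.

    Assumption: the layout follows the URLs quoted in this thread,
    https://<host>/api/rest_v1/page/mobile-sections/<title>[/<revision>]
    """
    # Titles use underscores instead of spaces; percent-encode everything else.
    encoded_title = quote(title.replace(" ", "_"), safe="")
    base = f"https://{host}/api/rest_v1/page/mobile-sections/{encoded_title}"
    # Without a revision, the server decides what "latest" means.
    return base if revision is None else f"{base}/{revision}"


if __name__ == "__main__":
    print(mobile_sections_url("en.wiktionary.org", "rocker"))
    print(mobile_sections_url("en.wiktionary.org", "rocker", 61774146))
```

Pinning the revision would make the scraped content and the footer link refer to the same revision by construction, at the cost of first having to look up the current revision ID for every article.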
@MananJethwani Here again this is "complicated" to change due to the architecture.
@kelson42 Thank you so much for tracking that down and re-reporting the bug
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
Just to track and keep this issue fresh, it is still impossible to open the article "Cambridge" from the 2021-09 English Wikivoyage, ostensibly due to this bug. (Cambridge is a major tourist destination pre- and post-pandemic, so it is quite a serious upstream bug!)
See also https://phabricator.wikimedia.org/T226931. There seems to be some momentum these days to fix it upstream...
"Cambridge" still inaccessible in the December Wikivoyage in English... The lag hasn't caught up yet...
@kelson42 "Cambridge (England)" article is STILL inaccessible in wikivoyage_en_all_maxi_2022-05.zim. Can there really be a 15-month time lag? No movement on the Phabricator ticket since December 2021.
Is there any conceivable workaround we can introduce on our side?
@Jaifroid Funny that you come back to this ticket today: 2 days ago something happened related to this bug... it might be fixed in the near future at Wikimedia... but it is too early to be sure.
It's because I'm preparing the next Wikivoyage release, so I checked again. Thanks for the update! Fingers crossed...
Just to add some more data points, and something very strange which suggests the issue is not only with the API:
I made major updates to the article https://en.wikivoyage.org/wiki/Santa_Cruz_de_Mompox on 15th April. On 22nd April, a new Wikivoyage ZIM was created that did not contain these updates. I subsequently checked the REST API version https://en.wikivoyage.org/api/rest_v1/page/html/Santa_Cruz_de_Mompox, and it was the updated version.
I fully expected the May Wikivoyage, published on 4th May, to reflect these updates. However, it still shows the out-of-date version. Curiously, the footer points to https://en.wikivoyage.org/wiki/?title=Santa_Cruz_de_Mompox&oldid=4430914 , which is the updated version last edited 17th April, but the actual text scraped is a previous version, prior to my changes.
How can this be? Why does the footer point to a recent update, but the text does not correspond to that update?
Looks like this bug has finally been fixed upstream. Hopefully the next generation of ZIM files will be OK.
@kelson42 This is good news! I'll attempt to corroborate with the next iteration.
@kelson42 Unfortunately, it is still not possible to open the article "Cambridge (England)" in wikivoyage_en_all_maxi_2022-06.zim, which was scraped in the last couple of days. Also, the article "Santa Cruz de Mompox" is still the old version prior to the major edits I myself made on 15th April this year. Additionally, the footer of our scraped page directs to an updated version of the article (in history) which is NOT the one contained in our scrape.
Compare:
- https://library.kiwix.org/wikivoyage_en_all_maxi_2022-06/A/Santa_Cruz_de_Mompox (June 2022 scrape)
- https://en.wikivoyage.org/wiki/Santa_Cruz_de_Mompox (online)
You will notice missing images in our scrape, and the lede is completely different. HOWEVER, if I click on the footer link in our scrape, it points to https://en.wikivoyage.org/wiki/?title=Santa_Cruz_de_Mompox&oldid=4430914 which is the supposedly updated version. This is NOT the version that has been scraped.
So either the footer information is incorrect or something very strange is going on.
@Jaifroid Thanks for testing that. I launched the scrape manually earlier, as I thought I would get feedback from you. I will need this information for the upstream report.
@Jaifroid I tested it myself, fully confirm that the bug is still there, and have reported it upstream accordingly.
Hello, I have a similar problem with French Wikivoyage.
A page created in February 2022 is in the June file: https://library.kiwix.org/wikivoyage_fr_all_maxi_2022-06/A/Mont-Saint-Hilaire https://fr.wikivoyage.org/w/index.php?title=Mont-Saint-Hilaire&oldid=502132
But changes made in January 2020 are not in this file: https://library.kiwix.org/wikivoyage_fr_all_maxi_2022-06/A/Voyager_en_autocar_en_France https://fr.wikivoyage.org/w/index.php?title=Voyager_en_autocar_en_France&oldid=464098
Issues identified above persist in the July Wikivoyage, wikivoyage_en_all_maxi_2022-07.zim. The 'Cambridge (England)' article is still not accessible at all, and the article on 'Mompox' doesn't reflect extensive changes made on 22nd April. At least the footer now correctly identifies the Mompox article as the version last edited on 17th April, before these changes, which is something...
https://phabricator.wikimedia.org/T226931 seems to be the one left to fix
Thanks for the update. I came here just to note that the issues mentioned above still persist in wikivoyage_en_all_maxi_2022-10.zim... Just to prevent this issue from being marked as stale.
Issues persist in wikivoyage_en_all_maxi_2022-12.zim. They were supposed to have been fixed by https://phabricator.wikimedia.org/T274359, but my examples of the Cambridge (England) and the Mompox articles clearly show that the issue persists. The former continues to be inaccessible two years on, and the Mompox article still shows old information that was edited extensively over 8 months ago.
Yes, the problem is still there, but upstream :(