Wiktionary/Wikivoyage zim databases lag website by five months

Open archenemies opened this issue 3 years ago • 26 comments

I have a Wiktionary Zim file from December 2020, which I downloaded using the GUI kiwix-desktop interface (2020-12-10; "Pictures, Fulltext index"; 5.65 GB).

This works great for me but I'm not sure how to figure out which Wiktionary it is based on.

It lacks changes to Wiktionary made in August 2020, although it contains changes from May 2020.

Where can I find out which Wiktionary dump a Zim file is based on, and how do I find a Zim file which is based on a current version of Wiktionary?

(And where should I submit this issue?)

archenemies avatar Feb 06 '21 20:02 archenemies

Are you talking about Wiktionary in English? Which content exactly is missing (two screenshots would be helpful)?

kelson42 avatar Feb 07 '21 03:02 kelson42

Yes English.

Here is an example of a diff from August which is missing from the December 2020 Kiwix Wiktionary Zim file. I just picked it at random; so far, the December Zim file seems to be missing everything since around June.

https://en.wiktionary.org/w/index.php?title=rocker&diff=prev&oldid=60027083

Someone added a sense to "rocker", number 4 here:

screenshot-2021-02-06_20 12 34

Here's the Kiwix screenshot where you can see that it's missing:

screenshot-2021-02-06_20 12 47

I guess the answer to my other question is that there is no reason for the Zim file to be out of date then? Certainly as a software developer I would expect the Zim file to have embedded in it a date corresponding to when it was compiled, so that this kind of ad-hoc testing would not be necessary. Or does it get updated one word at a time, so different dictionary entries are out of date by different amounts? But in that case I would expect each entry to come with a timestamp...

archenemies avatar Feb 07 '21 04:02 archenemies

@archenemies I will have a look (and move the ticket), but this looks like a problem with a root cause in the Wikimedia infrastructure.

kelson42 avatar Feb 07 '21 08:02 kelson42

@archenemies BTW, the revision ID, like the revision date, is available in the upstream link in the footer of each article.

kelson42 avatar Feb 07 '21 09:02 kelson42

That's interesting about the upstream link in the footer. However, "rocker" has the wrong link:

https://en.wiktionary.org/wiki/?title=rocker&oldid=61038509

because it points to a revision from 4 November 2020 with the "breve below" sense #4 filled in, but the page that Kiwix serves me lacks that sense.
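The footer link carries the revision in its `oldid` query parameter, so the revision a ZIM entry claims to contain can be read off mechanically. A minimal Python sketch using only the standard library; the function name `parse_footer_link` is hypothetical and not part of Kiwix or mwoffliner:

```python
from urllib.parse import urlparse, parse_qs


def parse_footer_link(url):
    """Return (title, revision_id) from an upstream footer link.

    Either element is None when the corresponding query parameter is absent.
    """
    query = parse_qs(urlparse(url).query)
    title = query.get("title", [None])[0]
    oldid = query.get("oldid", [None])[0]
    return title, int(oldid) if oldid is not None else None


# The footer link quoted above for "rocker":
print(parse_footer_link("https://en.wiktionary.org/wiki/?title=rocker&oldid=61038509"))
# → ('rocker', 61038509)
```

This only tells you which revision the footer names. As the "rocker" example shows, the scraped text may still come from an older revision.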

archenemies avatar Feb 07 '21 17:02 archenemies

It looks like a bug in the Wikimedia REST API, because it simply does not deliver the latest version (as you reported). See: https://en.wiktionary.org/api/rest_v1/page/mobile-sections/rocker. This is the root of the bug.

On the mwoffliner side, there is a weakness: we don't request a specific revision ID, but just take the latest. If we retrieved https://en.wiktionary.org/api/rest_v1/page/mobile-sections/rocker/61774146, we would get the proper content.
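To illustrate the difference between the two requests, here is a minimal Python sketch that builds both URL forms. The URL pattern is copied from the links in this comment; the helper name `mobile_sections_url` is hypothetical, and this is not how mwoffliner itself is written. The sketch only builds the URL and does not check whether the endpoint is still served.

```python
from urllib.parse import quote


def mobile_sections_url(host, title, revision=None):
    """Build a REST URL for a page, optionally pinned to one revision.

    Without `revision`, the server decides which version to return
    (the behaviour described as a weakness above).
    With `revision`, the request names the exact revision wanted.
    """
    # MediaWiki titles use underscores in place of spaces.
    encoded_title = quote(title.replace(" ", "_"), safe="")
    url = f"https://{host}/api/rest_v1/page/mobile-sections/{encoded_title}"
    if revision is not None:
        url += f"/{int(revision)}"
    return url


print(mobile_sections_url("en.wiktionary.org", "rocker"))
print(mobile_sections_url("en.wiktionary.org", "rocker", 61774146))
```

Pinning the revision means a stale cache on the server side cannot silently substitute older content for the revision the scraper expects.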

I will do what is necessary on both sides to improve the situation.

kelson42 avatar Feb 10 '21 12:02 kelson42

A bug ticket has been opened upstream at https://phabricator.wikimedia.org/T274359

kelson42 avatar Feb 10 '21 13:02 kelson42

@MananJethwani Here again this is "complicated" to change due to the architecture.

kelson42 avatar Feb 10 '21 13:02 kelson42

@kelson42 Thank you so much for tracking that down and re-reporting the bug

archenemies avatar Feb 10 '21 18:02 archenemies

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stale[bot] avatar Jun 02 '21 16:06 stale[bot]

Just to track and keep this issue fresh, it is still impossible to open the article "Cambridge" from the 2021-09 English Wikivoyage ostensibly due to this bug. (Cambridge is a major tourist destination pre- and post-pandemic, so it is a quite serious upstream bug!)

Jaifroid avatar Sep 23 '21 10:09 Jaifroid

See also https://phabricator.wikimedia.org/T226931. There seems to be some momentum these days to fix it upstream...

kelson42 avatar Dec 05 '21 10:12 kelson42

"Cambridge" still inaccessible in the December Wikivoyage in English... The lag hasn't caught up yet...

Jaifroid avatar Dec 17 '21 16:12 Jaifroid

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stale[bot] avatar Mar 02 '22 08:03 stale[bot]

@kelson42 "Cambridge (England)" article is STILL inaccessible in wikivoyage_en_all_maxi_2022-05.zim. Can there really be a 15-month time lag? No movement on the Phabricator ticket since December 2021.

Is there any conceivable workaround we can introduce on our side?

Jaifroid avatar May 07 '22 10:05 Jaifroid

@Jaifroid Funny that you come back to this ticket today: 2 days ago something happened related to this bug... It might be fixed in the near future at Wikimedia, but it is too early to be sure.

kelson42 avatar May 07 '22 10:05 kelson42

It's because I'm preparing the next Wikivoyage release, so I checked again. Thanks for the update! Fingers crossed...

Jaifroid avatar May 07 '22 12:05 Jaifroid

Just to add some more data points, and something very strange which suggests the issue is not only with the API:

I made major updates to the article https://en.wikivoyage.org/wiki/Santa_Cruz_de_Mompox on 15th April. On 22nd April, a new Wikivoyage ZIM was created that did not contain these updates. I subsequently checked the REST API version https://en.wikivoyage.org/api/rest_v1/page/html/Santa_Cruz_de_Mompox, and it was the updated version.

I fully expected the May Wikivoyage, published on 4th May, to reflect these updates. However, it still shows the out-of-date version. Curiously, the footer points to https://en.wikivoyage.org/wiki/?title=Santa_Cruz_de_Mompox&oldid=4430914 , which is the updated version last edited 17th April, but the actual text scraped is a previous version, prior to my changes.

How can this be? Why does the footer point to a recent update, but the text does not correspond to that update?

Jaifroid avatar May 08 '22 20:05 Jaifroid

Looks like this bug has finally been fixed upstream. Hopefully the next generation of ZIM files will be OK.

kelson42 avatar May 26 '22 16:05 kelson42

@kelson42 This is good news! I'll attempt to corroborate with the next iteration.

Jaifroid avatar May 26 '22 20:05 Jaifroid

@kelson42 Unfortunately, it is still not possible to open the article "Cambridge (England)" in wikivoyage_en_all_maxi_2022-06.zim which was scraped in the last couple of days. Also, the article "Santa Cruz de Mompox" is still the old version prior to the major edits I myself made on 15th April this year. Additionally, the footer of our scraped page directs to an updated version of the article (in history) which is NOT the one contained in our scrape.

Compare:

  • https://library.kiwix.org/wikivoyage_en_all_maxi_2022-06/A/Santa_Cruz_de_Mompox (June 2022 scrape)
  • https://en.wikivoyage.org/wiki/Santa_Cruz_de_Mompox (online)

You will notice missing images in our scrape, and the lede is completely different. HOWEVER, if I click on the footer link in our scrape, it points to https://en.wikivoyage.org/wiki/?title=Santa_Cruz_de_Mompox&oldid=4430914 which is the supposedly updated version. This is NOT the version that has been scraped.

So either the footer information is incorrect or something very strange is going on.
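One way to narrow this down is to ask the wiki when the revision named in the footer was actually saved, and compare that with the edit dates. A minimal Python sketch, assuming the standard MediaWiki Action API (`action=query`, `prop=revisions`, `revids`) with `formatversion=2` JSON output; the helper names `revision_info_url` and `revision_timestamp` are hypothetical, and fetching the URL is left out:

```python
import json
from urllib.parse import urlencode


def revision_info_url(host, revision_id):
    """Build an Action API query for the ID and timestamp of one revision."""
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": str(int(revision_id)),
        "rvprop": "ids|timestamp",
        "format": "json",
        "formatversion": "2",
    }
    return f"https://{host}/w/api.php?{urlencode(params)}"


def revision_timestamp(response_text):
    """Return the first revision timestamp in an Action API response, or None."""
    data = json.loads(response_text)
    for page in data.get("query", {}).get("pages", []):
        for revision in page.get("revisions", []):
            return revision.get("timestamp")
    return None


# Query for the revision named in the Mompox footer link above:
print(revision_info_url("en.wikivoyage.org", 4430914))
```

If the timestamp returned for the footer's revision is later than the content visible in the scrape, the footer and the scraped text come from different revisions. That would match what is described above.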

Jaifroid avatar Jun 05 '22 13:06 Jaifroid

@Jaifroid Thx for testing that. I launched the scrape manually earlier because I thought I would get feedback from you. I will need this information upstream.

kelson42 avatar Jun 05 '22 16:06 kelson42

@Jaifroid I tested it myself and can fully confirm that the bug is still there. I have reported it upstream accordingly.

kelson42 avatar Jun 08 '22 08:06 kelson42

Hello, I have a similar problem with French Wikivoyage.

A page created in February 2022 is in the June file: https://library.kiwix.org/wikivoyage_fr_all_maxi_2022-06/A/Mont-Saint-Hilaire https://fr.wikivoyage.org/w/index.php?title=Mont-Saint-Hilaire&oldid=502132

But changes made in January 2020 are not in this file: https://library.kiwix.org/wikivoyage_fr_all_maxi_2022-06/A/Voyager_en_autocar_en_France https://fr.wikivoyage.org/w/index.php?title=Voyager_en_autocar_en_France&oldid=464098

xinxinxinxinxin avatar Jul 11 '22 11:07 xinxinxinxinxin

Issues identified above persist in July Wikivoyage wikivoyage_en_all_maxi_2022-07.zim. 'Cambridge (England)' article still not accessible at all, and article on 'Mompox' doesn't reflect extensive changes made on 22nd April. At least the footer now correctly identifies the Mompox article as the version last edited on 17th April, before these changes, which is something...

Jaifroid avatar Jul 19 '22 05:07 Jaifroid

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stale[bot] avatar Sep 21 '22 03:09 stale[bot]

https://phabricator.wikimedia.org/T226931 seems to be the one left to fix

kelson42 avatar Oct 15 '22 16:10 kelson42

Thanks for the update. I came here just to note that the issues mentioned above still persist in wikivoyage_en_all_maxi_2022-10.zim... Just to prevent this issue from being marked as stale.

Jaifroid avatar Oct 23 '22 14:10 Jaifroid

Issues persist in wikivoyage_en_all_maxi_2022-12.zim. They were supposed to have been fixed by https://phabricator.wikimedia.org/T274359, but my examples of the Cambridge (England) and the Mompox articles clearly show that the issue persists. The former continues to be inaccessible two years on, and the Mompox article still shows old information that was edited extensively over 8 months ago.

Jaifroid avatar Jan 01 '23 13:01 Jaifroid

Yes, the problem is still there, but upstream :(

kelson42 avatar Jan 01 '23 13:01 kelson42