Infoboxes are missing in many articles

Open Inbefortus opened this issue 3 years ago • 15 comments

Kiwix Version: 3.4.4
ZIM: 2021-03 German Wikipedia

The additional information on the right side (the infobox) is not scraped and is therefore missing.

Kiwix:

https://user-images.githubusercontent.com/71934042/124442750-57493c80-dd7d-11eb-9e77-744b803816c7.mp4

https://user-images.githubusercontent.com/71934042/124442760-57e1d300-dd7d-11eb-9d00-7abd081b6c64.mp4

Original German Wikipedia:

20210705_103359

20210705_103519

Inbefortus avatar Jul 05 '21 08:07 Inbefortus

@Inbefortus I see this content in wikipedia_de_all_maxi_2020-06.zim when rendered in Kiwix JS PWA. So it's either an issue with the latest ZIM, or it's an issue with the reader. If you try your ZIM with pwa.kiwix.org (just visit that page in a browser and pick your file), let us know if the content is still missing. I suggest you turn on the desktop style (in Configuration) for the closest rendering to what you're seeing above, though I still see the content in mobile style too on the earlier ZIM, just not so well formatted.

Jaifroid avatar Jul 05 '21 09:07 Jaifroid

@Inbefortus I see this content in wikipedia_de_all_maxi_2020-06.zim when rendered in Kiwix JS PWA. So it's either an issue with the latest ZIM, or it's an issue with the reader. If you try your ZIM with pwa.kiwix.org (just visit that page in a browser and pick your file), let us know if the content is still missing. I suggest you turn on the desktop style (in Configuration) for the closest rendering to what you're seeing above, though I still see the content in mobile style too on the earlier ZIM, just not so well formatted.

It's definitely an issue with the ZIM file. I get the same result:

Screenshot_20210705-114126_Samsung Internet

Screenshot_20210705-114202_Samsung Internet

Inbefortus avatar Jul 05 '21 09:07 Inbefortus

You should compare like with like, meaning the mobile output. The screenshots here are from the desktop output.

kelson42 avatar Jul 05 '21 10:07 kelson42

Original German Wikipedia (mobile version):

Screenshot_20210705-124144_Samsung Internet Screenshot_20210705-124210_Samsung Internet

Kiwix JS PWA (mobile version): Screenshot_20210705-124604_Samsung Internet Screenshot_20210705-124620_Samsung Internet

Inbefortus avatar Jul 05 '21 10:07 Inbefortus

OK, so I concur that the problem is that ZIM version, possibly some change in MWOffliner or Parsoid between 06/2020 and 03/2021. Image below, for reference, is the desktop version of the 06/2020 German ZIM in Kiwix JS PWA.

image

Jaifroid avatar Jul 05 '21 12:07 Jaifroid

I have checked, and it is as I said: the mobile API does not deliver the infobox: https://de.wikipedia.org/api/rest_v1/page/mobile-sections/Schlacht_bei_Wavre

@Inbefortus The screenshot you provided was taken in the mobile view with your desktop browser, and it does indeed have the infobox. I have no clue how that view is built, but it does not use the mobile API (which is what MWoffliner uses).

The Wikipedia Android App uses the mobile API and does not provide the infobox either.

We may regret that the infobox is not provided by Wikipedia's mobile API, but that is not our decision. We may also regret that MWoffliner does not scrape from the desktop API, but we decided a few years ago to focus on mobile, as we don't have the resources to provide both (desktop + mobile). @Jaifroid That said, I wonder whether in 2020/06 we were still scraping from the desktop API... but that does not change much about what I said earlier.

So far MWoffliner works as intended. Closing the ticket.

kelson42 avatar Jul 08 '21 12:07 kelson42
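
For anyone wanting to repeat the check described in the comment above, here is a minimal sketch in Python. It assumes the `mobile-sections` response is a JSON object with `lead.sections` and `remaining.sections` arrays whose entries carry their HTML in a `text` field, and it assumes an infobox is a `<table>` whose class name contains `infobox`. Both are assumptions for illustration, not something confirmed in this thread, and the endpoint may no longer be available. The function names and the User-Agent string are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from html.parser import HTMLParser


class _InfoboxFinder(HTMLParser):
    """Sets `found` when a <table> with an 'infobox'-like class is seen."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag != "table":
            return
        # Assumption: infoboxes are tables whose class contains "infobox".
        classes = (dict(attrs).get("class") or "").split()
        if any("infobox" in c.lower() for c in classes):
            self.found = True


def html_has_infobox(html):
    finder = _InfoboxFinder()
    finder.feed(html)
    return finder.found


def payload_has_infobox(payload):
    """Check every section's HTML in a mobile-sections style payload."""
    # Assumed layout: {"lead": {"sections": [...]}, "remaining": {"sections": [...]}}
    sections = (payload.get("lead", {}).get("sections", [])
                + payload.get("remaining", {}).get("sections", []))
    return any(html_has_infobox(s.get("text", "")) for s in sections)


def fetch_mobile_sections(title, host="de.wikipedia.org"):
    """Download the payload for one article (requires network access)."""
    quoted = urllib.parse.quote(title, safe="")
    url = f"https://{host}/api/rest_v1/page/mobile-sections/{quoted}"
    # Hypothetical User-Agent; replace with your own contact details.
    req = urllib.request.Request(url, headers={"User-Agent": "infobox-check-sketch/0.1"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

Usage would be along the lines of `payload_has_infobox(fetch_mobile_sections("Schlacht_bei_Wavre"))`. A `False` result only means that no table matching the assumed class was found, not that the information is absent in every form.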

@kelson42 2020-06 is scraped from the Mobile API (the entire ZIM has mobile styles, the desktop views I showed were merely the application of a desktop style by the reader). For whatever reason, the API stopped providing the infoboxes between 2020-06 and today, at least for some ZIMs and some pages. There are definitely infoboxes in other current Wikimedia ZIMs (well, I haven't downloaded new ones for about a month). Maybe there is something special about these particular infoboxes? Sounds like a bug in Parsoid if just these infoboxes are missing from the API...

Jaifroid avatar Jul 08 '21 12:07 Jaifroid

Do we have a similar infobox which is included?

kelson42 avatar Jul 08 '21 12:07 kelson42

@kelson42 It depends what you mean by "similar". There are infoboxes on almost every article of the most recent Wikipedia-based ZIM I have, which is wikipedia_en_medicine-app_maxi_2021-06.zim. An example is in the first screenshot below.

The German Wikipedia 2020-06 has the "Waterloo" infobox, but it is not identified by class as an infobox. Maybe this is part of the issue with the 2021-03 ZIM, if the API has recently been updated to select infoboxes by class... See the bottom screenshot, from the 2020-06 ZIM.

image

image

Jaifroid avatar Jul 08 '21 14:07 Jaifroid
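
The distinction raised in the comment above (an infobox marked by class versus a table that only looks like one) can be sketched as follows. This is an illustration only: the `float:right` style test and the `float-right` class name are assumed heuristics for "infobox-like" tables, and the labels are hypothetical. Nothing here reflects how the mobile API actually selects content.

```python
from html.parser import HTMLParser


class TableClassifier(HTMLParser):
    """Labels each <table> in the order it appears in the document."""

    def __init__(self):
        super().__init__()
        self.labels = []

    def handle_starttag(self, tag, attrs):
        if tag != "table":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").lower().split()
        style = (attr_map.get("style") or "").lower().replace(" ", "")
        if any("infobox" in c for c in classes):
            # Explicitly marked as an infobox by its class.
            self.labels.append("infobox-by-class")
        elif "float:right" in style or "float-right" in classes:
            # Assumed heuristic: right-floated table without an infobox class.
            self.labels.append("infobox-like-without-class")
        else:
            self.labels.append("other")


def classify_tables(html):
    parser = TableClassifier()
    parser.feed(html)
    return parser.labels
```

If the articles with missing infoboxes mostly fall into the second label when run against the online desktop HTML, that would support the class-selection hypothesis. If not, the cause lies elsewhere.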

@Jaifroid @Inbefortus After thinking twice about it, I don't want to challenge the Wikimedia team on this. To me this is not an obvious bug, even if I personally would prefer to always have the infobox. I don't really want to start a discussion about what should be in and what should be out when the problem is not obvious. Therefore, either you open an upstream ticket yourself (I would be interested in following it) and we link it to this ticket (and keep this ticket open), or I will close this ticket (because there is nothing more that can be done at my level).

kelson42 avatar Aug 28 '21 15:08 kelson42

@kelson42 I think I'd need to be sure that this is a consistent upstream error with specific infobox types before filing it as a Parsoid (?) bug. There are infoboxes all over Wikipedia that are perfectly well represented in Kiwix ZIMs with mobile style, yet there are some that are missing for no apparent reason. It's not that all infoboxes are missing by any means. So either it's a random bug, or there are specific infoboxes that are being accidentally omitted by the mobile Parsoid API, even though they are somehow shown in the mobile view of Wikipedia online. We need more info, especially with the latest scrapes, before being able to claim this is an API bug.

Jaifroid avatar Aug 29 '21 07:08 Jaifroid
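
One way to gather the evidence asked for in the comment above is to record, per article, whether an infobox appears in the online desktop HTML and in the mobile API output, and then tally the results. The sketch below only does the tallying. How the two flags are obtained is left open, and the function name and category labels are hypothetical.

```python
from collections import Counter


def tally(observations):
    """Summarize (title, in_desktop_html, in_mobile_api) observations.

    Returns a Counter of categories and the list of titles whose infobox
    is present on desktop but absent from the mobile API output.
    """
    counts = Counter()
    missing = []
    for title, in_desktop, in_mobile in observations:
        if in_desktop and in_mobile:
            counts["present in both"] += 1
        elif in_desktop:
            counts["missing from mobile API only"] += 1
            missing.append(title)
        elif in_mobile:
            counts["mobile API only"] += 1
        else:
            counts["no infobox"] += 1
    return counts, missing
```

A consistent pattern among the titles in `missing` (for example, the same template or the same table markup) would be the kind of detail an upstream report needs.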

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stale[bot] avatar Nov 09 '21 20:11 stale[bot]

@kelson42 @Jaifroid I just realized that in the latest German Wikipedia (2022-06), infoboxes are also now missing in all articles about films/series.

ZomboDroid 30062022221959

I wonder whether this is a bug or intentional. If this continues and at some point all infoboxes in articles about people, countries, animals, cities, etc. are no longer accessible, a lot of important information would be lost.

However, to settle this once and for all, I will be creating an upstream ticket tomorrow, so stay tuned!

Inbefortus avatar Jun 30 '22 21:06 Inbefortus

@kelson42 @Jaifroid Here it is:

  • https://phabricator.wikimedia.org/T311817

Inbefortus avatar Jul 01 '22 09:07 Inbefortus

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stale[bot] avatar Sep 21 '22 03:09 stale[bot]

@Inbefortus and @Jaifroid: There is a similar issue with the German Wiktionary and its ZIM file.

I filed a bug upstream, and both issues might be related: https://phabricator.wikimedia.org/T319303

If you have any ideas, please follow that thread too. Thanks!

Immunize2 avatar Oct 04 '22 15:10 Immunize2

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stale[bot] avatar May 26 '23 18:05 stale[bot]