
Wikipedia_en_top_all has 829k entries instead of 50k

Open Popolechien opened this issue 5 months ago • 11 comments

ZIM(s) location

https://browse.library.kiwix.org/#lang=eng&q=best+of+wikipedia

Recipe(s) URL

https://farm.openzim.org/recipes/wikipedia_en_top

Readers tested

  • [ ] Kiwix-serve on iOS (iPad / iPhone)
  • [ ] Kiwix-serve on Android (phone or tablet)
  • [ ] Kiwix-serve on Windows
  • [ ] Kiwix-serve on Linux
  • [ ] Kiwix-serve on Raspberry Pi (e.g. hotspot)
  • [ ] Kiwix-serve on Mac
  • [ ] pwa.kiwix.org
  • [ ] Kiwix JS - Chrome extension
  • [ ] Kiwix JS - Firefox extension
  • [ ] Kiwix JS - Edge extension
  • [x] Kiwix for Android application
  • [x] Kiwix for MacOS application
  • [ ] Kiwix for iOS (iPad/iPhone) application

Which ZIM versions are impacted?

All PROD versions are impacted

Details

Two users reported on Reddit that the ZIM file has a lot more entries than expected, both on Apple and Android devices.

Popolechien avatar Jun 29 '25 09:06 Popolechien

@benoit74 This looks like the redirects are being wrongly counted in the articleCount!

kelson42 avatar Jun 29 '25 10:06 kelson42

Copying @Jaifroid's comment here as it may be an interesting insight:

The format of the recent Wikimedia ZIMs produced by mwOffliner has switched from minorVersion 0 (with a separate A/ article namespace) to minorVersion 3 (with only a C/ content namespace which includes all user content, including images, and no old titleIndex). If the software is looking for an article count based on the length of the old title index, it's going to calculate the wrong value. I'm not sure this is what is happening, but given that it used to show the correct article count and no longer does since May/June this year

Popolechien avatar Jun 29 '25 10:06 Popolechien

These numbers should be based on the Counter metadata, see https://wiki.openzim.org/wiki/Metadata. libkiwix provides the primitives to get both article and media counts.
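
To illustrate how both counts could be derived from an already-parsed Counter map, here is a rough sketch. The names CounterMap, articleCount and mediaCount, and the prefix rules (text/html for articles; image/, audio/, video/ for media), are assumptions made for this example and are not taken from libkiwix.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical type: MIME string -> number of entries, as read from Counter.
using CounterMap = std::map<std::string, std::uint64_t>;

// True when s begins with prefix.
static bool startsWith(const std::string& s, const std::string& prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}

// Assumption: an "article" is any entry whose MIME string starts with text/html,
// so variants carrying parameters (e.g. a charset) are counted too.
std::uint64_t articleCount(const CounterMap& counter) {
    std::uint64_t total = 0;
    for (const auto& kv : counter) {
        if (startsWith(kv.first, "text/html")) {
            total += kv.second;
        }
    }
    return total;
}

// Assumption: "media" means image/, audio/ and video/ entries.
std::uint64_t mediaCount(const CounterMap& counter) {
    std::uint64_t total = 0;
    for (const auto& kv : counter) {
        if (startsWith(kv.first, "image/") || startsWith(kv.first, "audio/") ||
            startsWith(kv.first, "video/")) {
            total += kv.second;
        }
    }
    return total;
}
```

Under these rules a reader reports a figure close to the number of HTML pages, independent of how many redirects or images the ZIM file holds.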

kelson42 avatar Jun 29 '25 10:06 kelson42

Is this really that important? There are really only 50k articles, I'm sure about that. The file size is correct. Who cares about the rest ...

And counter is even correct: https://browse.library.kiwix.org/raw/wikipedia_en_top_maxi_2025-06/meta/Counter

benoit74 avatar Jun 29 '25 20:06 benoit74

JFYI that was a comment I wrote on Reddit replying to someone who had queried the sudden change in the article count for top ZIMs. See https://www.reddit.com/r/Kiwix/comments/1lmmei1/why_do_the_new_best_of_wikipedia_zims_say_they/ . I personally don't care about it, but clearly some Redditors do, so I thought I'd give my best guess as to what might be going on.

I wouldn't see this as top priority, but if it's an easy fix, it should be fixed in due course. It is misleading to show 859,640 articles when there are in fact only 50,000.

Jaifroid avatar Jun 29 '25 20:06 Jaifroid

And counter is even correct:

I cannot say for sure whether it is the cause of this bug, but this counter does not respect the spec in many parts of the string.

We should fix this, and there is a good chance that this will fix the bug.

Moving to MWoffliner.

kelson42 avatar Jun 30 '25 03:06 kelson42

@kelson42 can you explain what is wrong in the counter value so that we have a chance to fix it?

application/javascript=4;application/pdf=3;image/apng=1;image/gif=5166;image/jpeg=280;image/png=124;image/svg+xml=8;image/svg+xml; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/SVG/1.0.0"=67381;image/webp=524540;text/css=28;text/html=50000;text/html; charset=iso-8859-1=1;text/javascript=3

Note that we do not set this Counter metadata in the mwoffliner scraper at all, so should something be wrong in it, this is at least half libzim's fault 🤣

benoit74 avatar Jun 30 '25 07:06 benoit74

Actually this is not checked properly in zimcheck either, so I am filing a feature request.

kelson42 avatar Jun 30 '25 11:06 kelson42

@benoit74 The Counter metadata is indeed written by libzim, based on the MIME types given by MWoffliner. See https://github.com/openzim/libzim/blob/main/src/writer/counterHandler.h for the exact piece of code.

In the string application/javascript=4;application/pdf=3;image/apng=1;image/gif=5166;image/jpeg=280;image/png=124;image/svg+xml=8;image/svg+xml; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/SVG/1.0.0"=67381;image/webp=524540;text/css=28;text/html=50000;text/html; charset=iso-8859-1=1;text/javascript=3, which is the Counter metadata for wikipedia_en_top_maxi_2025-06.zim, I see the following problems:

  • image/svg+xml; which has no =xyz part ... and we already have an entry image/svg+xml=8. This seems to be a bug in libzim, but one probably triggered by an inconsistency in what MWoffliner provides
  • profile="https://www.mediawiki.org/wiki/Specs/SVG/1.0.0"=67381; where profile="https://www.mediawiki.org/wiki/Specs/SVG/1.0.0" does not look like a MIME type. Here I believe the MIME-type parameter profile="https://www.mediawiki.org/wiki/Specs/SVG/1.0.0" is badly handled. Here again it looks more like a bug in libzim
  • Again a value without a number at text/html;
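
The fragments listed above are what a reader obtains when it splits the Counter string on every ';' without regard for MIME-type parameters. A minimal sketch of that naive split follows; naiveSplit is a hypothetical helper written for this illustration, not code from libzim or libkiwix.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: cuts the Counter string at every ';',
// with no awareness of MIME-type parameters or quoted values.
std::vector<std::string> naiveSplit(const std::string& counter) {
    std::vector<std::string> items;
    std::istringstream in(counter);
    std::string item;
    while (std::getline(in, item, ';')) {
        items.push_back(item);
    }
    return items;
}
```

Applied to the SVG and HTML part of the string, this yields items such as image/svg+xml and text/html with no =count part, plus items starting with charset= or profile= that are not MIME types at all.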

To conclude, I kind of agree that this is at least 90% a bug in libzim... and therefore probably not a regression (which is surprising to me considering the visibility of the bug).

Who wants to code patches in C++? ;)

kelson42 avatar Jun 30 '25 11:06 kelson42

It rather seems to me that the first two problems are the result of a single MIME-type string, image/svg+xml; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/SVG/1.0.0", which is carried by most of the SVGs (67381). Similarly, there is one HTML page that comes with the MIME-type string text/html; charset=iso-8859-1.
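
A minimal sketch of this reading: a parser that keeps the parameters attached to their MIME type. It is not taken from libzim, and parseCounter and CounterEntry are hypothetical names. It relies on an assumption drawn only from the string above: a ';' followed by a space (or sitting inside double quotes) separates parameters, while any other ';' ends an entry.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical type: full MIME string (parameters included) and its count.
using CounterEntry = std::pair<std::string, std::uint64_t>;

// Hypothetical helper. Assumption: "; " and any ';' inside double quotes
// belong to the MIME type's parameters; every other ';' ends an entry.
std::vector<CounterEntry> parseCounter(const std::string& counter) {
    std::vector<CounterEntry> entries;
    std::string current;
    bool inQuotes = false;

    // Turns the accumulated text into an entry: the count follows the last '='.
    auto flush = [&]() {
        const std::size_t eq = current.rfind('=');
        if (eq != std::string::npos && eq != 0 && eq + 1 < current.size()) {
            const std::string digits = current.substr(eq + 1);
            if (digits.find_first_not_of("0123456789") == std::string::npos) {
                entries.emplace_back(current.substr(0, eq), std::stoull(digits));
            }
        }
        current.clear();
    };

    for (std::size_t i = 0; i < counter.size(); ++i) {
        const char c = counter[i];
        if (c == '"') {
            inQuotes = !inQuotes;
        }
        const bool beforeSpace = i + 1 < counter.size() && counter[i + 1] == ' ';
        if (c == ';' && !inQuotes && !beforeSpace) {
            flush();  // end of one "mime=count" entry
        } else {
            current += c;
        }
    }
    flush();
    return entries;
}
```

With this rule the string from the ZIM file parses into 13 well-formed entries, two of which are SVG variants and two HTML variants.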

veloman-yunkan avatar Jul 03 '25 13:07 veloman-yunkan

Yes, we have to remove all MIME-type parameters.
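
One way this could look on the writer side, as a sketch only: strip everything from the first ';' before counting, so that all variants of a type fall into one bucket. MimeCounter and stripMimeParameters are hypothetical names for this illustration and do not reflect the actual code in counterHandler.h.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical helper: keeps only "type/subtype", dropping everything
// from the first ';' onward as well as trailing spaces or tabs.
std::string stripMimeParameters(const std::string& mimeType) {
    std::string base = mimeType.substr(0, mimeType.find(';'));
    const std::size_t last = base.find_last_not_of(" \t");
    base.erase(last == std::string::npos ? 0 : last + 1);
    return base;
}

// Hypothetical counter: one bucket per parameter-free MIME type.
class MimeCounter {
public:
    void add(const std::string& mimeType) {
        ++counts_[stripMimeParameters(mimeType)];
    }

    // Serializes as "mime=count;mime=count", sorted by MIME type.
    std::string serialize() const {
        std::string out;
        for (const auto& kv : counts_) {
            if (!out.empty()) {
                out += ';';
            }
            out += kv.first + '=' + std::to_string(kv.second);
        }
        return out;
    }

private:
    std::map<std::string, std::uint64_t> counts_;
};
```

With such normalization, the 8 plain SVGs and the 67381 SVGs carrying charset and profile parameters would be reported together as image/svg+xml=67389, and the HTML entries as text/html=50001.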

kelson42 avatar Jul 03 '25 13:07 kelson42