Home page is missing some page when title passed to articleList is a redirect

Open benoit74 opened this issue 1 year ago • 8 comments

Sample command:

mwoffliner [email protected] --articleList="西班牙國_(1936年—1975年),Paris,Berlin" --customZimDescription="Test" --customZimTitle=Test --filenamePrefix=tests_en_mwoffliner --format=maxi --mwUrl=https://zh.wikipedia.org

Current result: the home page contains only the link to 西班牙國_(1936年—1975年), despite Paris (巴黎) and Berlin (柏林) being present inside the ZIM.

Expected result: the home page should have 3 links with the 3 articles which have been retrieved

benoit74 avatar Apr 01 '25 12:04 benoit74

Very, very old known bug... but I was surprised that no issue was still open. See https://github.com/openzim/mwoffliner/issues/889 and https://github.com/openzim/mwoffliner/issues/938

kelson42 avatar Apr 02 '25 17:04 kelson42

@benoit74 can I work on this issue?

Sheryar-Ahmed avatar Apr 05 '25 09:04 Sheryar-Ahmed

I'm planning to fix an issue where redirected article titles in the list aren’t showing up properly. I’ll check if a title is a redirect using the API, follow it to the correct page, and make sure it gets included on the homepage. Then I’ll add a test to confirm it works.

Sheryar-Ahmed avatar Apr 05 '25 10:04 Sheryar-Ahmed

If you use --mwUrl=https://zh.wikipedia.org (Chinese Wikipedia), then article titles in Chinese (e.g., 巴黎 for Paris, 柏林 for Berlin) must be passed in the --articleList. But if you use English names like "Paris" or "Berlin" instead of "巴黎" or "柏林", they might still be included in the ZIM, but they won’t appear on the homepage because the page index is generated based on matching localized titles.

yugalkaushik avatar Apr 06 '25 08:04 yugalkaushik

@yugalkaushik Yes, that's the issue. We want 巴黎 and 柏林 to appear on the home page even when Paris and Berlin are passed in the articleList. By feature, we support Paris and Berlin in article list, and unless you have sound arguments proving it is a very bad idea, we won't drop this feature.

benoit74 avatar Apr 07 '25 07:04 benoit74

Correct me if I'm wrong, but the content inside the articles for Paris and Berlin is in Chinese, right? So why would they want the articleList items to be in English? If they wrote something like '西班牙國_(1936年—1975年)' in Chinese, wouldn't they write 'Paris' and 'Berlin' in Chinese too?

yugalkaushik avatar Apr 07 '25 13:04 yugalkaushik

> Correct me if I'm wrong, but the content inside the articles for Paris and Berlin is in Chinese, right? So why would they want the articleList items to be in English? If they wrote something like '西班牙國_(1936年—1975年)' in Chinese, wouldn't they write 'Paris' and 'Berlin' in Chinese too?

You are making too many assumptions, and "Paris" is clearly a redirect (https://zh.wikipedia.org/w/index.php?title=Paris&redirect=no), as is "Berlin" (https://zh.wikipedia.org/w/index.php?title=Berlin&redirect=no).

This issue is about redirects handling in article list, not about Chinese.

kelson42 avatar Apr 07 '25 13:04 kelson42

Got it. Will try to understand more about it.

yugalkaushik avatar Apr 07 '25 14:04 yugalkaushik

@benoit74 Hi, I'd like to tackle this issue.

What’s actually happening: when we pass a list of article titles to the system, redirects and normalization are not properly handled.

The problem: After redirects, some titles aren’t resolved properly, so they never get indexed in the main page list. As a result, those articles don’t appear at all.

My idea: I'll create a getFinalTitles() function that:

  • Hits the MediaWiki API to get both normalization + redirect mappings
    https://<wiki-domain>/w/api.php?action=query&titles=A|B|C&redirects&format=json, for example https://zh.wikipedia.org/w/api.php?action=query&titles=Paris|Berlin&redirects&format=json
  • Returns a proper originalTitle → canonicalTitle map
  • Then use those canonical titles when looking up article metadata

This way, all redirects are resolved before we query for details, and the main page will always display the correct articles.
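A minimal sketch of the mapping step described above, in TypeScript since that is what mwoffliner is written in. It assumes the JSON response carries `query.normalized` and `query.redirects` arrays of `{from, to}` pairs and that each title needs at most one redirect hop. `buildTitleMap` and the sample response are hypothetical illustrations, not the actual patch.

```typescript
// Relevant parts of a MediaWiki `action=query&redirects&format=json` response.
interface TitleMapping {
  from: string
  to: string
}
interface QueryResponse {
  query?: { normalized?: TitleMapping[]; redirects?: TitleMapping[] }
}

// Hypothetical helper: maps every requested title to the title it finally resolves to.
// Normalization is applied first, then a single redirect hop (assumption).
function buildTitleMap(requested: string[], response: QueryResponse): Map<string, string> {
  const normalized = new Map<string, string>(
    (response.query?.normalized ?? []).map((m): [string, string] => [m.from, m.to]),
  )
  const redirects = new Map<string, string>(
    (response.query?.redirects ?? []).map((m): [string, string] => [m.from, m.to]),
  )
  const result = new Map<string, string>()
  for (const title of requested) {
    const afterNormalization = normalized.get(title) ?? title
    result.set(title, redirects.get(afterNormalization) ?? afterNormalization)
  }
  return result
}

// Made-up sample response for the titles "paris" and "Berlin" on zh.wikipedia.org.
const sampleResponse: QueryResponse = {
  query: {
    normalized: [{ from: 'paris', to: 'Paris' }],
    redirects: [
      { from: 'Paris', to: '巴黎' },
      { from: 'Berlin', to: '柏林' },
    ],
  },
}
const titleMap = buildTitleMap(['paris', 'Berlin', '巴黎'], sampleResponse)
```

Titles that are neither normalized nor redirected map to themselves, so the result can be used uniformly for the whole article list.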

I've tried many cases and it worked.

Quick question though: are we only dealing with MediaWiki wikis here? This solution uses /w/api.php?action=query&titles=...&redirects, which is MediaWiki-specific. If there are other wiki engines involved, I'd need to handle those differently.

Let me know whether I can go ahead and complete this idea, or whether it is not applicable.

ziaddevv avatar Sep 09 '25 17:09 ziaddevv

@ziaddevv please help us solve this issue

Looks like you understood the problem correctly and your fix produces expected results.

However your current implementation does not seem to be particularly efficient.

Making new API calls to MediaWiki must be strongly justified:

  • each new API call is causing load on the upstream server, and we want to keep this to a minimum
  • each new API call is causing delay since network calls are known to be slow

And in this case, I don't feel like it is necessary. We already grab information about all redirects and store it in Redis (see the RedisStore.redirectsXId key-value store). I'm pretty sure we already have everything required here.

And yes, mwoffliner is only about Mediawikis

benoit74 avatar Sep 11 '25 08:09 benoit74

Downloader.getArticleDetailsIds() is already doing the exact API request when getting the article details for article lists. So we would just need to use the result of that function to update the article list for the main page.

Markus-Rost avatar Sep 11 '25 10:09 Markus-Rost

Thanks @benoit74 and @Markus-Rost for the guidance.

I refactored my original fix to avoid the extra MediaWiki API calls. Instead, I now use the existing redirectsXId entries already stored in Redis. The loop resolves each article ID through Redis, follows the redirect target if present, and then fetches the article details.

This keeps the logic minimal, avoids network overhead, and ensures redirected articles are properly included on the homepage.
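As a rough illustration of that loop (not the actual patch), here is a sketch that assumes the redirect store can be queried by title and returns the redirect target, if any. The real RedisStore.redirectsXId API is not shown in this thread, so `RedirectLookup` and `resolveHomepageTitles` are hypothetical names, and the lookup is synchronous only for brevity; a real Redis lookup would be asynchronous.

```typescript
// Hypothetical lookup: returns the redirect target for a title, or undefined if it is not a redirect.
type RedirectLookup = (title: string) => string | undefined

// Replaces each redirect title in the article list by its target, keeping order and dropping duplicates.
function resolveHomepageTitles(articleList: string[], lookup: RedirectLookup): string[] {
  const resolved: string[] = []
  for (const title of articleList) {
    const finalTitle = lookup(title) ?? title
    if (!resolved.includes(finalTitle)) {
      resolved.push(finalTitle)
    }
  }
  return resolved
}

// In-memory stand-in for the redirects store, using the redirects discussed in this issue.
const redirectStore = new Map<string, string>([
  ['Paris', '巴黎'],
  ['Berlin', '柏林'],
])
const homepageTitles = resolveHomepageTitles(['Paris', 'Berlin', '巴黎'], (t) => redirectStore.get(t))
```

Dropping duplicates matters when the list contains both a redirect and its target, as in the last entry above; otherwise the home page would link to the same article twice.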

Appreciate the feedback — it helped me simplify the fix a lot!

ziaddevv avatar Sep 11 '25 17:09 ziaddevv

Reading the code, it seems we don't handle multiple levels of redirects. Do I understand this correctly?

kelson42 avatar Sep 12 '25 09:09 kelson42

> Reading the code, it seems we don't handle multiple levels of redirects. Do I understand this correctly?

Good question; not sure how this is handled / presented at the API level.

benoit74 avatar Sep 12 '25 09:09 benoit74

Many levels of redirects do not seem to really be supported on ~~Mediawiki~~ mwoffliner at all, I've opened https://github.com/openzim/mwoffliner/issues/2521 to track the issue.

benoit74 avatar Sep 12 '25 11:09 benoit74

> Many levels of redirects do not seem to really be supported on Mediawiki at all, I've opened https://github.com/openzim/mwoffliner/issues/2521 to track the issue.

Not sure what you mean exactly by "supported", but the situation is quite common. Moving (renaming) an article twice or more is enough to create it.

AFAIK, a few communities have bots that resolve such situations asynchronously... but that is not something we can or should assume will be done.
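To make the multi-level case concrete, here is a sketch of following a redirect chain to its end, with a guard against loops and overly long chains. It is not mwoffliner code: `resolveRedirectChain`, the `maxHops` limit and the sample titles are all assumptions made for illustration.

```typescript
// Follows redirects from `title` until a non-redirect title is reached.
// Throws on a redirect loop or when more than `maxHops` redirects are chained.
function resolveRedirectChain(title: string, redirects: ReadonlyMap<string, string>, maxHops = 10): string {
  const seen = new Set<string>([title])
  let current = title
  let hops = 0
  while (true) {
    const next = redirects.get(current)
    if (next === undefined) {
      return current // not a redirect: end of the chain
    }
    if (seen.has(next)) {
      throw new Error(`Redirect loop detected at "${next}"`)
    }
    hops += 1
    if (hops > maxHops) {
      throw new Error(`More than ${maxHops} redirect hops starting from "${title}"`)
    }
    seen.add(next)
    current = next
  }
}

// An article renamed twice: "Old name" -> "Newer name" -> "Current name".
const renamedTwice = new Map<string, string>([
  ['Old name', 'Newer name'],
  ['Newer name', 'Current name'],
])
const finalTitle = resolveRedirectChain('Old name', renamedTwice)
```

A single lookup would stop at "Newer name", which is itself a redirect, so the article would still be missing from the home page.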

kelson42 avatar Sep 12 '25 11:09 kelson42

Sorry, I wanted to say it is not supported by mwoffliner, not Mediawiki ...

benoit74 avatar Sep 12 '25 12:09 benoit74