Home page is missing some pages when a title passed to articleList is a redirect
Sample command:
mwoffliner [email protected] --articleList="西班牙國_(1936年—1975年),Paris,Berlin" --customZimDescription="Test" --customZimTitle=Test --filenamePrefix=tests_en_mwoffliner --format=maxi --mwUrl=https://zh.wikipedia.org
Current result: the home page contains only the link to 西班牙國_(1936年—1975年), despite Paris (巴黎) and Berlin (柏林) being present inside the ZIM.
Expected result: the home page should have 3 links, one for each of the 3 articles which have been retrieved
Very very old known bug... but was surprised that no issue was still open. See https://github.com/openzim/mwoffliner/issues/889 and https://github.com/openzim/mwoffliner/issues/938
@benoit74 can I work on this issue?
I'm planning to fix an issue where redirected article titles in the list aren’t showing up properly. I’ll check if a title is a redirect using the API, follow it to the correct page, and make sure it gets included on the homepage. Then I’ll add a test to confirm it works.
If you use --mwUrl=https://zh.wikipedia.org (Chinese Wikipedia), then article titles in Chinese (e.g., 巴黎 for Paris, 柏林 for Berlin) must be passed in the --articleList. But if you use English names like "Paris" or "Berlin" instead of "巴黎" or "柏林", they might still be included in the ZIM, but they won’t appear on the homepage because the page index is generated based on matching localized titles.
@yugalkaushik Yes, that's the issue. We want 巴黎 and 柏林 to appear on the home page even when Paris and Berlin are passed in the articleList. By feature, we support Paris and Berlin in article list, and unless you have sound arguments proving it is a very bad idea, we won't drop this feature.
Correct me if I'm wrong, but the content inside the articles for Paris and Berlin is in Chinese, right? So why would they want the articleList items to be in English? If they wrote something like '西班牙國_(1936年—1975年)' in Chinese, wouldn't they write 'Paris' and 'Berlin' in Chinese too?
You are making too many assumptions, and "Paris" is clearly a redirect https://zh.wikipedia.org/w/index.php?title=Paris&redirect=no just like "Berlin" https://zh.wikipedia.org/w/index.php?title=Berlin&redirect=no.
This issue is about redirects handling in article list, not about Chinese.
Got it. Will try to understand more about it.
@benoit74 Hi, I'd like to tackle this issue.
What’s actually happening
When we pass a list of article titles to the system, redirects and normalization are not properly handled.
The problem: After redirects, some titles aren’t resolved properly, so they never get indexed in the main page list. As a result, those articles don’t appear at all.
My idea
I'll create a getFinalTitles() function that:
- Hits the MediaWiki API to get both normalization + redirect mappings: https://<wiki-domain>/w/api.php?action=query&titles=A|B|C&redirects&format=json, for example https://zh.wikipedia.org/w/api.php?action=query&titles=Paris|Berlin&redirects&format=json
- Returns a proper originalTitle → canonicalTitle map
- Then uses those canonical titles when looking up article metadata
This way: All redirects are resolved before we query for details. The main page will always display the correct articles.
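As a rough sketch of the mapping step described above: with the `redirects` parameter set, an `action=query` response carries `normalized` and `redirects` arrays of `from`/`to` pairs, and these can be combined into an originalTitle → canonicalTitle map. The helper name `buildFinalTitleMap` and the sample response are hypothetical (the response is written by hand for illustration, not captured from zh.wikipedia.org), and only a single redirect hop is followed.

```typescript
// One from/to pair, as found in the `normalized` and `redirects` arrays.
interface TitleMapping {
  from: string
  to: string
}

// Only the part of the action=query response this sketch needs.
interface QueryResponse {
  query?: { normalized?: TitleMapping[]; redirects?: TitleMapping[] }
}

// Hypothetical helper: maps every requested title to the title it resolves to,
// applying normalization first and then a single redirect hop.
function buildFinalTitleMap(requested: string[], response: QueryResponse): Map<string, string> {
  const toPair = (m: TitleMapping): [string, string] => [m.from, m.to]
  const normalized = new Map<string, string>((response.query?.normalized ?? []).map(toPair))
  const redirects = new Map<string, string>((response.query?.redirects ?? []).map(toPair))

  const result = new Map<string, string>()
  for (const title of requested) {
    const afterNormalization = normalized.get(title) ?? title
    result.set(title, redirects.get(afterNormalization) ?? afterNormalization)
  }
  return result
}

// Hand-written sample response (an assumption, for illustration only).
const sampleResponse: QueryResponse = {
  query: {
    normalized: [{ from: '西班牙國_(1936年—1975年)', to: '西班牙國 (1936年—1975年)' }],
    redirects: [
      { from: 'Paris', to: '巴黎' },
      { from: 'Berlin', to: '柏林' },
    ],
  },
}

const finalTitles = buildFinalTitleMap(['西班牙國_(1936年—1975年)', 'Paris', 'Berlin'], sampleResponse)
console.log(finalTitles.get('Paris'), finalTitles.get('Berlin'))
```

Titles that appear in neither array map to themselves, so the caller can look up every requested title without special cases.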
I've tried many cases and it worked.
Quick question though - are we only dealing with MediaWiki wikis here? Because this solution uses /w/api.php?action=query&titles=...&redirects which is MediaWiki-specific. If there are other wiki engines involved, I'd need to handle those differently.
Let me know if I can go ahead and complete this idea, or if it is not applicable.
@ziaddevv please help us solve this issue
Looks like you understood the problem correctly and your fix produces expected results.
However your current implementation does not seem to be particularly efficient.
Making new API calls to the Mediawiki must be strongly justified:
- each new API call is causing load on the upstream server, and we want to keep this to a minimum
- each new API call is causing delay since network calls are known to be slow
And in this case, I don't feel like it is necessary. We already grab information about all redirects and store this in the Redis (see RedisStore.redirectsXId key-value store). I'm pretty sure we already have everything required here.
And yes, mwoffliner is only about Mediawikis
Downloader.getArticleDetailsIds() is already doing the exact API request when getting the article details for article lists. So we would just need to use the result of that function to update the article list for the main page.
Thanks @benoit74 and @Markus-Rost for the guidance.
I refactored my original fix to avoid the extra MediaWiki API calls. Instead, I now use the existing redirectsXId entries already stored in Redis. The loop resolves each article ID through Redis, follows the redirect target if present, and then fetches the article details.
This keeps the logic minimal, avoids network overhead, and ensures redirected articles are properly included on the homepage.
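For illustration only, the lookup described above could look roughly like the sketch below. It is not the code from the actual fix: a plain in-memory `Map` stands in for the `redirectsXId` store (a real store lookup would be asynchronous), and the type `RedirectTargets` and the function `resolveArticleListIds` are hypothetical names.

```typescript
// Hypothetical stand-in for the redirect store: redirect ID -> target article ID.
type RedirectTargets = Map<string, string>

// Replaces each listed ID by its redirect target (one hop only),
// keeping first-seen order and dropping duplicates.
function resolveArticleListIds(articleIds: string[], redirectTargets: RedirectTargets): string[] {
  const seen = new Set<string>()
  const resolved: string[] = []
  for (const id of articleIds) {
    const target = redirectTargets.get(id) ?? id
    if (!seen.has(target)) {
      seen.add(target)
      resolved.push(target)
    }
  }
  return resolved
}

const knownRedirects: RedirectTargets = new Map([
  ['Paris', '巴黎'],
  ['Berlin', '柏林'],
])

// 'Paris' and '巴黎' point at the same article, so it is listed only once.
const homepageIds = resolveArticleListIds(['西班牙國_(1936年—1975年)', 'Paris', 'Berlin', '巴黎'], knownRedirects)
console.log(homepageIds)
```

Dropping duplicates matters here because an article list may contain both a redirect and its target, and the home page should link to the article only once.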
Appreciate the feedback — it helped me simplify the fix a lot!
Reading the code, it seems we don't handle multiple levels of redirects. Do I understand this correctly?
Good question; not sure how this is handled / presented at the API level
Many levels of redirects do not seem to really be supported on ~~Mediawiki~~ mwoffliner at all, I've opened https://github.com/openzim/mwoffliner/issues/2521 to track the issue.
Many levels of redirects do not seem to really be supported on Mediawiki at all, I've opened https://github.com/openzim/mwoffliner/issues/2521 to track the issue.
Not sure what you mean exactly by "supported", but the situation is quite common. Moving (renaming) an article twice or more is enough to create one.
AFAIK, a few communities have bots that resolve such situations asynchronously... but that is not something we can or should assume will be done.
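A minimal sketch of how such a chain could be followed, assuming the redirects are available as a simple ID → target map. This is not how mwoffliner currently behaves; `followRedirects`, its `maxHops` limit and the page names are hypothetical, chosen to mirror an article that was renamed twice.

```typescript
// Follows a redirect chain until a non-redirect ID is reached.
// Returns null when the chain loops or needs more than maxHops hops.
function followRedirects(id: string, redirectTargets: Map<string, string>, maxHops = 10): string | null {
  const visited = new Set<string>([id])
  let current = id
  let hops = 0
  while (true) {
    const next = redirectTargets.get(current)
    if (next === undefined) return current // not a redirect: this is the final article
    if (visited.has(next) || hops >= maxHops) return null // loop or chain too long
    visited.add(next)
    current = next
    hops++
  }
}

// An article renamed twice leaves a two-level chain behind (hypothetical titles).
const renameChain = new Map([
  ['Old name', 'Newer name'],
  ['Newer name', 'Current name'],
])

console.log(followRedirects('Old name', renameChain))
```

The visited set and the hop limit are there because nothing guarantees that redirects on a wiki are acyclic or short.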
Sorry, I wanted to say it is not supported by mwoffliner, not Mediawiki ...