mwoffliner Troubleshooting minecraftwiki_zh

Note: This has only been tested on MWoffliner v1.13.0 (since all openZIM scrapers are using this version). Both the code and the configuration differ a lot between v1.13.0 and git main, so this still needs to be tested on git main.

The following description is mostly taken from my comment when troubleshooting the scrape for Minecraft Wiki (zh) (openzim/zim-requests#755).

The scraper reports Unable to find appropriate API end-point to retrieve article HTML when scraping Minecraft Wiki (zh). Here is a code analysis of MWoffliner v1.13.0.

Before the scrape starts, MWoffliner checks mobile REST API, desktop REST API, and VE REST API capabilities for a specific page (parameter testArticleId) in Downloader.checkCapabilities:

https://github.com/openzim/mwoffliner/blob/e9d4113536f0eebdaabe8cc26e25ccdeeca20e32/src/Downloader.ts#L243-L263

The default value MediaWiki:Sidebar is never used because the value of mwMetaData.mainPage is passed:

https://github.com/openzim/mwoffliner/blob/e9d4113536f0eebdaabe8cc26e25ccdeeca20e32/src/mwoffliner.lib.ts#L206

The value of mwMetaData.mainPage comes from the API: the base URL is split on slashes and its last part is taken. (This is a bad idea because different wikis have different URL rewrites.)

https://github.com/openzim/mwoffliner/blob/e9d4113536f0eebdaabe8cc26e25ccdeeca20e32/src/MediaWiki.ts#L290-L325 https://github.com/openzim/mwoffliner/blob/e9d4113536f0eebdaabe8cc26e25ccdeeca20e32/src/MediaWiki.ts#L235-L279

This works for many wikis, such as English Wikipedia, but not for the Chinese Minecraft Wiki. The reason is that MCW-zh uses URL rewriting, so its base URL does not end with the main page title:

// Wikipedia-en
"base": "https://en.wikipedia.org/wiki/Main_Page",
// MCW-zh
"base": "https://zh.minecraft.wiki/",
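
To make the failure concrete, here is a minimal sketch that applies the v1.13.0 expression `decodeURIComponent(entries.base.split('/').pop())` to both `base` values. The helper name `mainPageFromBase` is made up for illustration and is not part of MWoffliner.

```typescript
// Hypothetical helper wrapping the v1.13.0 expression.
function mainPageFromBase(base: string): string {
  // pop() yields the text after the last '/', or '' when the URL ends with '/'.
  return decodeURIComponent(base.split('/').pop() ?? '')
}

// Wikipedia-en: the last path segment happens to be the main page title.
const enMainPage: string = mainPageFromBase('https://en.wikipedia.org/wiki/Main_Page')
// enMainPage is 'Main_Page'

// MCW-zh: nothing follows the last slash, so the title comes out empty.
const zhMainPage: string = mainPageFromBase('https://zh.minecraft.wiki/')
// zhMainPage is ''
```

An empty title is what later gets passed on as the test article.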

There are two ways to fix this:

⭐ Set mwMetaData.mainPage to entries.mainpage, which is already included in the API result. (MediaWiki documentation)
```
-const mainPage = decodeURIComponent(entries.base.split('/').pop())
+const mainPage = entries.mainpage
```

Use the default parameter for Downloader.checkCapabilities:

-await downloader.checkCapabilities(mwMetaData.mainPage)
+await downloader.checkCapabilities()

I have tested both, and both worked.
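
For reference, the first fix could also be written defensively, roughly like the sketch below. This is only an illustration under my own assumptions: the `SiteInfoGeneral` interface, the `resolveMainPage` helper, and the fallback to the old behaviour are not in MWoffliner, and the sample values are placeholders rather than real API output.

```typescript
// Only the two fields used here; the real siteinfo "general" object has many more.
interface SiteInfoGeneral {
  base: string
  mainpage?: string
}

// Hypothetical helper: prefer the title reported by the API and fall back to
// the old URL-based guess only when "mainpage" is missing.
function resolveMainPage(general: SiteInfoGeneral): string {
  if (general.mainpage) {
    return general.mainpage
  }
  return decodeURIComponent(general.base.split('/').pop() ?? '')
}

// Placeholder values for illustration, not copied from a live API response.
const rewrittenWiki: SiteInfoGeneral = { base: 'https://wiki.example.org/', mainpage: 'Example Main Page' }
const legacyWiki: SiteInfoGeneral = { base: 'https://wiki.example.org/wiki/Main_Page' }

const rewrittenTitle: string = resolveMainPage(rewrittenWiki) // taken from "mainpage"
const legacyTitle: string = resolveMainPage(legacyWiki) // falls back to the last URL segment
```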

Feb 14 '24 12:02 TripleCamera

The following description is mostly taken from my comment.

In v1.13.0 (I will test git main later), MWoffliner accepts three different APIs:

Mobile REST API: Only available in Wikimedia REST API.
Desktop REST API: Available in both Wikimedia REST API and MediaWiki REST API. However, MediaWiki REST API cannot be used without modifying the code.
- In Wikimedia REST API, the URL is /page/html/{title}.
- In MediaWiki REST API, the URL is /page/{title}/html.
In MWoffliner, it is hardcoded so that the page title can only come last. I tried modifying the code, and it seemed to work (it fails later :frowning_face:, but it looks promising).

Besides, @xtexChooser inspired me to try the Parsoid API, whose URL is /rest.php/{domain}/v3/page/html/{title}. However, requests to it are redirected to /rest.php/{domain}/v3/page/html/{title}/{latest_revision}. Since the response code is 302 rather than 200, the API is regarded as inaccessible.
VisualEditor API: Available in both Wikimedia REST API and MediaWiki REST API. Minecraft Wiki (zh) is supposed to be scraped in this way. However, it cannot work now because of the bug mentioned above.
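
The "title can only come last" limitation can be sketched as follows. This is an illustration, not MWoffliner code: `appendTitle` and `fillTitle` are hypothetical helpers, and the host and path prefixes are placeholders. Only the two path shapes (/page/html/{title} and /page/{title}/html) come from the list above.

```typescript
// Prefix-style concatenation, which only works when the title comes last.
function appendTitle(prefix: string, title: string): string {
  return prefix + encodeURIComponent(title)
}

// Hypothetical alternative: fill a "{title}" placeholder anywhere in the URL.
function fillTitle(template: string, title: string): string {
  return template.replace('{title}', encodeURIComponent(title))
}

// Placeholder host and prefixes; only the trailing path shapes matter here.
const wikimediaTemplate = 'https://wiki.example.org/api/rest_v1/page/html/{title}'
const mediawikiTemplate = 'https://wiki.example.org/rest.php/v1/page/{title}/html'

// Wikimedia style: both approaches produce the same URL.
const wikimediaUrl: string = fillTitle(wikimediaTemplate, 'Main Page')
const wikimediaByPrefix: string = appendTitle('https://wiki.example.org/api/rest_v1/page/html/', 'Main Page')

// MediaWiki style: "/html" follows the title, so no plain prefix can express it.
const mediawikiUrl: string = fillTitle(mediawikiTemplate, 'Main Page')
```

A template with a placeholder covers both layouts, which is roughly what my local modification had to achieve.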

Feb 20 '24 03:02 TripleCamera

I am currently testing git main.

@kelson42 switched to another scraper running git main. However, it failed because the arguments between v1.13.0 and git main differ. To fix this:

Unset --mwApiPath
Set --mwActionApiPath="api.php" (NO LEADING SLASH)
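
As a rough sketch, the argument change can be expressed as a small conversion. The `migrateArgs` helper is hypothetical, and carrying the old `mwApiPath` value over with its leading slash stripped is my assumption; the thread only establishes that `--mwApiPath` must be unset and `--mwActionApiPath` set to "api.php". The old value used below is a placeholder.

```typescript
type Args = Record<string, string>

// Hypothetical helper: convert a v1.13.0-style argument map to the shape that
// worked on git main in this thread.
function migrateArgs(oldArgs: Args): Args {
  // Drop mwApiPath and keep everything else unchanged.
  const { mwApiPath, ...rest } = oldArgs
  const migrated: Args = { ...rest }
  if (mwApiPath !== undefined) {
    // Strip leading slashes: "/api.php" becomes "api.php".
    migrated.mwActionApiPath = mwApiPath.replace(/^\/+/, '')
  }
  return migrated
}

// Placeholder old-style arguments, not the real recipe.
const migratedArgs: Args = migrateArgs({ mwUrl: 'https://zh.minecraft.wiki/', mwApiPath: '/api.php' })
// migratedArgs has mwUrl and mwActionApiPath ('api.php'), and no mwApiPath
```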

The next issue I encountered after fixing this was:

[error] [2024-02-20T03:24:45.973Z] Failed to run mwoffliner after [65s]: {
	"stack": "TypeError: articleListLines is not iterable\n    at createMainPage (file:///home/co-eda/mwoffliner-git/mwoffliner_main/lib/mwoffliner.lib.js:429:37)\n    at getMainPage (file:///home/co-eda/mwoffliner-git/mwoffliner_main/lib/mwoffliner.lib.js:466:54)\n    at doDump (file:///home/co-eda/mwoffliner-git/mwoffliner_main/lib/mwoffliner.lib.js:308:15)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async Module.execute (file:///home/co-eda/mwoffliner-git/mwoffliner_main/lib/mwoffliner.lib.js:261:17)",
	"message": "articleListLines is not iterable"
}

I modified mwoffliner.lib.js to print out articleListLines:

[log] [2024-02-20T03:42:06.004Z] articleListLines = undefined
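
The error itself is just what JavaScript does when a for...of loop runs over undefined. The sketch below reproduces that and shows a guard that would fail with a clearer message. Both functions are hypothetical illustrations and not the actual createMainPage code.

```typescript
// Iterating over undefined directly throws a TypeError, as in the log above.
function iterateUnchecked(lines: unknown): boolean {
  const seen: string[] = []
  try {
    for (const line of lines as string[]) {
      seen.push(line)
    }
    return false
  } catch (err) {
    return err instanceof TypeError
  }
}

// Hypothetical guard: fail early with a message that points at the likely cause.
function collectTitles(articleListLines: string[] | undefined): string[] {
  if (!Array.isArray(articleListLines)) {
    throw new Error('articleListLines is missing: is --articleList set, and was a main page title found?')
  }
  const titles: string[] = []
  for (const line of articleListLines) {
    const title = line.trim()
    if (title) {
      titles.push(title) // skip blank lines
    }
  }
  return titles
}

const threwTypeError: boolean = iterateUnchecked(undefined) // true
const cleaned: string[] = collectTitles([' Main_Page ', '', 'Creeper'])
```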

Feb 20 '24 03:02 TripleCamera

I finally found the cause of the issue: it is the same as before.

https://github.com/openzim/mwoffliner/blob/ad5dc1d6071552c1a9b577fa1a619cfba0fe6938/src/MediaWiki.ts#L413-L428

Since the logic for retrieving the main page remains unchanged, we still have to modify the code to make it work.

https://github.com/openzim/mwoffliner/blob/ad5dc1d6071552c1a9b577fa1a619cfba0fe6938/src/mwoffliner.lib.ts#L203-L204 https://github.com/openzim/mwoffliner/blob/ad5dc1d6071552c1a9b577fa1a619cfba0fe6938/src/mwoffliner.lib.ts#L609

In regular cases:

When --articleList is set, mainPage is set to empty, then createMainPage() is called, which reads the value of articleList.
When --articleList is not set, mainPage is set to mwMetaData.mainPage, then createMainPageRedirect() is called.

However, in this situation, mwMetaData.mainPage is empty, so createMainPage() is called even though --articleList is not set, which leads to the error mentioned above.
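
My reading of that branching can be condensed into the sketch below. It is a paraphrase of the behaviour described above, not the actual code: `chooseMainPageAction` is a hypothetical function, and only the names createMainPage and createMainPageRedirect come from MWoffliner.

```typescript
type MainPageAction = 'createMainPage' | 'createMainPageRedirect'

// Hypothetical condensed version of the decision described above.
function chooseMainPageAction(articleList: string | undefined, metaMainPage: string): MainPageAction {
  // With --articleList the main page title is deliberately left empty.
  const mainPage = articleList ? '' : metaMainPage
  // An empty title selects the article-list code path.
  return mainPage ? 'createMainPageRedirect' : 'createMainPage'
}

// Regular cases.
const withList: MainPageAction = chooseMainPageAction('titles.txt', 'Main_Page')
const withoutList: MainPageAction = chooseMainPageAction(undefined, 'Main_Page')

// MCW-zh case: no --articleList, but the detected main page is empty.
const emptyMainPage: MainPageAction = chooseMainPageAction(undefined, '')
```

In the last case createMainPage is chosen even though there is no article list to read, which matches the stack trace.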

@kelson42 Could you please create a pull request? (The solution is at the end of my first comment.)

Update: Checking API capabilities is no longer a problem in git main, since MediaWiki:Sidebar is always used:

https://github.com/openzim/mwoffliner/blob/ad5dc1d6071552c1a9b577fa1a619cfba0fe6938/src/MediaWiki.ts#L162

Feb 26 '24 14:02 TripleCamera

@TripleCamera Thank you! I will have a look at your analysis in the next few days.

Feb 26 '24 14:02 kelson42

@kelson42 How is everything going?

Mar 03 '24 10:03 TripleCamera

I fixed the main page issue and started a scrape on my machine. Two problems arose:

Failed to retrieve "资源包/Folders", the longest page on this wiki. However, later tests showed that the second longest page ("Minecraft Dungeons:API") can be retrieved. See Special:LongPages.

So, we need to exclude "资源包/Folders".

Error log:

[info] [2024-03-07T03:17:02.060Z] Getting article [资源包/Folders] from https://zh.minecraft.wiki/api.php?action=visualeditor&mobileformat=html&format=json&paction=parse&formatversion=2&page=%E8%B5%84%E6%BA%90%E5%8C%85%2FFolders
[info] [2024-03-07T03:17:02.061Z] Getting JSON from [https://zh.minecraft.wiki/api.php?action=visualeditor&mobileformat=html&format=json&paction=parse&formatversion=2&page=%E8%B5%84%E6%BA%90%E5%8C%85%2FFolders]
[error] [2024-03-07T03:17:04.205Z] Error downloading article 资源包/Folders
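
To retry a single page by hand, the request URL from the log line above can be rebuilt like this. The `visualEditorUrl` helper is a hypothetical name; the query parameters and their order are copied from the log.

```typescript
// Query parameters copied from the log line above, in the same order.
const VE_PARAMS: Array<[string, string]> = [
  ['action', 'visualeditor'],
  ['mobileformat', 'html'],
  ['format', 'json'],
  ['paction', 'parse'],
  ['formatversion', '2'],
]

// Hypothetical helper that reproduces the logged request URL for a title.
function visualEditorUrl(apiUrl: string, title: string): string {
  const pageParam: [string, string] = ['page', title]
  const query = VE_PARAMS.concat([pageParam])
    // encodeURIComponent also escapes '/' as %2F, as seen in the log.
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join('&')
  return `${apiUrl}?${query}`
}

const foldersUrl: string = visualEditorUrl('https://zh.minecraft.wiki/api.php', '资源包/Folders')
```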

API result:

{
    "error": {
        "code": "visualeditor-docserver-http",
        "info": "Error contacting the Parsoid/RESTBase server (HTTP 500): (no message)",
        "docref": "See https://zh.minecraft.wiki/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/&gt; for notice of API deprecations and breaking changes."
    },
    "servedby": "mediawiki-6ff94dc64-5tmqz"
}
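
Since the failure is reported inside the JSON body as an "error" object, a scraper-side check could look roughly like the sketch below. The `extractApiError` helper and the `ApiError` interface are hypothetical; the sample body is the API result above with the long "docref" field left out.

```typescript
interface ApiError {
  code: string
  info: string
}

// Hypothetical helper: return the error object if the response body carries one.
function extractApiError(body: unknown): ApiError | null {
  if (typeof body !== 'object' || body === null) return null
  const error = (body as { error?: unknown }).error
  if (typeof error !== 'object' || error === null) return null
  const { code, info } = error as { code?: unknown; info?: unknown }
  if (typeof code !== 'string' || typeof info !== 'string') return null
  return { code, info }
}

// The API result above, minus "docref".
const sampleBody: unknown = JSON.parse(
  '{"error":{"code":"visualeditor-docserver-http",' +
    '"info":"Error contacting the Parsoid/RESTBase server (HTTP 500): (no message)"},' +
    '"servedby":"mediawiki-6ff94dc64-5tmqz"}',
)

const apiError: ApiError | null = extractApiError(sampleBody)
// apiError.code is 'visualeditor-docserver-http'
```

Logging `code` and `info` next to the article title would make it clear that the HTTP 500 comes from the wiki's Parsoid backend rather than from the scraper.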

After my scrape failed, someone told me that both the API and the site had become slow for a while, so I suspected that the scraper was running too fast. I checked the history of the minecraftwiki_zh_all recipe and found that the --speed argument was originally set to "0.1" but was later removed. I will add the --speed argument back and try again.

Mar 07 '24 06:03 TripleCamera

@TripleCamera Sorry for not coming back to you earlier. It is not for lack of interest, but lack of time. I plan to look at your ticket in detail this weekend.

Mar 07 '24 06:03 kelson42

Thank you! After fixing the issues mentioned above, the scraper was running smoothly. However, I had to stop it because I don't have a lot of time either. It is estimated to finish in 5 hours (using the config below).

Here is a list of things I have done so far:

Fix the main page issue in the code (See my first comment)
Unset --mwApiPath
Set --mwActionApiPath="api.php" (NO LEADING SLASH)
~~Set --articleListToIgnore="资源包/Folders"~~ This page has been deleted
Set --speed to an appropriate value (I was using 0.5 and I couldn't sense significant changes on page load time)

Could you please apply these changes and relaunch the scraper? Next I have to rely on openZIM's scraper.

Mar 09 '24 12:03 TripleCamera

Any progress so far?

@kelson42

Mar 28 '24 07:03 winstonsung

Great, Kelson is back. It seems that this task can move forward a little bit more. :blush:

Update: @kelson42 Hello?

Apr 09 '24 14:04 TripleCamera

@kelson42 Hi. Have you been busy recently? Maybe you can assign this task to your colleagues (if they are free).

Apr 22 '24 04:04 TripleCamera

Hi. I just created a pull request which contains the patch. Can someone review & merge it? @kelson42

May 11 '24 01:05 TripleCamera

NOT DONE: The mwApiPath => mwActionApiPath issue hasn't been fixed.

Jul 14 '24 16:07 winstonsung

Thank you, Winston! The scraper was launched again, but failed immediately. This is because some arguments are still not correct.

Here are the rest of the things to do (taken from my previous comment):

[ ] Unset --mwApiPath
[ ] Set --mwActionApiPath="api.php" (NO LEADING SLASH)
[ ] Set --speed to an appropriate value (I was using 0.5 and I couldn't sense significant changes on page load time)

@kelson42 @audiodude Could you please reopen this issue? Thank you!

Jul 15 '24 03:07 TripleCamera

(Please reopen this issue, @audiodude.)

Jul 22 '24 05:07 winstonsung

I wonder if this "no leading slash" should be considered a bug. I would recommend opening an issue to discuss whether this is a bug and whether we should fix it.

Jul 22 '24 07:07 kelson42

In MediaWiki's Special:Version, the Action API, and $wgArticlePath, the article path contains (or requires) the leading slash, so I guess it could be counted as a bug?

https://www.mediawiki.org/wiki/Special:Version
https://www.mediawiki.org/w/api.php?action=query&meta=siteinfo&siprop=general
https://www.mediawiki.org/wiki/Manual:$wgArticlePath

Jul 22 '24 07:07 winstonsung

Well, I'm not saying that this is a bug. The thing is that the scraper still has incorrect arguments. I'm not sure where to track this: should we track it here, at openzim/zim-requests#755, or somewhere else?

Jul 22 '24 11:07 TripleCamera

@TripleCamera It would be much easier to have the full command/log.

Jul 22 '24 13:07 kelson42

The issue should be tracked and discussed at https://github.com/openzim/zim-requests/issues/755 if it affects a specific ZIM recipe that has the wrong arguments, or is otherwise configured incorrectly.

In the scenario that it is not possible at all to configure the ZIM recipe correctly, because of limitations of mwoffliner, such an issue should be tracked here.

It sounds like your remaining problems are all recipe/parameter related and not related to the code of mwoffliner.

Finally, keep in mind that the code at main/HEAD of this repo is for dev/1.14, while the version used for ZIM recipes is still 1.13. I think you are already aware of this because of mwApiPath versus mwActionApiPath. It should be noted that getting the recipe to work locally on 1.14 will likely not help you debug the live recipe. However, you can always check out the 1.13 tag (https://github.com/openzim/mwoffliner/tree/v1.13.0) and test there.

Jul 22 '24 15:07 audiodude

To follow up further: these changes to make the scraping process work only affect version dev/1.14 which is the code in main of this repo. The steps to make 1.13 work are likely different.

Closing this issue as the scraping is reportedly working on HEAD. Please follow up on https://github.com/openzim/zim-requests/issues/755 to update/fix the live recipe.

Jul 22 '24 15:07 audiodude

OK. Thanks!

Jul 23 '24 00:07 TripleCamera
