
Troubleshooting minecraftwiki_zh_all recipe

Open TripleCamera opened this issue 1 year ago • 12 comments

Note: This has only been tested on MWoffliner v1.13.0 (the version all openZIM scrapers currently use). Both the code and the config differ a lot between v1.13.0 and git main, so this still needs to be tested on git main.

The following description is mostly taken from my comment when troubleshooting the scrape for Minecraft Wiki (zh) (openzim/zim-requests#755).


The scraper reports "Unable to find appropriate API end-point to retrieve article HTML" when scraping Minecraft Wiki (zh). Here is a code analysis of MWoffliner v1.13.0.

Before the scrape starts, MWoffliner checks mobile REST API, desktop REST API, and VE REST API capabilities for a specific page (parameter testArticleId) in Downloader.checkCapabilities:

https://github.com/openzim/mwoffliner/blob/e9d4113536f0eebdaabe8cc26e25ccdeeca20e32/src/Downloader.ts#L243-L263

The default value MediaWiki:Sidebar is never used because the value of mwMetaData.mainPage is passed:

https://github.com/openzim/mwoffliner/blob/e9d4113536f0eebdaabe8cc26e25ccdeeca20e32/src/mwoffliner.lib.ts#L206

The value of mwMetaData.mainPage is derived from the API result: the base URL is split on "/" and its last part is taken. (This is fragile because different wikis use different URL rewrites.)

https://github.com/openzim/mwoffliner/blob/e9d4113536f0eebdaabe8cc26e25ccdeeca20e32/src/MediaWiki.ts#L290-L325 https://github.com/openzim/mwoffliner/blob/e9d4113536f0eebdaabe8cc26e25ccdeeca20e32/src/MediaWiki.ts#L235-L279

This works for many wikis, such as English Wikipedia, but not for the Chinese Minecraft Wiki. The reason is that MCW-zh uses a URL rewrite, so its base URL ends with "/" and the last part is empty:

// Wikipedia-en
"base": "https://en.wikipedia.org/wiki/Main_Page",
// MCW-zh
"base": "https://zh.minecraft.wiki/",

There are two ways to fix this:

  1. ⭐ Set mwMetaData.mainPage to entries.mainpage, which is already included in the API result. (MediaWiki documentation)
    -const mainPage = decodeURIComponent(entries.base.split('/').pop())
    +const mainPage = entries.mainpage
    
  2. Use the default parameter for Downloader.checkCapabilities:
    -await downloader.checkCapabilities(mwMetaData.mainPage)
    +await downloader.checkCapabilities()
    

I have tested both, and both worked.
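
To make the difference concrete, here is a minimal TypeScript sketch of the two derivations. It is not MWoffliner's actual code: the `SiteInfoGeneral` type and the function names are made up for illustration, and the `mainpage` value used for MCW-zh is a placeholder rather than the title the API really returns.

```typescript
// Illustrative subset of the siteinfo "general" fields (type name is hypothetical).
interface SiteInfoGeneral {
  base: string;
  mainpage: string;
}

// v1.13.0 approach: take the last "/"-separated part of the base URL.
function mainPageFromBase(entries: SiteInfoGeneral): string {
  return decodeURIComponent(entries.base.split("/").pop() ?? "");
}

// Fix 1: use the title that the API already reports.
function mainPageFromApi(entries: SiteInfoGeneral): string {
  return entries.mainpage;
}

const wikipediaEn: SiteInfoGeneral = {
  base: "https://en.wikipedia.org/wiki/Main_Page",
  mainpage: "Main Page",
};

const mcwZh: SiteInfoGeneral = {
  base: "https://zh.minecraft.wiki/",
  mainpage: "PLACEHOLDER_MAIN_PAGE_TITLE", // assumption: stands in for the real title
};

console.log(JSON.stringify(mainPageFromBase(wikipediaEn))); // "Main_Page"
console.log(JSON.stringify(mainPageFromBase(mcwZh))); // "" (nothing follows the last slash)
console.log(JSON.stringify(mainPageFromApi(mcwZh))); // the placeholder title
```

With the base-URL approach, MCW-zh yields an empty title, which matches the failure described above. Reading `mainpage` does not depend on how the wiki rewrites its URLs.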

TripleCamera avatar Feb 14 '24 12:02 TripleCamera

The following description is mostly taken from my comment.


In v1.13.0 (I will test git main later), MWoffliner accepts three different APIs:

  • Mobile REST API: Only available in Wikimedia REST API.

  • Desktop REST API: Available in both Wikimedia REST API and MediaWiki REST API. However, MediaWiki REST API cannot be used without modifying the code.

    In MWoffliner, the URL is hardcoded so that the page title can only come last. I tried modifying the code, and it seemed to work (it fails later :frowning_face:, but it looks promising). (Screenshot 2024-02-14 215918)

    Besides, @xtexChooser inspired me to try the Parsoid API, whose URL is /rest.php/{domain}/v3/page/html/{title}. However, this URL redirects to /rest.php/{domain}/v3/page/html/{title}/{latest_revision}. Since the response code is 302, not 200, it is regarded as inaccessible.

  • VisualEditor API: Available in both Wikimedia REST API and MediaWiki REST API. Minecraft Wiki (zh) is supposed to be scraped in this way. However, it cannot work now because of the bug mentioned above.
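
As a rough sketch of the Parsoid case above: the URL follows the template quoted in the second bullet, and a check that accepts only HTTP 200 rejects the 302 redirect. The function names, the strict `=== 200` rule, and the assumption that `rest.php` sits at the site root are all illustrative. The real check in MWoffliner may be written differently.

```typescript
// Build a Parsoid page URL from the template
// /rest.php/{domain}/v3/page/html/{title}.
function parsoidHtmlUrl(origin: string, domain: string, title: string): string {
  return `${origin}/rest.php/${domain}/v3/page/html/${encodeURIComponent(title)}`;
}

// Assumed capability rule: only a direct 200 counts as "available",
// so a 302 redirect to .../{title}/{latest_revision} is rejected.
function countsAsAvailable(status: number): boolean {
  return status === 200;
}

const parsoidUrl = parsoidHtmlUrl(
  "https://zh.minecraft.wiki", // assumption: rest.php is served from the site root
  "zh.minecraft.wiki",
  "MediaWiki:Sidebar",
);

console.log(parsoidUrl);
console.log(countsAsAvailable(302)); // false
console.log(countsAsAvailable(200)); // true
```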

TripleCamera avatar Feb 20 '24 03:02 TripleCamera

I am currently testing git main.

@kelson42 switched to another scraper running git main. However, it failed because the arguments between v1.13.0 and git main differ. To fix this:

  1. Unset --mwApiPath
  2. Set --mwActionApiPath="api.php" (NO LEADING SLASH)
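
The same change, written as a small TypeScript helper over a recipe's argument map. This is only a sketch: `RecipeArgs` and `migrateToMain` are hypothetical names, the old `mwApiPath` value shown is assumed, and only the two flags mentioned above are touched.

```typescript
type RecipeArgs = Record<string, string>;

// Drop the v1.13.0-only flag and set the git main flag
// to a value without a leading slash.
function migrateToMain(args: RecipeArgs): RecipeArgs {
  const migrated: RecipeArgs = { ...args };
  delete migrated["mwApiPath"];
  migrated["mwActionApiPath"] = "api.php";
  return migrated;
}

const oldArgs: RecipeArgs = {
  mwUrl: "https://zh.minecraft.wiki/",
  mwApiPath: "/api.php", // assumption: the value used in the old recipe
};

const newArgs = migrateToMain(oldArgs);
console.log(newArgs);
```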

The next issue I encountered after fixing this was:

[error] [2024-02-20T03:24:45.973Z] Failed to run mwoffliner after [65s]: {
	"stack": "TypeError: articleListLines is not iterable\n    at createMainPage (file:///home/co-eda/mwoffliner-git/mwoffliner_main/lib/mwoffliner.lib.js:429:37)\n    at getMainPage (file:///home/co-eda/mwoffliner-git/mwoffliner_main/lib/mwoffliner.lib.js:466:54)\n    at doDump (file:///home/co-eda/mwoffliner-git/mwoffliner_main/lib/mwoffliner.lib.js:308:15)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async Module.execute (file:///home/co-eda/mwoffliner-git/mwoffliner_main/lib/mwoffliner.lib.js:261:17)",
	"message": "articleListLines is not iterable"
}

I modified mwoffliner.lib.js to print out articleListLines:

[log] [2024-02-20T03:42:06.004Z] articleListLines = undefined

TripleCamera avatar Feb 20 '24 03:02 TripleCamera

I finally found the cause of the issue: it is the same as before.

https://github.com/openzim/mwoffliner/blob/ad5dc1d6071552c1a9b577fa1a619cfba0fe6938/src/MediaWiki.ts#L413-L428

Since the logic for retrieving the main page remains unchanged, we still have to modify the code to make it work.

https://github.com/openzim/mwoffliner/blob/ad5dc1d6071552c1a9b577fa1a619cfba0fe6938/src/mwoffliner.lib.ts#L203-L204 https://github.com/openzim/mwoffliner/blob/ad5dc1d6071552c1a9b577fa1a619cfba0fe6938/src/mwoffliner.lib.ts#L609

In regular cases:

  • When --articleList is set, mainPage is set to empty, then createMainPage() is called, which reads the value of articleList.
  • When --articleList is not set, mainPage is set to mwMetaData.mainPage, then createMainPageRedirect() is called.

However, in this situation, mwMetaData.mainPage is empty, so createMainPage() is called even though --articleList is not set. articleListLines is therefore undefined, which leads to the error mentioned above.
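
The branching described above can be sketched in a few lines of TypeScript. This is a simplification for illustration, not the real code: `pickMainPageStep` and `MainPageStep` are hypothetical names, and the actual logic lives in the mwoffliner.lib.ts lines linked above.

```typescript
type MainPageStep = "createMainPage" | "createMainPageRedirect";

// Mirrors the two regular cases: an article list forces an empty mainPage,
// otherwise mainPage comes from the wiki metadata.
function pickMainPageStep(
  articleList: string | undefined,
  metaMainPage: string,
): MainPageStep {
  const mainPage = articleList ? "" : metaMainPage;
  return mainPage ? "createMainPageRedirect" : "createMainPage";
}

console.log(pickMainPageStep("list.txt", "Main_Page")); // createMainPage
console.log(pickMainPageStep(undefined, "Main_Page")); // createMainPageRedirect
// MCW-zh case: no article list AND an empty main page title,
// so createMainPage is chosen with nothing to iterate over.
console.log(pickMainPageStep(undefined, "")); // createMainPage
```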

@kelson42 Could you please create a pull request? (The solution is at the end of my first comment.)


Update: Checking API capabilities is no longer a problem in git main, since MediaWiki:Sidebar is always used:

https://github.com/openzim/mwoffliner/blob/ad5dc1d6071552c1a9b577fa1a619cfba0fe6938/src/MediaWiki.ts#L162

TripleCamera avatar Feb 26 '24 14:02 TripleCamera

@TripleCamera Thank you! I will have a look at your analysis in the next few days.

kelson42 avatar Feb 26 '24 14:02 kelson42

@TripleCamera Thank you! I will have a look at your analysis in the next few days.

@kelson42 How is everything going?

TripleCamera avatar Mar 03 '24 10:03 TripleCamera

I fixed the main page issue and started a scrape on my machine. Two problems arose:

  1. Failed to retrieve "资源包/Folders", the longest page on this wiki. However, later tests showed that the second longest page ("Minecraft Dungeons:API") can be retrieved. See Special:LongPages.

    So, we need to exclude "资源包/Folders".

    Error log:

    [info] [2024-03-07T03:17:02.060Z] Getting article [资源包/Folders] from https://zh.minecraft.wiki/api.php?action=visualeditor&mobileformat=html&format=json&paction=parse&formatversion=2&page=%E8%B5%84%E6%BA%90%E5%8C%85%2FFolders
    [info] [2024-03-07T03:17:02.061Z] Getting JSON from [https://zh.minecraft.wiki/api.php?action=visualeditor&mobileformat=html&format=json&paction=parse&formatversion=2&page=%E8%B5%84%E6%BA%90%E5%8C%85%2FFolders]
    [error] [2024-03-07T03:17:04.205Z] Error downloading article 资源包/Folders
    

    API result:

    {
        "error": {
            "code": "visualeditor-docserver-http",
            "info": "Error contacting the Parsoid/RESTBase server (HTTP 500): (no message)",
            "docref": "See https://zh.minecraft.wiki/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes."
        },
        "servedby": "mediawiki-6ff94dc64-5tmqz"
    }
    
  2. After my scrape failed, someone told me that both the API and the site had become slow for a while. I suspected that the scraper was too fast, so I checked the history of the minecraftwiki_zh_all recipe and found that the --speed argument was initially set to "0.1" but was later removed. I will add the --speed argument back and try again.
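
For the first problem, the failing request and the error payload can be reproduced offline with a short TypeScript sketch. The helper names (`visualEditorUrl`, `apiErrorCode`, `ApiErrorBody`) are hypothetical. The query string copies the one in the log above, and the sample body is a trimmed version of the API result.

```typescript
// Build the VisualEditor API URL in the shape shown in the log above.
function visualEditorUrl(apiUrl: string, title: string): string {
  return (
    `${apiUrl}?action=visualeditor&mobileformat=html&format=json` +
    `&paction=parse&formatversion=2&page=${encodeURIComponent(title)}`
  );
}

// Only the fields used here; the real response has more.
interface ApiErrorBody {
  error?: { code: string; info: string };
}

// Return the API error code if the JSON body carries one, else null.
function apiErrorCode(body: ApiErrorBody): string | null {
  return body.error ? body.error.code : null;
}

const failingUrl = visualEditorUrl("https://zh.minecraft.wiki/api.php", "资源包/Folders");

const sampleBody: ApiErrorBody = JSON.parse(
  '{"error":{"code":"visualeditor-docserver-http",' +
    '"info":"Error contacting the Parsoid/RESTBase server (HTTP 500): (no message)"}}',
);

console.log(failingUrl);
console.log(apiErrorCode(sampleBody)); // visualeditor-docserver-http
```

A check like `apiErrorCode` makes it easier to tell a server-side Parsoid failure from a network problem when deciding whether to exclude a page.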

TripleCamera avatar Mar 07 '24 06:03 TripleCamera

@TripleCamera Sorry for not getting back to you earlier; it is not for lack of interest but for lack of time. I plan to look at your ticket in detail this weekend.

kelson42 avatar Mar 07 '24 06:03 kelson42

@TripleCamera Sorry for not getting back to you earlier; it is not for lack of interest but for lack of time. I plan to look at your ticket in detail this weekend.

Thank you! After fixing the issues mentioned above, the scraper was running smoothly. However, I had to stop it because I don't have a lot of time either. It is estimated to finish in 5 hours (using the config below).

Here is a list of things I have done so far:

  • Fix the main page issue in the code (See my first comment)
  • Unset --mwApiPath
  • Set --mwActionApiPath="api.php" (NO LEADING SLASH)
  • ~~Set --articleListToIgnore="资源包/Folders"~~ This page has been deleted
  • Set --speed to an appropriate value (I was using 0.5 and I couldn't sense significant changes on page load time)

Could you please apply these changes and relaunch the scraper? Next I have to rely on openZIM's scraper.

TripleCamera avatar Mar 09 '24 12:03 TripleCamera

Any progress so far?

@kelson42

winstonsung avatar Mar 28 '24 07:03 winstonsung

Great, Kelson is back. It seems that this task can move forward a little bit more. :blush:


Update: @kelson42 Hello?

TripleCamera avatar Apr 09 '24 14:04 TripleCamera

@kelson42 Hi. Have you been busy recently? Maybe you can assign this task to your colleagues (if they are free).

TripleCamera avatar Apr 22 '24 04:04 TripleCamera

Hi. I just created a pull request which contains the patch. Can someone review & merge it? @kelson42

TripleCamera avatar May 11 '24 01:05 TripleCamera

NOT DONE: The mwApiPath => mwActionApiPath issue hasn't been fixed.

winstonsung avatar Jul 14 '24 16:07 winstonsung

Thank you, Winston! The scraper was launched again, but failed immediately. This is because some arguments are still not correct.

Here are the rest of the things to do (taken from my previous comment):

  • [ ] Unset --mwApiPath
  • [ ] Set --mwActionApiPath="api.php" (NO LEADING SLASH)
  • [ ] Set --speed to an appropriate value (I was using 0.5 and I couldn't sense significant changes on page load time)

@kelson42 @audiodude Could you please reopen this issue? Thank you!

TripleCamera avatar Jul 15 '24 03:07 TripleCamera

(please reopen this issue @audiodude .)

winstonsung avatar Jul 22 '24 05:07 winstonsung

I wonder if this "no leading slash" should be considered a bug. I would recommend opening an issue to discuss whether this is a bug and whether we should fix it.

kelson42 avatar Jul 22 '24 07:07 kelson42

MediaWiki's Special:Version, the Action API, and $wgArticlePath all contain or require the leading slash in the article path, so I guess it could be counted as a bug?

  • https://www.mediawiki.org/wiki/Special:Version
  • https://www.mediawiki.org/w/api.php?action=query&meta=siteinfo&siprop=general
  • https://www.mediawiki.org/wiki/Manual:$wgArticlePath

winstonsung avatar Jul 22 '24 07:07 winstonsung

I wonder if this "no leading slash" should be considered a bug. I would recommend opening an issue to discuss whether this is a bug and whether we should fix it.

Well, I'm not saying that this is a bug. The thing is that the scraper still has incorrect arguments. I'm not sure where to track this: should we track it here, at openzim/zim-requests#755, or somewhere else?

TripleCamera avatar Jul 22 '24 11:07 TripleCamera

@TripleCamera It would be much easier with the full command/log.

kelson42 avatar Jul 22 '24 13:07 kelson42

I wonder if this "no leading slash" should be considered a bug. I would recommend opening an issue to discuss whether this is a bug and whether we should fix it.

Well, I'm not saying that this is a bug. The thing is that the scraper still has incorrect arguments. I'm not sure where to track this: should we track it here, at openzim/zim-requests#755, or somewhere else?

The issue should be tracked and discussed at https://github.com/openzim/zim-requests/issues/755 if it affects a specific ZIM recipe that has the wrong arguments, or is otherwise configured incorrectly.

In the scenario that it is not possible at all to configure the ZIM recipe correctly, because of limitations of mwoffliner, such an issue should be tracked here.

It sounds like your remaining problems are all recipe/parameter related and not related to the code of mwoffliner.

Finally, keep in mind that the code at main/HEAD of this repo is for dev/1.14, while the version used for ZIM recipes is still 1.13. I think you are already aware of this because of mwApiPath versus mwActionApiPath. Note that getting the recipe to work locally on 1.14 will likely not help you debug the live recipe. However, you can always check out the 1.13 tag (https://github.com/openzim/mwoffliner/tree/v1.13.0) and test there.

audiodude avatar Jul 22 '24 15:07 audiodude

@TripleCamera Sorry for not getting back to you earlier; it is not for lack of interest but for lack of time. I plan to look at your ticket in detail this weekend.

Thank you! After fixing the issues mentioned above, the scraper was running smoothly. However, I had to stop it because I don't have a lot of time either. It is estimated to finish in 5 hours (using the config below).

Here is a list of things I have done so far:

  • Fix the main page issue in the code (See my first comment)
  • Unset --mwApiPath
  • Set --mwActionApiPath="api.php" (NO LEADING SLASH)
  • ~Set --articleListToIgnore="资源包/Folders"~ This page has been deleted
  • Set --speed to an appropriate value (I was using 0.5 and I couldn't sense significant changes on page load time)

Could you please apply these changes and relaunch the scraper? Next I have to rely on openZIM's scraper.

To follow up further: these changes to make the scraping process work only affect version dev/1.14 which is the code in main of this repo. The steps to make 1.13 work are likely different.

Closing this issue as the scraping is reportedly working on HEAD. Please follow up on https://github.com/openzim/zim-requests/issues/755 to update/fix the live recipe.

audiodude avatar Jul 22 '24 15:07 audiodude

OK. Thanks!

TripleCamera avatar Jul 23 '24 00:07 TripleCamera