mwoffliner
Troubleshooting minecraftwiki_zh_all recipe
Note: This has only been tested on MWoffliner v1.13.0 (since all openZIM scrapers are using this version). Both the code and the config differ a lot between v1.13.0 and git main, so this still needs to be tested on git main.
The following description is mostly taken from my comment when troubleshooting the scrape for Minecraft Wiki (zh) (openzim/zim-requests#755).
The scraper reports "Unable to find appropriate API end-point to retrieve article HTML" when scraping Minecraft Wiki (zh). Here is a code analysis of MWoffliner v1.13.0.
Before the scrape starts, MWoffliner checks the mobile REST API, desktop REST API, and VE REST API capabilities for a specific page (parameter testArticleId) in Downloader.checkCapabilities:
https://github.com/openzim/mwoffliner/blob/e9d4113536f0eebdaabe8cc26e25ccdeeca20e32/src/Downloader.ts#L243-L263
The default value MediaWiki:Sidebar is never used because the value of mwMetaData.mainPage is passed:
https://github.com/openzim/mwoffliner/blob/e9d4113536f0eebdaabe8cc26e25ccdeeca20e32/src/mwoffliner.lib.ts#L206
The value of mwMetaData.mainPage comes from the API: the base URL is split on "/" and its last part is taken. (This is a bad idea because different wikis have different URL rewrites.)
https://github.com/openzim/mwoffliner/blob/e9d4113536f0eebdaabe8cc26e25ccdeeca20e32/src/MediaWiki.ts#L290-L325
https://github.com/openzim/mwoffliner/blob/e9d4113536f0eebdaabe8cc26e25ccdeeca20e32/src/MediaWiki.ts#L235-L279
This works for many wikis like English Wikipedia, but not for Chinese Minecraft Wiki. The reason is that MCW-zh has a URL rewrite, so its base URL does not end with a page title:
// Wikipedia-en
"base": "https://en.wikipedia.org/wiki/Main_Page",
// MCW-zh
"base": "https://zh.minecraft.wiki/",
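To make the difference concrete, here is a minimal sketch that applies the v1.13.0 expression quoted in this comment to both base URLs. The function name mainPageFromBase is made up for illustration; only the expression inside it comes from the code under discussion.

```typescript
// Mirrors the v1.13.0 expression:
//   decodeURIComponent(entries.base.split('/').pop())
// `mainPageFromBase` is a hypothetical name used only for this sketch.
function mainPageFromBase(base: string): string {
  const lastSegment = base.split('/').pop() ?? ''
  return decodeURIComponent(lastSegment)
}

const wikipediaEn = mainPageFromBase('https://en.wikipedia.org/wiki/Main_Page')
const minecraftZh = mainPageFromBase('https://zh.minecraft.wiki/')

console.log(JSON.stringify(wikipediaEn)) // "Main_Page"
console.log(JSON.stringify(minecraftZh)) // "" (the URL ends with a slash, so the last part is empty)
```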
There are two ways to fix this:
- ⭐ Set mwMetaData.mainPage to entries.mainpage, which is already included in the API result. (MediaWiki documentation)
  -const mainPage = decodeURIComponent(entries.base.split('/').pop())
  +const mainPage = entries.mainpage
- Use the default parameter for Downloader.checkCapabilities:
  -await downloader.checkCapabilities(mwMetaData.mainPage)
  +await downloader.checkCapabilities()
I have tested both, and both worked.
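A side note on why the default is never reached: a default parameter only applies when the argument is missing or undefined, not when it is an empty string. The sketch below uses a stand-in function (pickTestArticle is a hypothetical name, not the real Downloader.checkCapabilities) with the same default value.

```typescript
// Stand-in for a method with a defaulted parameter, as described above.
// `pickTestArticle` is hypothetical; it only returns the value it would test.
function pickTestArticle(testArticleId = 'MediaWiki:Sidebar'): string {
  return testArticleId
}

const fromEmpty = pickTestArticle('') // an empty main page is passed through as-is
const fromMissing = pickTestArticle() // no argument, so the default applies

console.log(JSON.stringify(fromEmpty))   // ""
console.log(JSON.stringify(fromMissing)) // "MediaWiki:Sidebar"
```

This is why the second fix (calling the method without an argument) has the same practical effect as supplying a valid title.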
The following description is mostly taken from my comment.
In v1.13.0 (I will test git main later), MWoffliner accepts three different APIs:
- Mobile REST API: Only available in the Wikimedia REST API.
- Desktop REST API: Available in both the Wikimedia REST API and the MediaWiki REST API. However, the MediaWiki REST API cannot be used without modifying the code.
  - In the Wikimedia REST API, the URL is /page/html/{title}.
  - In the MediaWiki REST API, the URL is /page/{title}/html.
  In MWoffliner, it is hardcoded so that the page title can only come last. I tried modifying the code, and it seemed to succeed (it fails later :frowning_face:, but it seems promising).
  Besides, @xtexChooser inspired me to try the Parsoid API, whose URL is /rest.php/{domain}/v3/page/html/{title}. However, this is redirected to /rest.php/{domain}/v3/page/html/{title}/{latest_revision}. Since the response code is 302, not 200, it is regarded as inaccessible.
- VisualEditor API: Available in both Wikimedia REST API and MediaWiki REST API. Minecraft Wiki (zh) is supposed to be scraped in this way. However, it cannot work now because of the bug mentioned above.
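The two desktop URL shapes could be handled by treating the path as a template instead of appending the title at the end. This is only a sketch of that idea, not the change I made locally: buildArticleUrl and countsAsAccessible are hypothetical names, and the status check simply models the "200 only" behaviour described above.

```typescript
// Hypothetical helper: fill a '{title}' placeholder wherever it appears in the path.
function buildArticleUrl(template: string, title: string): string {
  // A replacer function keeps the encoded title from being parsed for '$' patterns.
  return template.replace('{title}', () => encodeURIComponent(title))
}

// Models the behaviour described above: only a plain 200 counts as accessible,
// so a 302 redirect (as returned by the Parsoid URL) is rejected.
function countsAsAccessible(status: number): boolean {
  return status === 200
}

const wikimediaRestUrl = buildArticleUrl('/page/html/{title}', 'Main_Page')
const mediaWikiRestUrl = buildArticleUrl('/page/{title}/html', 'Main_Page')

console.log(wikimediaRestUrl)        // /page/html/Main_Page
console.log(mediaWikiRestUrl)        // /page/Main_Page/html
console.log(countsAsAccessible(302)) // false
```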
I am currently testing git main.
@kelson42 switched to another scraper running git main. However, it failed because the arguments differ between v1.13.0 and git main. To fix this:
- Unset --mwApiPath
- Set --mwActionApiPath="api.php" (NO LEADING SLASH)
The next issue I encountered after fixing this was:
[error] [2024-02-20T03:24:45.973Z] Failed to run mwoffliner after [65s]: {
"stack": "TypeError: articleListLines is not iterable\n at createMainPage (file:///home/co-eda/mwoffliner-git/mwoffliner_main/lib/mwoffliner.lib.js:429:37)\n at getMainPage (file:///home/co-eda/mwoffliner-git/mwoffliner_main/lib/mwoffliner.lib.js:466:54)\n at doDump (file:///home/co-eda/mwoffliner-git/mwoffliner_main/lib/mwoffliner.lib.js:308:15)\n at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n at async Module.execute (file:///home/co-eda/mwoffliner-git/mwoffliner_main/lib/mwoffliner.lib.js:261:17)",
"message": "articleListLines is not iterable"
}
I modified mwoffliner.lib.js to print out articleListLines:
[log] [2024-02-20T03:42:06.004Z] articleListLines = undefined
Finally, I found the cause of the issue: it is the same as before.
https://github.com/openzim/mwoffliner/blob/ad5dc1d6071552c1a9b577fa1a619cfba0fe6938/src/MediaWiki.ts#L413-L428
Since the logic for retrieving the main page remains unchanged, we still have to modify the code to make it work.
https://github.com/openzim/mwoffliner/blob/ad5dc1d6071552c1a9b577fa1a619cfba0fe6938/src/mwoffliner.lib.ts#L203-L204
https://github.com/openzim/mwoffliner/blob/ad5dc1d6071552c1a9b577fa1a619cfba0fe6938/src/mwoffliner.lib.ts#L609
In regular cases:
- When --articleList is set, mainPage is set to empty, then createMainPage() is called, which reads the value of articleList.
- When --articleList is not set, mainPage is set to mwMetaData.mainPage, then createMainPageRedirect() is called.
However, in this situation, mwMetaData.mainPage is empty, so createMainPage() is called even though --articleList is not set, which leads to the error mentioned above.
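The branching can be modelled with a small sketch. This is a simplified stand-in, not the real mwoffliner code: buildMainPage is a hypothetical function, and the returned strings only label which branch was taken.

```typescript
// Simplified model of the branching described above (hypothetical function).
// metaMainPage stands for mwMetaData.mainPage; articleListLines is undefined
// when --articleList is not set.
function buildMainPage(metaMainPage: string, articleListLines?: string[]): string {
  const mainPage = articleListLines ? '' : metaMainPage
  if (mainPage) {
    // Corresponds to the createMainPageRedirect() path.
    return `redirect -> ${mainPage}`
  }
  // Corresponds to the createMainPage() path, which iterates over the article list.
  const titles: string[] = []
  for (const line of articleListLines as string[]) {
    titles.push(line)
  }
  return `generated index of ${titles.length} article(s)`
}

console.log(buildMainPage('Main_Page'))         // redirect -> Main_Page
console.log(buildMainPage('ignored', ['A', 'B'])) // generated index of 2 article(s)

// Empty main page and no article list: the loop runs over undefined and throws a TypeError.
try {
  buildMainPage('')
} catch (err) {
  console.log(err instanceof TypeError) // true
}
```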
@kelson42 Could you please create a pull request? (The solution is at the end of my first comment.)
Update: Checking API capabilities is no longer a problem in git main, since MediaWiki:Sidebar is always used:
https://github.com/openzim/mwoffliner/blob/ad5dc1d6071552c1a9b577fa1a619cfba0fe6938/src/MediaWiki.ts#L162
@TripleCamera Thank you! I will have a look at your analysis in the next few days.
@kelson42 How is everything going?
I fixed the main page issue and started a scrape on my machine. Two problems arose:
- Failed to retrieve "资源包/Folders", the longest page on this wiki. However, later tests showed that the second longest page ("Minecraft Dungeons:API") can be retrieved. See Special:LongPages.
  So, we need to exclude "资源包/Folders".
  Error log:
  [info] [2024-03-07T03:17:02.060Z] Getting article [资源包/Folders] from https://zh.minecraft.wiki/api.php?action=visualeditor&mobileformat=html&format=json&paction=parse&formatversion=2&page=%E8%B5%84%E6%BA%90%E5%8C%85%2FFolders
  [info] [2024-03-07T03:17:02.061Z] Getting JSON from [https://zh.minecraft.wiki/api.php?action=visualeditor&mobileformat=html&format=json&paction=parse&formatversion=2&page=%E8%B5%84%E6%BA%90%E5%8C%85%2FFolders]
  [error] [2024-03-07T03:17:04.205Z] Error downloading article 资源包/Folders
  API result:
  {
    "error": {
      "code": "visualeditor-docserver-http",
      "info": "Error contacting the Parsoid/RESTBase server (HTTP 500): (no message)",
      "docref": "See https://zh.minecraft.wiki/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes."
    },
    "servedby": "mediawiki-6ff94dc64-5tmqz"
  }
- After my scrape failed, someone told me that both the API and the site became slow for a while. I suspected that the scraper was too fast. So I checked the history of the minecraftwiki_zh_all recipe. Then I found that at first, the --speed argument was set to "0.1", but later it was removed. I will add the --speed argument and try again.
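For the first problem, the failure is reported inside the JSON body, so it can be recognised by looking at the error object. A minimal sketch, assuming only the fields visible in the API result above; describeApiError and the ActionApiResponse type are hypothetical names, and the empty success body is a placeholder rather than a real response.

```typescript
// Only the fields seen in the API result above; everything else is ignored.
interface ActionApiResponse {
  error?: { code: string; info: string }
}

// Hypothetical helper: return a short description when the body carries an error.
function describeApiError(body: ActionApiResponse): string | null {
  return body.error ? `${body.error.code}: ${body.error.info}` : null
}

const failedBody: ActionApiResponse = JSON.parse(
  '{"error":{"code":"visualeditor-docserver-http","info":"Error contacting the Parsoid/RESTBase server (HTTP 500): (no message)"},"servedby":"mediawiki-6ff94dc64-5tmqz"}'
)
// Placeholder for a response without an error object.
const otherBody: ActionApiResponse = JSON.parse('{}')

console.log(describeApiError(failedBody))
console.log(describeApiError(otherBody)) // null
```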
@TripleCamera Sorry for not coming back to you earlier; it is not lack of interest, but lack of time. I plan to look at your ticket in detail this weekend.
Thank you! After fixing the issues mentioned above, the scraper was running smoothly. However, I had to stop it because I don't have a lot of time either. It is estimated to finish in 5 hours (using the config below).
Here is a list of things I have done so far:
- Fix the main page issue in the code (See my first comment)
- Unset --mwApiPath
- Set --mwActionApiPath="api.php" (NO LEADING SLASH)
- ~~Set --articleListToIgnore="资源包/Folders"~~ This page has been deleted
- Set --speed to an appropriate value (I was using 0.5 and I couldn't sense significant changes on page load time)
Could you please apply these changes and relaunch the scraper? Next I have to rely on openZIM's scraper.
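As a rough illustration of the argument changes listed above, the sketch below assembles the relevant flags for git main. It is an assumption-laden example, not the actual recipe: buildArgs and stripLeadingSlashes are hypothetical helpers, the other recipe arguments are omitted, and stripping the slash is just a defensive step (I have not checked why a leading slash breaks).

```typescript
// Hypothetical helper: drop any leading slashes from a path argument.
function stripLeadingSlashes(path: string): string {
  return path.replace(/^\/+/, '')
}

// Hypothetical helper: build only the arguments discussed in this comment.
// --mwApiPath is intentionally absent.
function buildArgs(actionApiPath: string, speed: number): string[] {
  return [
    '--mwUrl=https://zh.minecraft.wiki/',
    `--mwActionApiPath=${stripLeadingSlashes(actionApiPath)}`,
    `--speed=${speed}`,
  ]
}

const args = buildArgs('/api.php', 0.5)
console.log(args.join(' '))
// --mwUrl=https://zh.minecraft.wiki/ --mwActionApiPath=api.php --speed=0.5
```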
Any progress so far?
@kelson42
Great, Kelson is back. It seems that this task can move forward a little bit more. :blush:
Update: @kelson42 Hello?
@kelson42 Hi. Have you been busy recently? Maybe you can assign this task to your colleagues (if they are free).
Hi. I just created a pull request which contains the patch. Can someone review & merge it? @kelson42
NOT DONE: The mwApiPath => mwActionApiPath issue hasn't been fixed.
Thank you, Winston! The scraper was launched again, but failed immediately. This is because some arguments are still not correct.
Here are the rest of the things to do (taken from my previous comment):
- [ ] Unset --mwApiPath
- [ ] Set --mwActionApiPath="api.php" (NO LEADING SLASH)
- [ ] Set --speed to an appropriate value (I was using 0.5 and I couldn't sense significant changes on page load time)
@kelson42 @audiodude Could you please reopen this issue? Thank you!
(please reopen this issue @audiodude .)
I wonder if this "no leading slash" should be considered a bug. I would recommend opening an issue to discuss whether this is a bug and whether we should fix it.
MediaWiki Special:Version, Action API, $wgArticlePath would contain/require the leading slash in the article path, so I guess it could be counted as a bug?
- https://www.mediawiki.org/wiki/Special:Version
- https://www.mediawiki.org/w/api.php?action=query&meta=siteinfo&siprop=general
- https://www.mediawiki.org/wiki/Manual:$wgArticlePath
Well, I'm not saying that this is a bug. The thing is that the scraper still has incorrect arguments. I'm not sure where to track this, should we track it here or at openzim/zim-requests#755, or somewhere else?
@TripleCamera Would be really easier to have the full command/log.
The issue should be tracked and discussed at https://github.com/openzim/zim-requests/issues/755 if it affects a specific ZIM recipe that has the wrong arguments, or is otherwise configured incorrectly.
In the scenario that it is not possible at all to configure the ZIM recipe correctly, because of limitations of mwoffliner, such an issue should be tracked here.
It sounds like your remaining problems are all recipe/parameter related and not related to the code of mwoffliner.
Finally, keep in mind that the code at main/HEAD of this repo is for dev/1.14 while the version used for ZIM recipes is still 1.13. I think you are already aware of this because of mwApiPath versus mwActionApiPath. It should be noted that getting the recipe to work locally on 1.14 will likely not help you debug the live recipe. However, you can always check out the 1.13 tag (https://github.com/openzim/mwoffliner/tree/v1.13.0) and test there.
To follow up further: these changes to make the scraping process work only affect version dev/1.14 which is the code in main of this repo. The steps to make 1.13 work are likely different.
Closing this issue as the scraping is reportedly working on HEAD. Please follow up on https://github.com/openzim/zim-requests/issues/755 to update/fix the live recipe.
OK. Thanks!