
New request: Minecraft Wiki (zh)

Open TripleCamera opened this issue 2 years ago • 41 comments

Please use the following format for a ZIM creation request (and delete unnecessary information)

TripleCamera avatar Dec 02 '23 07:12 TripleCamera

Hi. May I ask how long it usually takes to fulfill a request? Some of our readers have poor Internet connections, and an offline version might be the only solution for them.

TripleCamera avatar Jan 05 '24 07:01 TripleCamera

Hi, the recipe is created https://farm.openzim.org/recipes/minecraftwiki_zh_all I'll update the library link here once ready

RavanJAltaie avatar Jan 09 '24 13:01 RavanJAltaie

@RavanJAltaie Thank you! That's fast as lightning!

Unfortunately, if nothing went wrong, something would go wrong. The latest log said that there was a 404 when accessing https://zh.minecraft.wiki/w/api.php?action=query&meta=siteinfo&format=json&siprop=general|namespaces|statistics|variables|category|wikidesc.

This is because the API path is /api.php, not the default /w/api.php. For more information, please check out Special:Version.
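For illustration, here is a minimal sketch of how the two candidate URLs differ. `buildSiteInfoUrl` is a hypothetical helper written for this comment, not a function from MWoffliner; only the wiki URL, the two API paths, and the query string are taken from the log.

```typescript
// Hypothetical helper (not part of MWoffliner): joins a wiki base URL and an
// Action API path, then appends the siteinfo query quoted in the log.
function buildSiteInfoUrl(mwUrl: string, apiPath: string): string {
  const base = mwUrl.replace(/\/+$/, '') // drop trailing slashes
  const path = apiPath.replace(/^\/+/, '') // drop leading slashes
  const query = 'action=query&meta=siteinfo&format=json&siprop=general|namespaces|statistics|variables|category|wikidesc'
  return `${base}/${path}?${query}`
}

const mwUrl = 'https://zh.minecraft.wiki/'
// Default path: this is the URL that returned 404 in the log.
const defaultUrl = buildSiteInfoUrl(mwUrl, '/w/api.php')
// Path listed on Special:Version for this wiki.
const fixedUrl = buildSiteInfoUrl(mwUrl, '/api.php')

console.log(defaultUrl)
console.log(fixedUrl)
```

The only difference between the two results is the `/w` path segment, which is why changing the API path argument alone should be enough to get past this step.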

TripleCamera avatar Jan 09 '24 13:01 TripleCamera

the language should be zh instead of nan

xtexx avatar Jan 09 '24 14:01 xtexx

@xtexChooser @TripleCamera Thanks for your notes. Everything is fixed in the recipe; I re-ran it and will follow up.

RavanJAltaie avatar Jan 17 '24 23:01 RavanJAltaie

@RavanJAltaie The language is correct now. However, the value of mwApiPath is not correct. Please change it to /api.php, thanks.

TripleCamera avatar Jan 18 '24 10:01 TripleCamera

@RavanJAltaie Good news: I just set up the docker environment used by the openZIM scrapers. I am importing the config used by the scraper, and then I will try to fix the errors on my machine. I will post a list of corrected arguments once I finish.

Update: Here is the script:

#!/bin/bash

# Usage: sudo ./run.sh

# For docker:
#     Added: --rm
#     Modified: -v
#     Removed: --detach, --cpu-shares, --memory-swappiness, --memory
# For mwoffliner:
#     Modified: --adminEmail, --customZimDescription
#     Removed: --optimisationCacheUrl, --osTmpDir
docker run \
    -v /home/co-eda/mwoffliner-docker/output:/output:rw \
    --name mwoffliner_minecraftwiki_zh_all \
    --rm \
    ghcr.io/openzim/mwoffliner:1.13.0 \
    mwoffliner \
    --adminEmail="[email protected]" \
    --customZimDescription="Docker test" \
    --customZimFavicon="https://zh.minecraft.wiki/images/Wiki2x.png" \
    --customZimLanguage="zho" \
    --customZimTitle="Minecraft Wiki (zh)" \
    --format="novid:maxi" \
    --mwApiPath="/api.php" \
    --mwUrl="https://zh.minecraft.wiki/" \
    --outputDirectory="/output" \
    --publisher="openZIM" \
    --webp

TripleCamera avatar Feb 11 '24 04:02 TripleCamera

@RavanJAltaie TL;DR Please set --customZimFavicon to https://zh.minecraft.wiki/images/Wiki%402x.png, thanks.


I saw that the value of --mwApiPath had been changed to /api.php. However, at the same time, someone removed the %40 escape sequence (a percent-encoded @) from --customZimFavicon. Please add it back.
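To show why the %40 matters, here is a small sketch using only the standard `encodeURIComponent` and `decodeURIComponent` functions. The variable names are illustrative; the file name is the one implied by the URL above.

```typescript
// File name implied by the favicon URL; '@' is outside encodeURIComponent's safe set.
const fileName = 'Wiki@2x.png'
const encoded = encodeURIComponent(fileName)
const faviconUrl = `https://zh.minecraft.wiki/images/${encoded}`

// Dropping "%40" produces a different file name, not an equivalent spelling.
const stripped = faviconUrl.replace('%40', '')

console.log(faviconUrl) // https://zh.minecraft.wiki/images/Wiki%402x.png
console.log(decodeURIComponent(encoded) === fileName) // true
console.log(stripped) // https://zh.minecraft.wiki/images/Wiki2x.png
```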

The next issue I encountered after fixing this was:

Unable to find appropriate API end-point to retrieve article HTML

I am still investigating this.

TripleCamera avatar Feb 13 '24 09:02 TripleCamera

I found the cause of Unable to find appropriate API end-point to retrieve article HTML. Here is a code analysis of MWoffliner v1.13.0 (since all the scrapers are using it).

Before the scrape starts, MWoffliner checks mobile REST API, desktop REST API, and VE ~~REST~~ API capabilities for a specific page (parameter testArticleId) in Downloader.checkCapabilities:

https://github.com/openzim/mwoffliner/blob/e9d4113536f0eebdaabe8cc26e25ccdeeca20e32/src/Downloader.ts#L243-L263

  public async checkCapabilities(testArticleId = 'MediaWiki:Sidebar'): Promise<void> {
    // By default check all API's responses and set the capabilities
    // accordingly. We need to set a default page (always there because
    // installed per default) to request the REST API, otherwise it would
    // fail the check.
    this.mwCapabilities.mobileRestApiAvailable = await this.checkApiAvailabilty(this.mw.getMobileRestApiArticleUrl(testArticleId))
    this.mwCapabilities.desktopRestApiAvailable = await this.checkApiAvailabilty(this.mw.getDesktopRestApiArticleUrl(testArticleId))
    this.mwCapabilities.veApiAvailable = await this.checkApiAvailabilty(this.mw.getVeApiArticleUrl(testArticleId))
    this.mwCapabilities.apiAvailable = await this.checkApiAvailabilty(this.mw.apiUrl.href)

    // Coordinate fetching
    // [...]
  }

The default value MediaWiki:Sidebar is never used because the value of mwMetaData.mainPage is passed:

https://github.com/openzim/mwoffliner/blob/e9d4113536f0eebdaabe8cc26e25ccdeeca20e32/src/mwoffliner.lib.ts#L206

  await downloader.checkCapabilities(mwMetaData.mainPage)

The value of mwMetaData.mainPage comes from the API: the "base" URL is split on slashes and its last segment is taken as the main page title. (This is a bad idea because different wikis use different URL rewrites.)

https://github.com/openzim/mwoffliner/blob/e9d4113536f0eebdaabe8cc26e25ccdeeca20e32/src/MediaWiki.ts#L290-L325

  public async getMwMetaData(downloader: Downloader): Promise<MWMetaData> {
    if (this.metaData) {
      return this.metaData
    }

    const creator = this.getCreatorName() || 'Kiwix'

    const [textDir, { langIso2, langIso3, mainPage, siteName }, subTitle] = await Promise.all([
      this.getTextDirection(downloader),
      this.getSiteInfo(downloader),
      this.getSubTitle(downloader),
    ])

    const mwMetaData: MWMetaData = {
      // [...]
      mainPage,
    }

    this.metaData = mwMetaData

    return mwMetaData
  }

https://github.com/openzim/mwoffliner/blob/e9d4113536f0eebdaabe8cc26e25ccdeeca20e32/src/MediaWiki.ts#L235-L279

  public async getSiteInfo(downloader: Downloader) {
    logger.log('Getting site info...')
    const query = 'action=query&meta=siteinfo&format=json&siprop=general|namespaces|statistics|variables|category|wikidesc'
    const body = await downloader.query(query)
    const entries = body.query.general

    // Checking mediawiki version
    const mwVersion = semver.coerce(entries.generator).raw
    const mwMinimalVersion = 1.27
    if (!entries.generator || !semver.satisfies(mwVersion, `>=${mwMinimalVersion}`)) {
      throw new Error(`Mediawiki version ${mwVersion} not supported should be >=${mwMinimalVersion}`)
    }

    // Base will contain the default encoded article id for the wiki.
    const mainPage = decodeURIComponent(entries.base.split('/').pop())
    const siteName = entries.sitename

    // [...]

    return {
      mainPage,
      siteName,
      langIso2,
      langIso3,
    }
  }

This works for many wikis like English Wikipedia, but not for Chinese Minecraft Wiki. The reason is that MCW-zh uses a URL rewrite, so its base URL does not end with the main page title:

// Wikipedia-en
"base": "https://en.wikipedia.org/wiki/Main_Page",
// MCW-zh
"base": "https://zh.minecraft.wiki/",
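As a quick check, the same expression from getSiteInfo can be applied to both "base" values. `mainPageFromBase` is a hypothetical wrapper written for this comment; the `?? ''` fallback is only there to satisfy strict TypeScript and is not in the original code.

```typescript
// Same expression as in getSiteInfo (v1.13.0), wrapped in a hypothetical helper.
function mainPageFromBase(base: string): string {
  return decodeURIComponent(base.split('/').pop() ?? '')
}

const enMainPage = mainPageFromBase('https://en.wikipedia.org/wiki/Main_Page')
const zhMainPage = mainPageFromBase('https://zh.minecraft.wiki/')

console.log(JSON.stringify(enMainPage)) // "Main_Page"
console.log(JSON.stringify(zhMainPage)) // ""
```

For MCW-zh the derived title is an empty string, so the capability checks would be run against an empty page title, which would explain why none of the article endpoints is detected.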

Currently I don't know how to fix this. Do you have any ideas?

TripleCamera avatar Feb 14 '24 06:02 TripleCamera

Currently I don't know how to fix this. Do you have any ideas?

I think you should open a ticket at mwoffliner referencing your comment.

rgaudin avatar Feb 14 '24 08:02 rgaudin

I fixed the recipe, which was wrongly configured, earlier today. We have to document how to configure mwoffliner properly! But no (visual editor) API is available. I have tried version 1.14 (still in development), which supports more API end-points, but I'm not done with this yet.

kelson42 avatar Feb 14 '24 08:02 kelson42

I think you should open a ticket at mwoffliner referencing your comment.

Okay, I just opened openzim/mwoffliner#1995.

Both the code and the config differ a lot between v1.13.0 and git main, so I need to alter the config and test this on git main.

I don't know if this issue can be fixed without modifying code. The worst case would be switching to git main. :frowning_face:

TripleCamera avatar Feb 14 '24 12:02 TripleCamera

I fixed the recipe, which was wrongly configured, earlier today. We have to document how to configure mwoffliner properly! But no (visual editor) API is available. I have tried version 1.14 (still in development), which supports more API end-points, but I'm not done with this yet.

Thank you! However, the config differs between v1.13.0 and git main, so you need to rewrite the config to make it work.

In v1.13.0 (I will test git main later), MWoffliner accepts three different APIs:

  • Mobile REST API: Only available in Wikimedia REST API.

  • Desktop REST API: Available in both Wikimedia REST API and MediaWiki REST API. However, MediaWiki REST API cannot be used without modifying the code.

    In MWoffliner, the URL is hardcoded so that the page title can only come last. I tried modifying the code, and it seemed to work (it failed at a later stage :frowning_face:, but it looks promising). [Screenshot 2024-02-14 215918]

  • VisualEditor API: Available in both Wikimedia REST API and MediaWiki REST API. Minecraft Wiki (zh) is supposed to be scraped in this way. However, it cannot work now because of the bug mentioned above.
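The "page title can only come last" limitation can be sketched as follows. Both helpers are hypothetical and are not MWoffliner code. The two paths are used as examples on the assumption that the Wikimedia REST API serves HTML at `/api/rest_v1/page/html/{title}` and the MediaWiki core REST API at `/rest.php/v1/page/{title}/html`.

```typescript
// Append-only style: the base path is fixed and the title is always the final segment.
function appendTitle(basePath: string, title: string): string {
  return `${basePath.replace(/\/+$/, '')}/${encodeURIComponent(title)}`
}

// Hypothetical template style: the title may sit anywhere in the path.
function fillTemplate(template: string, title: string): string {
  return template.replace('{title}', encodeURIComponent(title))
}

const wikimediaStyle = appendTitle('/api/rest_v1/page/html', 'Main_Page')
const coreRestStyle = fillTemplate('/rest.php/v1/page/{title}/html', 'Main_Page')

console.log(wikimediaStyle) // /api/rest_v1/page/html/Main_Page
console.log(coreRestStyle) // /rest.php/v1/page/Main_Page/html
```

The second URL cannot be produced by appending a title to any fixed base path, which is why a code change is needed for that endpoint.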


Update: @xtexChooser inspired me to try Parsoid API, whose URL is /rest.php/{domain}/v3/page/html/{title}. So I set --mwRestApiPath="/rest.php/zh.minecraft.wiki/v3/page/html". However, this would be redirected to /rest.php/{domain}/v3/page/html/{title}/{latest_revision}. Since the response code is 302, not 200, it is regarded as inaccessible.
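The difference between a strict status check and one that tolerates redirects can be sketched like this. Both functions are hypothetical illustrations of the behaviour described above, not MWoffliner's actual implementation, and the status chains are made-up inputs.

```typescript
// Strict check: only a direct 200 counts, so a 302 is treated as "not available".
function isAvailableStrict(status: number): boolean {
  return status === 200
}

// Hypothetical lenient variant: skip over redirects and judge the first non-redirect response.
function isAvailableFollowingRedirects(statusChain: number[]): boolean {
  for (const status of statusChain) {
    if (status >= 300 && status < 400) continue // redirect: look at the next hop
    return status === 200
  }
  return false // chain was empty or ended on a redirect
}

console.log(isAvailableStrict(302)) // false
console.log(isAvailableFollowingRedirects([302, 200])) // true
```

With the strict check, the Parsoid URL is rejected even though the redirect target would have returned the page.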

TripleCamera avatar Feb 14 '24 14:02 TripleCamera

Upstream? All right, I will post my progress in the upstream issue.

TripleCamera avatar Feb 20 '24 03:02 TripleCamera

I'm back. openzim/mwoffliner#1995 has been fixed, which enables MWoffliner to scrape MCW-zh. However, the recipe still fails due to incorrect arguments.

@RavanJAltaie Hi. Could you please fix the recipe? The steps are:

  • [ ] Unset --mwApiPath
  • [ ] Set --mwActionApiPath="api.php" (NO LEADING SLASH)
  • [ ] Set --speed to an appropriate value (I was using 0.5 and I couldn't sense significant changes on page load time)

TripleCamera avatar Jul 23 '24 00:07 TripleCamera

Can someone remove the "Upstream" label and reassign @RavanJAltaie? Thanks.

TripleCamera avatar Aug 06 '24 03:08 TripleCamera

Hi. Excuse me, @RavanJAltaie. Is it possible to continue with this issue? It's been stalled for two months.

Tip: This issue was created almost a year ago, so it's very far from the top of the list. You may use sort:updated-desc in the search bar to sort by last reply.

TripleCamera avatar Oct 16 '24 07:10 TripleCamera

@TripleCamera This wiki will require using mediawiki offliner, which is still being revamped. I'm told release is set to be soon, at which point this request is likely to move pretty fast along the queue. Until then, nothing that we can do ¯\_(ツ)_/¯

Popolechien avatar Oct 16 '24 08:10 Popolechien

@TripleCamera This wiki will require using mediawiki offliner, which is still being revamped. I'm told release is set to be soon, at which point this request is likely to move pretty fast along the queue. Until then, nothing that we can do ¯\_(ツ)_/¯

OK, that's a sad story. :frowning_face: Thank you for your explanation.

TripleCamera avatar Oct 16 '24 08:10 TripleCamera

It seems that mwoffliner has released 1.14. Can we fulfill this request now?

TripleCamera avatar Mar 12 '25 02:03 TripleCamera

TbC ; pipe of zim-requests is still significant, including many mwoffliner tasks, might take some time until we can work on this issue, but definitely still on our radar

You can see where this issue sits in the pipe at https://github.com/openzim/zim-requests/issues?q=is%3Aissue%20state%3Aopen%20label%3AMediawiki%20sort%3Acreated-asc%20-label%3AUpstream%20 (we currently process them in chronological order)

benoit74 avatar Mar 13 '25 07:03 benoit74

TbC ; pipe of zim-requests is still significant, including many mwoffliner tasks, might take some time until we can work on this issue, but definitely still on our radar

You can see where this issue sits in the pipe at https://github.com/openzim/zim-requests/issues?q=is%3Aissue%20state%3Aopen%20label%3AMediawiki%20sort%3Acreated-asc%20-label%3AUpstream%20 (we currently process them in chronological order)

Thank you! However, this issue currently isn't in the pipe because it has the "Upstream" tag. Please remove it.

TripleCamera avatar Mar 13 '25 13:03 TripleCamera

Indeed, thanks

benoit74 avatar Mar 13 '25 14:03 benoit74

Good news: I created a full ZIM last night. I just uploaded it to my OneDrive; it is located in the mcwzh_20250723 folder.

It seems that a lot of bugs have been reported in the English wiki issue, but anyway, feel free to try it out.

TripleCamera avatar Jul 23 '25 02:07 TripleCamera

If you used the main branch on mwoffliner, then most of the issues reported on the English wiki issue should already be fixed.

So please make sure to report any issues you notice.

Markus-Rost avatar Jul 23 '25 02:07 Markus-Rost

If you used the main branch on mwoffliner, then most of the issues reported on the English wiki issue should already be fixed.

So please make sure to report any issues you notice.

Oops, I'm still using 1.15.0. That's too old! I didn't even realize that 1.16.0 was released two weeks ago.

I will switch to the main branch next time.

TripleCamera avatar Jul 23 '25 03:07 TripleCamera

Oh no. It failed.

[error] [2025-07-23T16:30:07.461Z] This is a fatal download error, aborting
[error] [2025-07-23T16:30:07.461Z] Error downloading/rendering article Microsoft
[error] [2025-07-23T16:30:07.464Z] Failed to run mwoffliner after [2183s]:
 {
  name: 'Error',
  message: 'getaddrinfo EAI_AGAIN zh.minecraft.wiki',
  url: 'https://zh.minecraft.wiki/api.php?action=parse&format=json&prop=modules%7Cjsconfigvars%7Cheadhtml%7Ctext%7Cdisplaytitle%7Csubtitle&usearticle=1&disableeditsection=1&disablelimitreport=1&page=Microsoft&useskin=vector&variant=zh-cn&redirects=1&formatversion=2',
  status: undefined,
  responseType: 'json',
  data: undefined
}

TripleCamera avatar Jul 24 '25 01:07 TripleCamera

It seems to be your DNS server's fault.
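For context, EAI_AGAIN is the getaddrinfo code for a temporary name resolution failure, so retrying after a pause often helps. Below is a minimal retry sketch for your own scripts. All three functions are hypothetical and this is not how MWoffliner handles errors internally. The attempt count and delays are arbitrary choices.

```typescript
// True for errors that look like a temporary DNS failure (EAI_AGAIN).
function isTransientDnsError(err: unknown): boolean {
  if (typeof err !== 'object' || err === null) return false
  const e = err as { code?: unknown; message?: unknown }
  if (e.code === 'EAI_AGAIN') return true
  return typeof e.message === 'string' && e.message.indexOf('EAI_AGAIN') !== -1
}

// Exponential backoff: 1 s, 2 s, 4 s, ... capped at 60 s.
function backoffDelayMs(attempt: number): number {
  return Math.min(1000 * Math.pow(2, attempt), 60000)
}

// Runs the task, retrying only on transient DNS errors, up to maxAttempts tries in total.
async function withDnsRetry<T>(task: () => Promise<T>, maxAttempts = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task()
    } catch (err) {
      if (!isTransientDnsError(err) || attempt + 1 >= maxAttempts) throw err
      await new Promise<void>((resolve) => setTimeout(resolve, backoffDelayMs(attempt)))
    }
  }
}
```

Any other kind of error is rethrown immediately, so a genuine 404 or parse failure would still surface on the first attempt.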

xtexx avatar Jul 24 '25 02:07 xtexx

It seems to be your DNS server's fault.

That's weird. I will try again tonight.

TripleCamera avatar Jul 24 '25 12:07 TripleCamera

I just created another ZIM file using the latest MWoffliner, and placed it inside the mcwzh_20250725 folder. Please try it out.

TripleCamera avatar Jul 25 '25 06:07 TripleCamera