Introduce MediaWiki Parsoid API to render articles

Open VadimKovalenkoSNF opened this issue 1 year ago • 7 comments

Fixes https://phabricator.wikimedia.org/T324866

Supersedes #1846

VadimKovalenkoSNF avatar Sep 04 '23 18:09 VadimKovalenkoSNF

@VadimKovalenkoSNF I'm a bit lost, can you please tell me:

  • Which MWOffliner ticket it fixes
  • What the approach to fix it is

Any chance to get that completed today?

kelson42 avatar Sep 06 '23 06:09 kelson42

@kelson42, I haven't noticed a dedicated ticket for mwoffliner. This patch mostly replicates the functionality from https://github.com/openzim/mwoffliner/pull/1846 but on top of recent changes. In fact, it solves both problems: it reduces traffic to the MW infrastructure and allows mwoffliner to avoid a redundant request to get modules per article if the wiki supports the Action API with the parsoid=1 parameter. Do you want me to open a ticket here as well? This can be fixed today, I need to take a look at some broken tests. Note that I still haven't received an answer to https://phabricator.wikimedia.org/T324866#9139172, but my solution seems to work without useparsoid.

VadimKovalenkoSNF avatar Sep 06 '23 06:09 VadimKovalenkoSNF

No real need to open an issue on our side... I'd be glad if I can review the PR soon.

kelson42 avatar Sep 06 '23 08:09 kelson42

Any chance to get that completed today?

Update: I've noticed that the Parsoid API has trouble with media handling, and probably other issues in the output. Compare the example of https://en.m.wikipedia.org/wiki/User:Kelson/MWoffliner_CI_reference in the screenshot below. On the left side is the output of WikimediaDesktop and on the right side is the output of MediaWiki Parsoid. Debugging these differences might take some time; at least I can try to figure out how to enable the missing media in the gallery, etc.

[Screenshot: media-content-user-page]

VadimKovalenkoSNF avatar Sep 06 '23 10:09 VadimKovalenkoSNF

@VadimKovalenkoSNF can you please explain the principle of your PR in the description, because I don't get it. It should only be about adding "parsoid" to a URL... and now we are talking about something very different.

kelson42 avatar Sep 06 '23 12:09 kelson42

@kelson42 This PR introduces a new renderer based on parsoid=1 in the MediaWiki Action API. Instead of WikimediaDesktop, which is represented by this example: https://en.wikipedia.org/api/rest_v1/page/html/Foobar, mwoffliner will query this endpoint: https://en.m.wikipedia.org/w/api.php?action=parse&format=json&prop=text%7Cmodules%7Cjsconfigvars%7Cheadhtml&parsoid=1&page=Foobar
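
For illustration, here is a minimal sketch of how the Action API URL above could be assembled. The helper name `buildParsoidParseUrl` is hypothetical and is not taken from the PR; only the query parameters (`action`, `format`, `prop`, `parsoid`, `page`) come from the example URL in this comment.

```typescript
// Hypothetical helper (not mwoffliner code): builds an Action API "parse" URL
// with parsoid=1, using the same parameters as the example URL in this thread.
function buildParsoidParseUrl(apiBase: string, title: string): string {
  const params: Array<[string, string]> = [
    ['action', 'parse'],
    ['format', 'json'],
    // The "|" separators are percent-encoded as %7C by encodeURIComponent.
    ['prop', 'text|modules|jsconfigvars|headhtml'],
    ['parsoid', '1'],
    ['page', title],
  ];
  const query = params
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join('&');
  return `${apiBase}?${query}`;
}
```

Calling `buildParsoidParseUrl('https://en.m.wikipedia.org/w/api.php', 'Foobar')` yields the same URL as the one quoted above.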

As you can see, it has a text property with the article HTML, as well as headhtml, modules, modulescripts, modulestyles and jsconfigvars. Having these properties in a single response saves mwoffliner from triggering an additional request to get the modules.
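
To make the single-response idea concrete, here is a rough sketch of pulling those properties out of an already-decoded JSON body. The names `ParseResult`, `unwrap`, `asStringArray` and `extractParseResult` are hypothetical and do not come from the PR. It also assumes that string fields such as text and headhtml arrive either as plain strings or wrapped in an object under a `"*"` key, depending on the JSON format version requested.

```typescript
// Hypothetical shape for the data a renderer would need from one response.
interface ParseResult {
  html: string;
  headHtml: string;
  modules: string[];
  moduleScripts: string[];
  moduleStyles: string[];
  jsConfigVars: Record<string, unknown>;
}

// Assumption: a field is either a plain string or an object like {"*": "..."}.
function unwrap(value: unknown): string {
  if (typeof value === 'string') return value;
  if (value !== null && typeof value === 'object' && '*' in value) {
    const inner = (value as Record<string, unknown>)['*'];
    return typeof inner === 'string' ? inner : '';
  }
  return '';
}

// Keeps only string entries; anything that is not an array becomes [].
function asStringArray(value: unknown): string[] {
  return Array.isArray(value)
    ? value.filter((v): v is string => typeof v === 'string')
    : [];
}

// Reads the article HTML and module lists from one decoded response body.
function extractParseResult(body: unknown): ParseResult {
  const parse = (body as { parse?: Record<string, unknown> } | null)?.parse;
  if (!parse) throw new Error('Response has no "parse" object');
  const vars = parse['jsconfigvars'];
  return {
    html: unwrap(parse['text']),
    headHtml: unwrap(parse['headhtml']),
    modules: asStringArray(parse['modules']),
    moduleScripts: asStringArray(parse['modulescripts']),
    moduleStyles: asStringArray(parse['modulestyles']),
    // Non-object values (including arrays) fall back to an empty object.
    jsConfigVars:
      vars !== null && typeof vars === 'object' && !Array.isArray(vars)
        ? (vars as Record<string, unknown>)
        : {},
  };
}
```

With a helper like this, the module lists come out of the same response as the HTML, so no separate per-article module request is needed.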

The problem I noted is that the article HTML from the MediaWiki Action API differs from the WikimediaDesktop response, which leads to different output (missing media, etc.).

VadimKovalenkoSNF avatar Sep 06 '23 12:09 VadimKovalenkoSNF

@VadimKovalenkoSNF Needs to be rebased

kelson42 avatar Sep 11 '23 05:09 kelson42