mwoffliner
Introduce MediaWiki Parsoid API to render articles
Fixes https://phabricator.wikimedia.org/T324866
Supersedes #1846
@VadimKovalenkoSNF I'm a bit lost, can you please tell me:
- Which MWOffliner ticket it fixes
- What the approach to fix it is
Any chance to get that completed today?
@kelson42, I haven't noticed a dedicated ticket for mwoffliner. This patch mostly replicates the functionality of https://github.com/openzim/mwoffliner/pull/1846, but on top of recent changes. In fact, it solves both problems: it reduces traffic to the MW infrastructure and lets mwoffliner avoid a redundant request to get modules per article if the wiki supports the Action API with the `parsoid=1` parameter. Do you want me to open a ticket here as well? This can be fixed today; I need to take a look at some broken tests. Note that I still haven't received an answer to https://phabricator.wikimedia.org/T324866#9139172, but my solution seems to work without `useparsoid`.
No real need to open an issue on our side... glad if I can review the PR soon.
Update: I've noticed that the Parsoid API has trouble with media treatment, and probably other issues in the output. Compare the rendering of https://en.m.wikipedia.org/wiki/User:Kelson/MWoffliner_CI_reference in the screenshot below: the left side is the output of WikimediaDesktop and the right side is the output of MediaWiki Parsoid. Debugging these treatments might take some time; at least I can try to figure out how to enable the missing media in the gallery, etc.
@VadimKovalenkoSNF can you please explain the principle of your PR in the description, because I don't get it. It should only be about adding "parsoid" to a URL... and now we are talking about something very different.
@kelson42 This PR introduces a new renderer based on `parsoid=1` in the MediaWiki Action API. Instead of WikimediaDesktop, which is represented by this example https://en.wikipedia.org/api/rest_v1/page/html/Foobar, mwoffliner will query this endpoint:
https://en.m.wikipedia.org/w/api.php?action=parse&format=json&prop=text%7Cmodules%7Cjsconfigvars%7Cheadhtml&parsoid=1&page=Foobar
As you can see, it has a `text` property with the article HTML, as well as `headhtml`, `modules`, `modulescripts`, `modulestyles` and `jsconfigvars`. Having these properties in a single response saves mwoffliner from triggering an additional request to get the modules.
The problem I described is that the article HTML from the MediaWiki Action API differs from the WikimediaDesktop response, which results in different output (missing media, etc.).
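
To make the request and response described above concrete, here is a minimal sketch of building the `action=parse` URL and pulling the listed properties out of one response. The names `buildParsoidParseUrl`, `extractRenderData` and `ParsoidRenderData` are hypothetical and are not taken from the mwoffliner code base. The sketch assumes the response wraps `text` and `headhtml` either as plain strings or as objects with a `*` key, depending on the JSON format version.

```typescript
// Hypothetical shape for the data a renderer would need from one response.
interface ParsoidRenderData {
  html: string
  headHtml: string
  modules: string[]
  moduleScripts: string[]
  moduleStyles: string[]
  jsConfigVars: Record<string, unknown>
}

// Build the Action API URL shown in the comment above for a given page title.
function buildParsoidParseUrl(apiUrl: string, title: string): string {
  const params = new URLSearchParams({
    action: 'parse',
    format: 'json',
    prop: 'text|modules|jsconfigvars|headhtml',
    parsoid: '1',
    page: title,
  })
  return `${apiUrl}?${params.toString()}`
}

// Assumption: a value is either a plain string or an object like { '*': '...' }.
function unwrap(value: unknown): string {
  if (typeof value === 'string') return value
  if (value && typeof value === 'object') {
    const inner = (value as Record<string, unknown>)['*']
    if (typeof inner === 'string') return inner
  }
  return ''
}

// Collect article HTML and module lists from a single parsed JSON body.
function extractRenderData(body: any): ParsoidRenderData {
  const parse = body?.parse ?? {}
  const vars = parse.jsconfigvars
  return {
    html: unwrap(parse.text),
    headHtml: unwrap(parse.headhtml),
    modules: parse.modules ?? [],
    moduleScripts: parse.modulescripts ?? [],
    moduleStyles: parse.modulestyles ?? [],
    // Fall back to an empty object when the field is missing or is an array.
    jsConfigVars: vars && !Array.isArray(vars) ? vars : {},
  }
}
```

Calling `buildParsoidParseUrl('https://en.m.wikipedia.org/w/api.php', 'Foobar')` reproduces the endpoint quoted above, and `extractRenderData` shows why no second request for modules would be needed when the wiki supports this parameter.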
@VadimKovalenkoSNF Needs to be rebased