Dumpgenerator rewrite: Use pywikibot (pywikipedia)

Open PiRSquared17 opened this issue 10 years ago • 12 comments

Nemo has suggested porting the existing code to the pywikipediabot framework.

PiRSquared17 avatar Sep 28 '14 03:09 PiRSquared17

What functions of pywikipediabot are needed? I prefer to keep the list of dependencies to a minimum.

emijrp avatar Sep 29 '14 07:09 emijrp

Emilio J. Rodríguez-Posada, 29/09/2014 09:41:

What functions of pywikipediabot are needed? I prefer to keep the list of dependencies to a minimum.

Getting API entry point, page lists and XML could all be delegated to PWB.

This would be the implementation of the rewrite plan: https://meta.wikimedia.org/wiki/WikiTeam/Dumpgenerator_rewrite

nemobis avatar Sep 29 '14 07:09 nemobis

Our API entry point is pretty simple with the requests module and works fine, right? The page lists may fail while scraping HTML, but through the API they work well. As for the XML, we have some issues with very big histories (memory issues), but do we know if pywikibot manages this OK?

I mean, we can use/copy some of their modules/functions. But I don't think adding the whole framework as a dependency (it contains dozens of scripts and directories) is needed.

Before making any move, I would like to see examples where WikiTeam fails and Pywikibot rocks.

Pywikibot has a great community of skilled coders. We can ask them for help fixing some of our bugs while we maintain our independence.

emijrp avatar Sep 29 '14 08:09 emijrp
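
To make the "page lists through the API" point concrete, here is a minimal sketch of listing titles with `list=allpages` and following continuation. It is not WikiTeam's or Pywikibot's actual code: `iter_page_titles` and the injected `fetch` callable are hypothetical names, and the sketch assumes a newer MediaWiki that returns a `continue` block (older releases used `query-continue` instead).

```python
def iter_page_titles(fetch, namespace=0, limit=500):
    """Yield page titles from list=allpages, following API continuation.

    `fetch` is a hypothetical callable: it takes a dict of query
    parameters and returns the decoded JSON response as a dict.
    """
    params = {
        "action": "query",
        "list": "allpages",
        "apnamespace": namespace,
        "aplimit": limit,
        "format": "json",
    }
    while True:
        # Pass a copy so the caller never sees later mutations.
        data = fetch(dict(params))
        for page in data.get("query", {}).get("allpages", []):
            yield page["title"]
        # Assumption: modern continuation; the API returns a `continue`
        # dict whose keys are merged into the next request.
        cont = data.get("continue")
        if not cont:
            break
        params.update(cont)
```

With the requests module, `fetch` could be something like `lambda p: requests.get(api_url, params=p).json()`, where `api_url` is the wiki's `api.php` address.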

Emilio J. Rodríguez-Posada, 29/09/2014 10:08:

Our API entry point is pretty simple with the requests module and works fine, right?

Dunno. There's also the screenscraping part which is a bunch of regex hacks. Same for entry point extraction, already handled by pwb https://gerrit.wikimedia.org/r/160207

The page lists may fail while scraping HTML, but through the API they work well. As for the XML, we have some issues with very big histories (memory issues), but do we know if pywikibot manages this OK?

You could test https://gerrit.wikimedia.org/r/#/c/136352/

I mean, we can use/copy some of their modules/functions.

Forking PWB is not an option.

But I don't think adding the whole framework as a dependency (it contains dozens of scripts and directories) is needed.

Before making any move, I would like to see examples where WikiTeam fails and Pywikibot rocks.

That's what the rewrite branch is for. :)

Pywikibot has a great community of skilled coders. We can ask them for help fixing some of our bugs while we maintain our independence.

Of course a partnership needs to have benefit for both sides.

nemobis avatar Sep 29 '14 08:09 nemobis

The core pwb library (v2.0) doesn't have a lot of dependencies; in fact, it has only one: httplib2. We are in the process of completing and packaging pwb v2.0.

And that is the primary problem, I believe: wikiteam has moved to requests, while pwb uses httplib2. I think we could solve that by either a) improving pwb to support requests, or b) lots of testing of wikiteam/requests and a 'pwb lite with only httplib2 dependency' package.

jayvdb avatar Sep 29 '14 11:09 jayvdb
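
One way to picture option (a) is a thin adapter that hides which HTTP library is in use. The sketch below is only an illustration under that assumption, not how pwb is structured: `fetch_with_requests`, `fetch_with_httplib2` and `get_body` are hypothetical names, and each backend is imported lazily so the module loads even when only one of the two libraries is installed.

```python
from urllib.parse import urlencode


def fetch_with_requests(url, params=None):
    """Backend using requests; returns (status_code, body_bytes)."""
    import requests  # lazy import: only needed if this backend is chosen
    response = requests.get(url, params=params, timeout=30)
    return response.status_code, response.content


def fetch_with_httplib2(url, params=None):
    """Backend using httplib2; returns (status_code, body_bytes)."""
    import httplib2  # lazy import: only needed if this backend is chosen
    if params:
        url = url + ("&" if "?" in url else "?") + urlencode(params)
    response, content = httplib2.Http(timeout=30).request(url, "GET")
    return response.status, content


def get_body(fetch, url, params=None):
    """Fetch `url` through whichever backend was passed in as `fetch`."""
    status, body = fetch(url, params)
    if status != 200:
        raise IOError("HTTP %d for %s" % (status, url))
    return body
```

Callers would then use `get_body(fetch_with_requests, api_url, {...})` or swap in `fetch_with_httplib2` without touching the rest of the code.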

What is the minimum version of python that wikiteam wants to support? I see dumpgenerator tries to support py2.4 on line 35 "from md5 import new as md5" to provide fixed maximum length filenames (I think).

jayvdb avatar Sep 29 '14 12:09 jayvdb
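
For readers unfamiliar with the pattern, the compatibility import and the "fixed maximum length filename" idea can be sketched roughly as follows. This is an assumption-laden illustration, not dumpgenerator's actual code: `truncate_filename` and the 100-character limit are hypothetical.

```python
try:
    from hashlib import md5  # available since Python 2.5
except ImportError:
    from md5 import new as md5  # fallback for Python 2.4, as quoted above


def truncate_filename(name, max_len=100):
    """Return `name` unchanged if short enough, else a fixed-length variant.

    Long names keep their leading characters and get the 32-character
    hex MD5 digest of the full name appended, so distinct long names
    remain distinct. `max_len` is a hypothetical limit and must exceed 32.
    """
    if len(name) <= max_len:
        return name
    digest = md5(name.encode("utf-8")).hexdigest()
    keep = max_len - len(digest)
    return name[:keep] + digest
```

On Python 2.6 and later the `except` branch is never taken, so the fallback can be dropped once 2.4 support is no longer needed.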

John Vandenberg, 29/09/2014 14:35:

What is the minimum version of python that wikiteam /wants/ to support? I see dumpgenerator tries to support py2.4 on line 35 "from md5 import new as md5" to provide fixed maximum length filenames (I think).

I think 2.6+ is enough now, that BC code was from earlier on. Nowadays most compatibility complaints we get are from python3 users and similar.

nemobis avatar Sep 29 '14 12:09 nemobis

pywikibot works on python 3 ;-) with 500+ tests https://travis-ci.org/wikimedia/pywikibot-core

jayvdb avatar Sep 29 '14 13:09 jayvdb

Thanks John for https://meta.wikimedia.org/w/index.php?title=WikiTeam%2FDumpgenerator_rewrite&diff=10039513&oldid=8892313. emijrp, the last bullet changed is already one feature we'd gain.

nemobis avatar Sep 29 '14 19:09 nemobis

@jayvdb Does pwb support old versions of MediaWiki (e.g. MW 1.9)?

PiRSquared17 avatar Oct 01 '14 16:10 PiRSquared17

As for the XML, we have some issues with very big histories (memory issues), but do we know if pywikibot manages this OK?

I had forgotten it but the rewrite page says "if the api call uses api.CachedRequest, it will write to the disk." So that's another bug it's supposed to fix.

nemobis avatar Oct 01 '14 17:10 nemobis
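
The memory issue with big histories comes down to holding a whole response in memory instead of writing it out as it arrives. The sketch below shows the general idea only; it does not reproduce how `api.CachedRequest` works, and `save_chunks` is a hypothetical helper.

```python
def save_chunks(chunks, path):
    """Write an iterable of byte chunks to `path`; return bytes written.

    Only one chunk is held in memory at a time, so the size of the
    full export does not matter.
    """
    written = 0
    with open(path, "wb") as handle:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                handle.write(chunk)
                written += len(chunk)
    return written
```

With the requests module, the chunks could come from `requests.get(url, stream=True).iter_content(chunk_size=65536)`.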

Six years have passed, and the newer versions of MediaWiki may now be more common out there than the ancient ones we focused on so much. It's possible that nowadays we can freeze the features of the old index.php scraping and rely on a library like mwclient for the newer versions. Time will tell.

nemobis avatar Feb 10 '20 22:02 nemobis
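
The split suggested here (frozen index.php scraping for old wikis, an API library for newer ones) could be driven by the version reported in the siteinfo `generator` string. A rough sketch, assuming a cut-off that is purely hypothetical: `parse_mw_version`, `choose_backend` and the `(1, 16)` threshold are illustrative and not taken from the thread.

```python
import re


def parse_mw_version(generator):
    """Parse a generator string like 'MediaWiki 1.23.5' into (1, 23)."""
    match = re.match(r"MediaWiki (\d+)\.(\d+)", generator)
    if not match:
        return None
    return int(match.group(1)), int(match.group(2))


def choose_backend(generator, api_minimum=(1, 16)):
    """Pick a dump strategy; `api_minimum` is a hypothetical cut-off."""
    version = parse_mw_version(generator)
    if version is None or version < api_minimum:
        return "index.php scraping"
    return "api library"
```

Comparing tuples rather than strings avoids treating 1.9 as newer than 1.16.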