Dumpgenerator rewrite: Use pywikibot (pywikipedia)
Nemo has suggested to port the existing code to use the pywikipediabot framework.
What functions of pywikipediabot are needed? I prefer to keep the list of dependencies to a minimum.
Emilio J. Rodríguez-Posada, 29/09/2014 09:41:
What functions of pywikipediabot are needed? I prefer to keep the list of dependencies to a minimum.
Getting API entry point, page lists and XML could all be delegated to PWB.
This would be the implementation of the rewrite plan: https://meta.wikimedia.org/wiki/WikiTeam/Dumpgenerator_rewrite
Our API entry point detection is pretty simple with the requests module and works fine, right? Page-list retrieval may fail while scraping HTML, but through the API it is reliable. As for the XML, we have some memory issues with very big histories, but do we know whether pywikibot handles those well?
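For illustration, the kind of entry-point check being discussed could look roughly like this with requests. This is a hedged sketch, not dumpgenerator's actual code; the function names are made up, and only the `meta=siteinfo` query is standard MediaWiki API. The JSON-parsing step is split out so it can be checked without network access:

```python
def sitename_from_siteinfo(data):
    """Extract the site name from a MediaWiki siteinfo API response,
    or return None if the response does not look like one."""
    try:
        return data["query"]["general"]["sitename"]
    except (KeyError, TypeError):
        return None

def probe_api(api_url, timeout=30):
    """Probe a candidate api.php URL with a siteinfo query.
    Returns the wiki's site name on success, None otherwise."""
    import requests  # the HTTP library dumpgenerator already depends on
    r = requests.get(api_url, params={
        "action": "query",
        "meta": "siteinfo",
        "format": "json",
    }, timeout=timeout)
    try:
        data = r.json()
    except ValueError:  # not JSON: probably not a MediaWiki API endpoint
        return None
    return sitename_from_siteinfo(data)
```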
I mean, we can use/copy some of their modules/functions. But I don't think we need to add the whole framework (which contains dozens of scripts and directories) as a dependency.
Before doing any move, I would like to see examples where WikiTeam fails and Pywikibot rocks.
Pywikibot has a great community of skilled coders. We can ask them for help fixing some of our bugs while we maintain our independence.
Emilio J. Rodríguez-Posada, 29/09/2014 10:08:
Our API entry point detection is pretty simple with the requests module and works fine, right?
Dunno. There's also the screen-scraping part, which is a bunch of regex hacks. Same for entry-point extraction, already handled by pwb: https://gerrit.wikimedia.org/r/160207
Page-list retrieval may fail while scraping HTML, but through the API it is reliable. As for the XML, we have some memory issues with very big histories, but do we know whether pywikibot handles those well?
You could test https://gerrit.wikimedia.org/r/#/c/136352/
I mean, we can use/copy some of their modules/functions.
Forking PWB is not an option.
But I don't think we need to add the whole framework (which contains dozens of scripts and directories) as a dependency.
Before doing any move, I would like to see examples where WikiTeam fails and Pywikibot rocks.
That's what the rewrite branch is for. :)
Pywikibot has a great community of skilled coders. We can ask them for help fixing some of our bugs while we maintain our independence.
Of course a partnership needs to have benefit for both sides.
The core pwb library (v2.0) doesn't have a lot of dependencies; in fact, it has only one: httplib2. We are in the process of completing/packaging pwb v2.0.
And that, I believe, is the primary problem: wikiteam has moved to requests, while pwb uses httplib2. I think we could solve that either by a) improving pwb to support requests, or b) lots of testing of wikiteam/requests plus a "pwb lite with only the httplib2 dependency" package.
What is the minimum version of Python that wikiteam wants to support? I see dumpgenerator tries to support Python 2.4 on line 35 ("from md5 import new as md5") to provide fixed maximum-length filenames (I think).
John Vandenberg, 29/09/2014 14:35:
What is the minimum version of Python that wikiteam /wants/ to support? I see dumpgenerator tries to support Python 2.4 on line 35 ("from md5 import new as md5") to provide fixed maximum-length filenames (I think).
I think 2.6+ is enough now; that backwards-compatibility code is from earlier on. Nowadays most compatibility complaints we get are from Python 3 users and the like.
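For context, the py2.4-only `from md5 import new as md5` import maps onto `hashlib.md5` on 2.6+ and Python 3. The fixed-maximum-length filename trick being described is presumably something like the sketch below; the length limit is illustrative, not wikiteam's actual value:

```python
import hashlib

def safe_filename(title, max_len=100):
    """Keep filenames under a fixed maximum length: over-long page
    titles get their tail replaced by an md5 digest of the full title,
    so names stay short, deterministic, and unlikely to collide."""
    if len(title) <= max_len:
        return title
    digest = hashlib.md5(title.encode("utf-8")).hexdigest()  # 32 hex chars
    # keep a readable prefix, append the digest for uniqueness
    return title[: max_len - len(digest)] + digest
```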
pywikibot works on python 3 ;-) with 500+ tests https://travis-ci.org/wikimedia/pywikibot-core
Thanks, John, for https://meta.wikimedia.org/w/index.php?title=WikiTeam%2FDumpgenerator_rewrite&diff=10039513&oldid=8892313. @emijrp, the last bullet changed is already one feature we'd gain.
@jayvdb Does pwb support old versions of MediaWiki (e.g. MW 1.9)?
And the XML, we have some issues with pretty big histories (memory issues), but do we know if pywikibot manage this OK?
I had forgotten it, but the rewrite page says "if the api call uses api.CachedRequest, it will write to the disk". So that's another bug it's supposed to fix.
Six years have passed, and by now newer versions of MediaWiki are probably more common out there than the ancient ones we focused on so much. It's possible that nowadays we can freeze the features of the old index.php scraping and rely on a library like mwclient for the newer versions. Time will tell.
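If the codebase were ever split along those lines, the dispatch could be as simple as a version check. A hedged sketch; the function name and the 1.11 cutoff are purely illustrative (mwclient's actual minimum supported MediaWiki version would decide the real cutoff):

```python
def pick_backend(mw_version, cutoff=(1, 11)):
    """Decide which download backend to use for a given MediaWiki
    version string: old wikis keep the frozen index.php scraper,
    newer ones go through a maintained library such as mwclient."""
    major_minor = tuple(int(p) for p in mw_version.split(".")[:2])
    return "mwclient" if major_minor >= cutoff else "index.php scraper"
```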