mw
Slowness when the number of pages in cache increases
Basically, I tried to pull 892 file descriptions from a category. The first 25 were pulled in very fast, but as the total number of files in the cache increased, the process became slower and slower. Here is the number of file descriptions pulled per minute:
17:11 38 17:12 33 17:13 21 17:14 17 17:15 14 17:16 12 17:17 12 17:18 10 17:19 10 17:20 9 17:21 8 17:22 9 17:23 7 17:24 8 17:25 7 17:26 7 17:27 6 17:28 6 17:29 6 17:30 6 17:31 6 17:32 5 17:33 5 17:34 6 17:35 5 17:36 4 17:37 5 17:38 5 17:39 5 17:40 4 17:41 5 17:42 4 17:43 4 17:44 5 17:45 4 17:46 4 17:47 3 17:48 4 17:49 4 17:50 4 17:51 4 17:52 3 17:53 4 17:54 3 17:55 4 17:56 3 (...)
21:25 1 21:26 1
I killed the process at this point; 737 files had been pulled in...
I think the pagedict and the cache system need another structure and some rework to stay efficient once the total number of files in the cache exceeds 250...
Any opinions? Ideas?
On 06/21/2012 04:55 PM, yves tennevin wrote:
I think the pagedict and the cache system need another structure and some rework to stay efficient once the total number of files in the cache exceeds 250...
I never fully grokked the pull and categories code (you can see my questions in the code), but I was able to add a few of the features I wanted. Similarly, if there's something you want, feel free to fork and improve.
When I've encountered problems like this elsewhere (i.e., I know I'm scraping lots of pages and exceeding memory), sometimes the fix is as simple as using a file-based dict, which is slower when N is small but better when N is big.
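For what it's worth, a minimal sketch of that idea using Python's standard-library shelve module (the filename and sample key are made up; I'm not saying this is what mw should adopt):

    import shelve

    # Entries live on disk, so memory stays flat no matter how many
    # pages end up cached; lookups cost a disk read instead.
    pagedict = shelve.open('pagedict.db')
    try:
        pagedict['File:Example.jpg'] = {'id': 1234, 'touched': '2012-06-21'}
        print(pagedict['File:Example.jpg'])
    finally:
        pagedict.close()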
About the speed issue: several factors have an impact on speed:
- PullCommand (in this version) calls Pull for each file. I rewrote it so that all files are stacked into the args and the Pull command is called once per 500 files/pages pulled.
- the api_setup method was called several times by PullCommand, once for each Pull call... Note that this has no impact on other commands, since they do not repeat a given command.
- the pagedict from metadir is more problematic: it's probably the source of the slowness here. Once I stopped loading the pagedict each time a file was pulled, the 892 file descriptions were pulled in 7 minutes... Still, that does not help when files are pulled manually with the pull command, as the pagedict will be loaded each time. Splitting it across several md5 hashes could probably help here (see the sketch after this list)...
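To make the md5 idea concrete, here is a purely hypothetical sketch (none of these names exist in metadir): shard the pagedict into buckets keyed by a hash prefix of the page name, so each pull loads one small bucket instead of the whole dict:

    import hashlib
    import json
    import os

    def bucket_path(metadir, pagename):
        # First two hex chars of the md5 give 256 buckets.
        digest = hashlib.md5(pagename.encode('utf-8')).hexdigest()
        return os.path.join(metadir, 'pagedict-%s.json' % digest[:2])

    def load_entry(metadir, pagename):
        path = bucket_path(metadir, pagename)
        if not os.path.exists(path):
            return None
        with open(path) as fd:
            return json.load(fd).get(pagename)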
About the questions in the code:
- range(0, len(pages), 25)]: # what does this '25' do? - reagle
While I ain't the author of this code, I think it processes the args (files/pages to pull) in chunks of 25 entries. The API normally allows requesting several entries at a time: according to the API doc, the default limit is 50 for normal users and 500 for bots. I assume 25 is a safe number here.
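Roughly, I read that loop like this (a reconstruction from the snippets in this thread, not a copy of the actual code):

    # Query the API in chunks of 25 titles; action=query accepts
    # several titles in one request, joined with '|'.
    for i in range(0, len(pages), 25):
        these_pages = pages[i:i + 25]
        params = {
            'action': 'query',
            'titles': '|'.join(these_pages),
            'prop': 'info|revisions',
            'rvprop': 'ids|flags|timestamp|user|comment|content',
        }
        # ... send the request with these params ...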
Is the revisions list a sorted one, should I use [0] or [-1]? - reagle
Looking at the data: the query is built with 'action': 'query', 'titles': '|'.join(these_pages), 'prop': 'info|revisions', 'rvprop': 'ids|flags|timestamp|user|comment|content', which would result in (for example):
http://commons.wikimedia.org/w/api.php?action=query&prop=info|revisions&meta=siteinfo&titles=User:Esby/test&rvprop=ids|flags|timestamp|user|comment|content
==> rev revid="72960419" parentid="58809473" user="Esby-mw-bot" timestamp="2012-06-19T22:24:54Z" comment="testing new version of mw" xml:space="preserve"
I think revid here does not contain the full list of revisions, just information on the last revision and the previous one (parentid), so to answer, I'd just check the content of a query. I think [0] works. (I did not write the Pull code, only the Category Pull code.)
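Checking it could be as simple as fetching the same query as JSON and counting what 'revisions' holds; this is the standard response shape as I understand it, so verify against a live query:

    import json
    from urllib.request import urlopen

    url = ('http://commons.wikimedia.org/w/api.php?action=query'
           '&prop=info|revisions&titles=User:Esby/test'
           '&rvprop=ids|timestamp|user|comment&format=json')
    data = json.load(urlopen(url))
    for page in data['query']['pages'].values():
        revs = page.get('revisions', [])
        # With the default rvlimit only one revision comes back,
        # so [0] is safe for that case.
        print(len(revs), revs[0]['revid'] if revs else None)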
On 06/21/2012 06:33 PM, yves tennevin wrote:
While I ain't the author of this code, I think it processes the args (files/pages to pull) in chunks of 25 entries.
That's what I suspected; I changed the comment.
I think revid here does not contain the full list of revisions, just information on the last revision and the previous one (parentid), so to answer, I'd just check the content of a query. I think [0] works. (I did not write the Pull code, only the Category Pull code.)
Yes, and it's wiki_revids that is the sorted list of revids (which is created later). I updated the comment.
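Something like this, using the revids from the example response above (the variable names are illustrative):

    # Per-query, revisions[0] is the newest entry; wiki_revids is the
    # full sorted list, so there the newest revid is at the end.
    revids = [72960419, 58809473]  # revid and parentid from the sample
    wiki_revids = sorted(revids)
    latest = wiki_revids[-1]       # 72960419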
I don't use this code much at all anymore, and I don't think Ian does either. But when you are comfortable with your changes, feel free to send a pull request.