Changes to reduce RAM usage
Hi,
We found that with large, responsive sites (e.g. Wikipedia), the page_queue could quickly and continuously grow until we ran out of RAM.
I believe this is because the thread that processes the crawled pages does not get adequate time to run when crawling a responsive site -- there's not much free time available waiting for HTTP responses. We could use the 'sleep' option, but this unnecessarily slows crawling on smaller/less responsive sites.
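One alternative to a fixed sleep is back-pressure: bound the queue so fetcher threads block when the processor falls behind, instead of letting the queue grow. A minimal sketch using Ruby's stdlib SizedQueue (the names and structure here are hypothetical, not Anemone's actual internals):

```ruby
require "thread"

# Hypothetical sketch: a bounded page queue applies back-pressure, so
# fetcher threads pause when the processor can't keep up, capping RAM use.
MAX_QUEUED_PAGES = 100
page_queue = SizedQueue.new(MAX_QUEUED_PAGES)

# Simulated fetcher threads: << blocks when the queue is full.
fetchers = 4.times.map do |i|
  Thread.new do
    25.times { |n| page_queue << "page-#{i}-#{n}" }
  end
end

# Simulated processor thread draining the queue.
processed = []
processor = Thread.new do
  100.times { processed << page_queue.pop }
end

fetchers.each(&:join)
processor.join
puts processed.size
```

Unlike a fixed 'sleep', this only slows fetching when the queue is actually full, so small or slow sites crawl at full speed.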
Changing the PageStore option has no effect on this as the page_queue does not live there.
Secondly, we are running multiple concurrent crawls, so we can't use any of the PageStore alternatives, which assume one crawl at a time. I therefore added an option to not retain the processed page data, as it was consuming RAM for a feature we don't currently need (after_crawl).
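The shape of the change is roughly this (a sketch under assumed names; the actual option name and method signatures in the patch may differ):

```ruby
# Hypothetical stand-in for per-page callbacks (on_every_page etc.).
def handle(page)
  # process the page, run user callbacks, etc.
end

# When retention is disabled, processed pages are dropped after handling
# instead of being accumulated for after_crawl.
def process_pages(pages, retain_after_crawl: true)
  retained = []
  pages.each do |page|
    handle(page)
    retained << page if retain_after_crawl
  end
  retained
end

puts process_pages(%w[a b c], retain_after_crawl: false).size  # => 0
puts process_pages(%w[a b c], retain_after_crawl: true).size   # => 3
```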
I'm happy to split the changes up if that would change their acceptability. I appreciate the way I implemented the second change is not ideal.
Thanks, Jamie
Any movement here?
I haven't completely diagnosed this yet, but on a site with 28,000,000 pages indexed by Google, I'm expecting a memory problem, having watched a 20-thread process grow from nominal memory usage at the beginning of the run to more than 1 GB in 1.5 hours. (Having crawled 626,419 pages, according to echo 'KEYS anemone:pages:*' | redis-cli | wc -l)
I'm thinking about trying out this patch to rein in the memory usage, as I also don't need the after_crawl feature. I would have expected Anemone to store that page list in Redis and retrieve it from the backend store after the crawl, rather than persisting it in memory.
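The idea above can be sketched as follows, with a Hash standing in for the Redis connection so the example is self-contained; the key scheme mirrors the anemone:pages:* keys counted earlier, but the helper names are hypothetical:

```ruby
# Stand-in for a Redis connection; a real implementation would use the
# redis gem and the same anemone:pages:<url> key scheme.
store = {}

# Record each crawled page in the backend store, not in process memory.
def record_page(store, url)
  store["anemone:pages:#{url}"] = Marshal.dump(url: url)
end

record_page(store, "http://example.com/a")
record_page(store, "http://example.com/b")

# After the crawl, the page list is reconstructed by scanning keys
# (the equivalent of KEYS anemone:pages:* against Redis).
after_crawl_urls = store.keys.grep(/\Aanemone:pages:/)
puts after_crawl_urls.size  # => 2
```

This way after_crawl could still be offered without holding every page in RAM for the duration of the crawl.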