Changes to reduce RAM usage
Hi,
We found that with large, responsive sites (e.g. Wikipedia), the page_queue could quickly and continuously grow until we ran out of RAM.
I believe this is because the thread that processes the crawled pages does not get adequate time to run when crawling a responsive site -- there's not much free time available waiting for HTTP responses. We could use the 'sleep' option, but this unnecessarily slows crawling on smaller/less responsive sites.
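One alternative to a fixed sleep is back-pressure: bound the queue so fetcher threads block when the processor falls behind, instead of letting the queue grow. A minimal sketch using Ruby's stdlib SizedQueue (the names and structure here are hypothetical, not Anemone's actual internals):

```ruby
require "thread"

# Hypothetical sketch: a bounded page queue applies back-pressure, so
# fetcher threads pause when the processor can't keep up, capping RAM use.
MAX_QUEUED_PAGES = 100
page_queue = SizedQueue.new(MAX_QUEUED_PAGES)

# Simulated fetcher threads: << blocks when the queue is full.
fetchers = 4.times.map do |i|
  Thread.new do
    25.times { |n| page_queue << "page-#{i}-#{n}" }
  end
end

# Simulated processor thread draining the queue.
processed = []
processor = Thread.new do
  100.times { processed << page_queue.pop }
end

fetchers.each(&:join)
processor.join
puts processed.size
```

Unlike a fixed 'sleep', this only slows fetching when the queue is actually full, so small or slow sites crawl at full speed.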
Changing the PageStore option has no effect on this as the page_queue does not live there.
Secondly, we are running multiple concurrent crawls, so we can't use any of the PageStore alternatives, which assume one crawl at a time. I therefore added an option to not retain the processed page data, as it was consuming RAM for a feature we don't currently need (after_crawl).
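The shape of the change is roughly this (a sketch under assumed names; the actual option name and method signatures in the patch may differ):

```ruby
# Hypothetical stand-in for per-page callbacks (on_every_page etc.).
def handle(page)
  # process the page, run user callbacks, etc.
end

# When retention is disabled, processed pages are dropped after handling
# instead of being accumulated for after_crawl.
def process_pages(pages, retain_after_crawl: true)
  retained = []
  pages.each do |page|
    handle(page)
    retained << page if retain_after_crawl
  end
  retained
end

puts process_pages(%w[a b c], retain_after_crawl: false).size  # => 0
puts process_pages(%w[a b c], retain_after_crawl: true).size   # => 3
```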
I'm happy to split the changes up if that would change their acceptability. I appreciate the way I implemented the second change is not ideal.
Thanks, Jamie
Any movement here?
I haven't completely diagnosed this yet, but on a site with 28,000,000 pages indexed by Google, I'm expecting a memory problem, having watched a 20-thread process grow from nominal memory usage at the beginning of the run to more than 1 GB in 1.5 hours. (Having crawled 626,419 pages, according to echo 'KEYS anemone:pages:*' | redis-cli | wc -l)
I'm thinking about trying out this patch to rein in the memory usage, as I also don't need the after_crawl feature. I would have expected Anemone to store that page list in Redis and retrieve it from the backend store after the crawl, rather than persisting it in memory.
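The idea above can be sketched as follows, with a Hash standing in for the Redis connection so the example is self-contained; the key scheme mirrors the anemone:pages:* keys counted earlier, but the helper names are hypothetical:

```ruby
# Stand-in for a Redis connection; a real implementation would use the
# redis gem and the same anemone:pages:<url> key scheme.
store = {}

# Record each crawled page in the backend store, not in process memory.
def record_page(store, url)
  store["anemone:pages:#{url}"] = Marshal.dump(url: url)
end

record_page(store, "http://example.com/a")
record_page(store, "http://example.com/b")

# After the crawl, the page list is reconstructed by scanning keys
# (the equivalent of KEYS anemone:pages:* against Redis).
after_crawl_urls = store.keys.grep(/\Aanemone:pages:/)
puts after_crawl_urls.size  # => 2
```

This way after_crawl could still be offered without holding every page in RAM for the duration of the crawl.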