
Changes to reduce RAM usage

wordtracker opened this issue 13 years ago · 1 comment

Hi,

We found that with large, responsive sites (e.g. Wikipedia), the page_queue could grow quickly and continuously until we ran out of RAM.

I believe this is because the thread that processes crawled pages doesn't get adequate time to run when crawling a responsive site: there is little idle time spent waiting for HTTP responses. We could use the 'sleep' option, but that unnecessarily slows crawling on smaller or less responsive sites.
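One way to attack this class of problem (a sketch only, not what the patch does, and the names here are illustrative rather than Anemone's API) is to bound the queue so that HTTP worker threads block once it is full, giving the processing thread time to drain it. Ruby's SizedQueue does exactly this: `push` blocks when the queue holds its maximum number of items.

```ruby
require 'thread'

# Illustrative sketch: a bounded queue between crawl workers and the
# page-processing thread. SizedQueue#push blocks when the queue is full,
# so fast producers can't grow the backlog without limit.
MAX_QUEUED_PAGES = 100
page_queue = SizedQueue.new(MAX_QUEUED_PAGES)

producer = Thread.new do
  10.times { |i| page_queue.push("page-#{i}") } # blocks if the queue is full
  page_queue.push(:done)                        # sentinel to stop the consumer
end

processed = []
while (page = page_queue.pop) != :done
  processed << page # in Anemone this would be link extraction, callbacks, etc.
end
producer.join
```

This trades a small amount of crawl throughput on fast sites for a hard cap on queue memory, without the fixed per-request delay that the 'sleep' option imposes.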

Changing the PageStore option has no effect on this, since the page_queue does not live there.

Secondly, we run multiple concurrent crawls, so we can't use any of the PageStore alternatives, which assume one crawl at a time. I therefore added an option to not retain the processed page data, since it was using RAM for a feature (after_crawl) that we don't currently need.
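The idea behind that second change can be sketched roughly as follows. This is a minimal illustration, not the actual patch: the option name (`retain_pages`) and class name (`PageKeeper`) are hypothetical, not part of Anemone's API.

```ruby
# Hypothetical sketch: when retain_pages is false, processed page data is
# dropped immediately instead of being accumulated for after_crawl.
class PageKeeper
  attr_reader :pages

  def initialize(retain_pages: true)
    @retain_pages = retain_pages
    @pages = {}
  end

  def record(url, page_data)
    # Only keep the page if the caller needs after_crawl-style access later.
    @pages[url] = page_data if @retain_pages
  end
end

keeper = PageKeeper.new(retain_pages: false)
keeper.record("http://example.com/", "<html>...</html>")
# keeper.pages stays empty, so memory stays flat regardless of crawl size
```

With the flag off, memory usage no longer grows with the number of pages crawled, at the cost of after_crawl having nothing to iterate over.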

I'm happy to split the changes into separate patches if that would make them easier to accept. I appreciate that the way I implemented the second change is not ideal.

Thanks, Jamie

wordtracker · Apr 23 '12 10:04

Any movement here?

I haven't fully diagnosed this yet, but on a site with 28,000,000 pages indexed by Google, I'm expecting a memory problem: I watched a 20-thread process grow from nominal memory usage at the start of the run to more than 1 GB in 1.5 hours, having crawled 626,419 pages according to `echo 'KEYS anemone:pages:*' | redis-cli | wc -l`.

I'm thinking about trying this patch to rein in memory usage, since I also don't need the after_crawl feature. I would have expected Anemone to store that page list in Redis and retrieve it from the backend store after the crawl, rather than persisting it in memory.
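The suggestion above (streaming pages back out of the backend store rather than holding an in-memory list) could look something like this sketch. It is purely illustrative: `StreamingStore` and `each_page` are made-up names, and a plain Hash stands in for a Redis-backed PageStore.

```ruby
# Hypothetical sketch: iterate pages one at a time from the backing store
# after the crawl, instead of materialising the full page list in RAM.
class StreamingStore
  def initialize
    @backend = {} # swap for a Redis-backed store in a real crawl
  end

  def []=(url, page)
    @backend[url] = page
  end

  def each_page(&block)
    @backend.each(&block) # yields one (url, page) pair at a time
  end
end

store = StreamingStore.new
store["http://example.com/a"] = { code: 200 }
urls = []
store.each_page { |url, _page| urls << url }
```

An after_crawl hook built on an iterator like this would keep memory usage independent of the number of pages crawled, since only one page needs to be in Ruby memory at any moment.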

leehambley · Sep 24 '12 20:09