
Explore parallelization for CrawlerCheerio

jancurn opened this issue 6 years ago · 1 comment

Cheerio is quite CPU intensive, so at higher crawler concurrency the CPU chokes. We should explore whether it's possible to run the Cheerio download and parsing in a separate worker thread or process. Theoretically, we could somehow transform and serialize the Cheerio object so that it can be sent between processes. The question is how much overhead the serialization will add. We need to test it and only enable this feature when the CPU is choking.

jancurn · Jan 16 '19 12:01

Node 12 lets us use worker_threads natively, with the ability to share memory between threads, efficient communication/serialization, and atomic operations, all out of the box. We could learn from Rust's Tokio and its work-stealing thread-pool approach.

pocesar · Mar 28 '20 18:03