
feat(parallel): add rust gRPC high performance parallel processing


  • feat: start Rust high-performance, efficient parallel crawling

This PR starts the integration with a native, high-performance gRPC crawler that is the fastest and most efficient OSS indexer for finding pages to crawl. It is designed mainly as an indexer: it discovers URLs and hands the work off to another system that handles whatever x, y, and z processing is needed. The crawler repo extends spider with gRPC capabilities. Using the native multi-threaded crawler to gather the pages up front is vastly more performant than discovering links while each page is being handled (which also cannot be done in parallel).

The initial crawl is kicked off from the basic crawler (https://github.com/apify/apify-js/compare/master...j-mendez:apify-js:feat/high-perf-rust-crawl?expand=1#diff-bb6b40450451835ae5b3b7fa2c2a3d9cbfb4b7fcd43d8fc62eeeb90fa58db4bcR500), with a console log at https://github.com/apify/apify-js/compare/master...j-mendez:apify-js:feat/high-perf-rust-crawl?expand=1#diff-da4d948e7ad86fb08c507984955430c6cac2ef8ed01efbc5098f1a3c70ad3269R35. This is where the Puppeteer page crawls would be performed as pages are found. At the moment I do not have enough time to finish wiring up the exact crawl methods, but I provided a placeholder log where the page actions would run, and an event emitter that can be used to determine when the crawler has finished gathering pages and the Puppeteer pages are done. All of the handling is done via async streams, allowing an enormous amount of processing at once (around 10,000 static pages crawled in under a minute when delays are disabled).
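
For illustration only, here is a minimal sketch (not the code in the diff) of the shape this integration takes on the Node side: URLs streamed from the native indexer are consumed as an async iterable, a placeholder stands in where the Puppeteer page actions would run, and an event emitter signals when the indexer and the page handlers are done. The `crawlStream` source and the event names are made up for the example.

```typescript
import { EventEmitter } from 'events';

// Hypothetical async iterable of URLs streamed back from the Rust gRPC indexer.
// In the PR, the stream originates from the native crawler; here it is stubbed out.
async function* crawlStream(startUrl: string): AsyncGenerator<string> {
    yield startUrl; // placeholder: real URLs would arrive over the gRPC stream
}

const events = new EventEmitter();

async function runCrawl(startUrl: string): Promise<void> {
    const pending: Promise<void>[] = [];

    for await (const url of crawlStream(startUrl)) {
        // Placeholder for the Puppeteer page actions that would run per URL.
        pending.push(Promise.resolve(console.log(`would open page for ${url}`)));
    }

    // Indexer finished gathering pages; wait for the page handlers to drain.
    events.emit('indexer-finished');
    await Promise.all(pending);
    events.emit('pages-finished');
}

events.on('pages-finished', () => console.log('crawl complete'));
runCrawl('https://example.com').catch(console.error);
```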

Here are the benchmarks for crawling using the spider Rust crate.

I thought this library was cool and figured this would be the next thing to add to take it up a notch. (Sorry about the Prettier formatting on the extra files.)

Here's a screenshot of the log output from `npm run test test/crawlers/basic_crawler.test.js`:

Pages being crawled alongside the basic crawler using the Rust gRPC indexer

Architectural reasons why this is fast, efficient, and a good choice:

  1. A request comes in and the crawler spawns a primary thread.
  2. The request then gathers pages in a sub-thread pool, re-using the connection from the primary thread.
  3. The page-finding algorithm is optimized to be faster than the usual ways of determining whether a page exists: it avoids running CSS parser selectors against what a web page should contain, which outperforms `has` selectors or any regex usage.
  4. Links found are streamed to the gRPC server connected on the primary thread (creating a connection at the thread level and talking to the gRPC server outperforms streaming connections and passing the initial stream context across sub-threads and pools). A rough sketch of the receiving side follows this list.
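
To make step 4 a bit more concrete, here is a hedged sketch of the receiving end: a gRPC service that accepts a client-streamed sequence of link messages and replies once the stream ends. The proto file, service, and message fields (`indexer.proto`, `Indexer`, `Scan`, `pagesFound`) are assumptions for the example, not the definitions used in the linked crawler repo.

```typescript
import * as grpc from '@grpc/grpc-js';
import * as protoLoader from '@grpc/proto-loader';

// Hypothetical proto: service Indexer { rpc Scan (stream Link) returns (ScanReply); }
const packageDef = protoLoader.loadSync('indexer.proto');
const proto = grpc.loadPackageDefinition(packageDef) as any;

const server = new grpc.Server();

server.addService(proto.indexer.Indexer.service, {
    // Client-streaming handler: links found by the native crawler arrive as a stream.
    Scan(call: grpc.ServerReadableStream<any, any>, callback: grpc.sendUnaryData<any>) {
        let count = 0;
        call.on('data', (link) => {
            count += 1;
            console.log(`received link: ${link.url}`);
        });
        call.on('end', () => callback(null, { pagesFound: count }));
    },
});

server.bindAsync('0.0.0.0:50051', grpc.ServerCredentials.createInsecure(), () => {
    server.start();
});
```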

Excuse the ugly quick drawing below explaining some of the choices =)

Rough drawing of the architecture: how threads and pools are handled for crawling while re-using connections

j-mendez · Jul 11 '22