feat(parallel): add rust gRPC high performance parallel processing
- feat: initial Rust high-performance, efficient parallel crawling
This PR starts the integration with a native, high-performance gRPC crawler that is among the fastest and most efficient OSS indexers for finding pages to crawl. It is designed primarily as an indexer: it discovers URLs and hands the work off to another system that can handle whatever x, y, and z is needed. The crawler repo extends spider with gRPC capabilities. Using the native multi-threaded crawler to gather the pages up front is vastly more performant than doing that work during the page view (which also cannot be done in parallel).

The initial crawl is kicked off from the basic crawler at https://github.com/apify/apify-js/compare/master...j-mendez:apify-js:feat/high-perf-rust-crawl?expand=1#diff-bb6b40450451835ae5b3b7fa2c2a3d9cbfb4b7fcd43d8fc62eeeb90fa58db4bcR500, with console logging at https://github.com/apify/apify-js/compare/master...j-mendez:apify-js:feat/high-perf-rust-crawl?expand=1#diff-da4d948e7ad86fb08c507984955430c6cac2ef8ed01efbc5098f1a3c70ad3269R35. This is where the Puppeteer page crawls would be performed as pages are found. At the moment I do not have enough time to finish and call the exact crawl methods, but I provided a placeholder log where the page actions would run and an event emitter that can be used to determine when the crawler finishes gathering pages and when the Puppeteer pages are finished. All of the handling is done via async streams, allowing an insane amount of processing at once (roughly 10,000 static pages crawled in under a minute when delays are disabled).
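To make the intended hand-off concrete, here is a minimal TypeScript sketch of how such an event emitter could drive the Puppeteer side. The `runPageActions` helper and the `page` / `finished` event names are placeholders for illustration, not the actual API added in this PR.

```ts
import { EventEmitter } from 'node:events';
import puppeteer from 'puppeteer';

// `crawlerEvents` stands in for whatever emitter the native gRPC crawler
// exposes; the 'page' and 'finished' event names are assumptions.
export async function runPageActions(crawlerEvents: EventEmitter): Promise<void> {
    const browser = await puppeteer.launch();
    const pending: Promise<void>[] = [];

    // Every URL streamed back from the Rust indexer gets its own Puppeteer page action.
    crawlerEvents.on('page', (url: string) => {
        pending.push(
            (async () => {
                const page = await browser.newPage();
                await page.goto(url);
                // ...this is where the real page-handling work would go...
                await page.close();
            })(),
        );
    });

    // Wait for the native crawler to report it has finished gathering pages,
    // then wait for the in-flight Puppeteer pages as well.
    await new Promise<void>((resolve) => crawlerEvents.once('finished', () => resolve()));
    await Promise.all(pending);
    await browser.close();
}
```

Waiting on both the `finished` event and the in-flight pages mirrors the two completion signals described above: the crawler finishing its gathering, and the Puppeteer pages finishing their work.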
Here are the benchmarks for crawling with the spider Rust crate:
I thought this library was cool and figured this would be the next thing to add to take it up a notch. (Sorry about the Prettier formatting on the extra files.)
Here's a screenshot of the log output from `npm run test test/crawlers/basic_crawler.test.js`:
Architecture notes on why this is fast, efficient, and a good choice:
- a request comes in and the crawler spawns a primary thread.
- the request then gathers websites in a sub-threaded pool, re-using the connection from the primary thread.
- the page-finding algorithm is optimized to be faster than other ways of determining whether a page exists, using `not` CSS parser selectors against what a web page should be, which outperforms `has` or any regex usage.
- links found are streamed to the gRPC server connected on the primary thread (creating the connection at the thread level and talking to the gRPC server outperforms opening streaming connections and passing the initial stream context across sub-threads and pools); a rough sketch of the receiving side follows this list.
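To illustrate that last point, here is a rough TypeScript sketch of what the receiving end of that link stream could look like. The `crawler.proto` file, the `crawler.Crawler` service, and the client-streaming `StreamLinks` RPC are made-up names for illustration; they are not the actual proto definitions used by the crawler repo.

```ts
import * as grpc from '@grpc/grpc-js';
import * as protoLoader from '@grpc/proto-loader';

// Hypothetical proto: a client-streaming RPC where the Rust crawler pushes
// each discovered link up to the Node side as soon as it is found.
const packageDefinition = protoLoader.loadSync('crawler.proto');
const proto = grpc.loadPackageDefinition(packageDefinition) as any;

const server = new grpc.Server();
server.addService(proto.crawler.Crawler.service, {
    // Each message on the stream is one link found by a crawl sub-thread;
    // the Rust side keeps a single connection open on its primary thread.
    StreamLinks(
        call: grpc.ServerReadableStream<{ url: string }, { ok: boolean }>,
        callback: grpc.sendUnaryData<{ ok: boolean }>,
    ) {
        call.on('data', (link: { url: string }) => {
            // enqueue link.url here for the Puppeteer page actions
        });
        call.on('end', () => callback(null, { ok: true }));
    },
});

server.bindAsync('0.0.0.0:50051', grpc.ServerCredentials.createInsecure(), () => {
    server.start();
});
```

Keeping one long-lived streaming call per crawl matches the bullet above: the connection lives on the primary thread instead of passing stream contexts across the sub-threads and pools.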
Excuse the ugly quick drawing below explaining some of these choices =).
