
Reduce the amount of dependency code to improve startup time

Open metalwarrior665 opened this issue 2 months ago • 3 comments

Which package is the feature request for? If unsure which one to select, leave blank

None

Feature

Importing Crawlee and the Apify SDK comes with a considerable Node.js startup cost. Measured on an Apify Actor run with 512 MB of memory (roughly the average in the Apify Store), importing just crawlee and apify adds more than 900ms to the startup time.

Importing just @crawlee/cheerio or @crawlee/playwright cuts this down to 550ms (or 650ms, respectively). Bundling the code with @vercel/ncc cuts it down to 350ms, and using the NODE_COMPILE_CACHE env var brings a further small reduction to 330ms, which is where we currently sit after these optimizations.

We should look into whether we can cut this number a bit more.

Recently discussed on Slack: https://apify.slack.com/archives/CD0SF6KD4/p1760687939450429?thread_ts=1760623161.409249&cid=CD0SF6KD4

Motivation

As our runtime environments and scraper code get faster, any remaining slow parts become the bottleneck. Apify Actors have reached the point where they frequently boot up a Docker container in under 1 second, while the code often just hits a few JSON APIs with minimal CPU work required.

Recently, we were able to find a lot of optimizations in the Crawlee and SDK code, and we could realistically see sub-second Actor runs soon, with the majority finishing in under 2 seconds. This would enable live API use-cases without the need for Standby servers.

Ideal solution or implementation, and any additional constraints

We probably still import a lot of code that is either not used at all or is a duplicate.

  1. More advanced bundlers might be able to tree-shake more, but it is hard to make them work because of ESM/CJS interop.
  2. Configure or patch dependencies that bring in a lot of unused code.
  3. Deduplicate our own dependencies, e.g. we use got for scraping and axios for the Apify client, and potentially impit or curl-impersonate on top of that.
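To illustrate point 3, the duplicated HTTP clients could in principle collapse behind one small shared helper. A minimal sketch, assuming the built-in `fetch` of Node 18+ is an acceptable common denominator (`httpGetJson` is a hypothetical name, not an existing Crawlee or SDK function):

```javascript
// Hypothetical shared HTTP helper: instead of shipping both got and axios,
// route simple JSON calls through one wrapper over the global fetch (Node 18+).
async function httpGetJson(url, options = {}) {
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  return res.json();
}
```

This would not cover advanced scraping needs (TLS fingerprinting, proxies), which is where impit or curl-impersonate would come in, but it removes two dependency trees from the plain-API path.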

Alternative solutions or implementations

No response

Other context

No response

metalwarrior665 avatar Oct 24 '25 11:10 metalwarrior665

One current issue is that we depend on Cheerio in many cases where we don't need to, e.g. when using just @crawlee/utils or an HttpCrawler that only needs to parse JSON.

I tested patching Cheerio out of the @crawlee packages with patch-package for HttpCrawler and got a decent improvement: roughly 330ms -> 290ms startup and a 2MB -> 1.6MB bundle reduction (using ncc).

metalwarrior665 avatar Oct 29 '25 11:10 metalwarrior665

Every crawler has the parseWithCheerio helper, including HttpCrawler.

Note that I don't see many opportunities to reduce dependencies. The only thing I can imagine is bundling in the actual project; there's not much we can do in Crawlee directly.

B4nan avatar Oct 29 '25 11:10 B4nan

> Every crawler has the parseWithCheerio helper, including HttpCrawler.

You could import it dynamically on first use.
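A minimal sketch of that idea, assuming a promise-caching lazy loader (`node:querystring` stands in for `cheerio` so the sketch stays self-contained; `loadParser`/`parseWithParser` are hypothetical names):

```javascript
// Lazy "import on first use": the heavy module is only loaded when the helper
// is actually called, so crawls that never parse HTML never pay its import cost.
let parserPromise;
function loadParser() {
  // Cache the promise so concurrent first callers share a single load.
  parserPromise ??= import('node:querystring');
  return parserPromise;
}
async function parseWithParser(body) {
  const { parse } = await loadParser();
  return parse(body);
}
```

The trade-off is that the helper becomes async (or stays async but resolves later on first call), and the first invocation absorbs the one-time import latency instead of startup.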

metalwarrior665 avatar Oct 29 '25 11:10 metalwarrior665