Add support for Bun runtime - Issue with `browser-pool` and `memory-storage` packages
Which package is the feature request for? If unsure which one to select, leave blank
None
Feature
Hi, I'm cross-posting the notes from https://github.com/apify/proxy-chain/issues/521 (below), where I tried to run an Apify/Crawlee Playwright scraper in the Bun runtime.
TL;DR - Currently it's not possible to run Apify/Crawlee scrapers with Bun. There are (at least) 2 unsupported features on the Crawlee side, and (at least) 1 error on the Playwright side.
1. There was an error in `@crawlee/browser-pool/proxy-server.js` at the line `server.server.unref();`. I looked into it. The `unref` should refer to `http.Server.unref`. For some reason, this isn't defined in Bun, and this seems to be a genuine error on their side (it's not even reported in their docs). See the defensive guard sketched after this list.
2. Out of curiosity, I just commented out that line to see if I could get the crawler to work. It printed the initial log with system info:

   ```
   INFO System info {"apifyVersion":"3.1.4","apifyClientVersion":"2.7.1","crawleeVersion":"3.3.1","osType":"Darwin","nodeVersion":"v18.15.0"}
   ```

   However, the run still ended in an error. Here, `promises_1.opendir` refers to `fs.promises.opendir` (`node:fs`). Unfortunately, none of the `opendir` functions are currently defined in Bun (`fs.opendirSync`, `fs.opendir`, `fs.promises.opendir`).

   ```
   ERROR (0, promises_1.opendir) is not a function. (In '(0, promises_1.opendir)(keyValueStoreDir)', '(0, promises_1.opendir)' is undefined)
   TypeError: (0, promises_1.opendir) is not a function. (In '(0, promises_1.opendir)(keyValueStoreDir)', '(0, promises_1.opendir)' is undefined)
       at <anonymous> (/Users/presenter/repos/apify-actor-facebook/node_modules/@crawlee/memory-storage/cache-helpers.js:110:25)
   ```
3. I managed to start a Playwright crawler in Bun with the following changes to the Apify packages:
   - I commented out `server.server.unref();` in `@crawlee/browser-pool/proxy-server.js`.
   - I replaced `fs.promises.opendir(dirName)` with `fs.promises.readdir(dirName, { withFileTypes: true })` in `@crawlee/memory-storage/cache-helpers.js` (see the sketch after this list).
     - NOTE: The good thing is that with the `withFileTypes: true` option, both `opendir` and `readdir` resolve to an iterable of `Dirent`. The bad thing, from my understanding, is that `opendir` yields the entries one by one as they are found, whereas `readdir` resolves only once all items have been found. So replacing `opendir` with `readdir` might add extra waiting time.
4. With the changes in step 3, I managed to start a Playwright crawler, to the point where the Playwright command was executed. Afterwards, there is an issue on the Playwright side with `child_process.spawn`. You can find more about that issue here: https://github.com/oven-sh/bun/issues/4253.
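Regarding the `server.server.unref()` crash from step 1: a minimal sketch of a defensive workaround is to call `unref` only when the runtime actually provides it. This is not a patch of Crawlee's or proxy-chain's actual internals; it just illustrates the guard against a plain `http.Server`, and the `maybeUnref` helper name is made up for illustration.

```typescript
import { createServer, Server } from 'node:http';

// Hypothetical helper: call unref() only if the current runtime implements it.
// Node's http.Server inherits unref() from net.Server; at the time of writing,
// Bun did not define it, which is what crashed @crawlee/browser-pool.
function maybeUnref(server: Server): void {
    if (typeof server.unref === 'function') {
        // Allows the process to exit even while this server is still listening.
        server.unref();
    } else {
        // Runtime (e.g. Bun) lacks unref(); skip it instead of throwing.
        console.warn('http.Server#unref() is not available in this runtime, skipping');
    }
}

// Usage sketch:
const server = createServer((req, res) => {
    res.end('ok');
});
server.listen(0, () => maybeUnref(server));
```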
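And here is a minimal sketch of the `opendir` → `readdir` swap described in step 3. It is not the actual `@crawlee/memory-storage/cache-helpers.js` code, and the function names and the example directory path are only illustrative; it just shows why the two are interchangeable when `withFileTypes: true` is used: both hand you `Dirent` objects, but `readdir` buffers the whole listing before resolving, while `opendir` streams entries as they are found.

```typescript
import { opendir, readdir } from 'node:fs/promises';

// Streaming variant: entries arrive one by one.
// fs.promises.opendir was not available in Bun when this issue was written.
async function* scanWithOpendir(dir: string): AsyncGenerator<string> {
    for await (const entry of await opendir(dir)) {
        if (entry.isFile()) yield entry.name;
    }
}

// Buffered variant (the workaround): with `withFileTypes: true`, readdir also
// yields Dirent objects, but only after the entire directory has been read.
async function* scanWithReaddir(dir: string): AsyncGenerator<string> {
    for (const entry of await readdir(dir, { withFileTypes: true })) {
        if (entry.isFile()) yield entry.name;
    }
}

// Usage sketch: both variants should list the same file names for a directory.
for await (const name of scanWithReaddir('./storage/key_value_stores/default')) {
    console.log(name);
}
```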
Motivation
Make Crawlee scrapers more performant by using the Bun runtime instead of Node.js.
Ideal solution or implementation, and any additional constraints
Be able to run Crawlee scrapers with Bun. However, Bun is still experimental, so this is a slow-burner.
Alternative solutions or implementations
No response
Other context
No response
We'd definitely want to support Bun (as well as Deno) at some point, but as you already pointed out, it will mostly be about them providing the missing APIs rather than us changing something.
Also, keep in mind that the speed difference will most probably not be measurable when it comes to the actual scraping - the slowness comes from the network traffic (doing requests) and proxy usage, not from slow JavaScript execution.
As noted, this is an issue with the Bun runtime, so feel free to close, @B4nan 👍
This can be tracked here: https://github.com/oven-sh/bun/issues/5606
+1