crawlee
Multiple calls to enqueueLinks with Promise.all result in a crash
Which package is this bug report for? If unsure which one to select, leave blank
None
Issue description
Executing the snippet below results in the following error:
Error: Lock file is already being held
at /home/janbuchar/Projekty/Apify/enqueue-links-crash-repro/node_modules/proper-lockfile/lib/lockfile.js:68:47
at callback (/home/janbuchar/Projekty/Apify/enqueue-links-crash-repro/node_modules/graceful-fs/polyfills.js:306:20)
at FSReqCallback.oncomplete (node:fs:205:5)
at FSReqCallback.callbackTrampoline (node:internal/async_hooks:130:17)
Also, it's strange that the stack trace doesn't lead back to user code.
Code sample
import { CheerioCrawler } from 'crawlee';
const startUrls = ['https://warehouse-theme-metal.myshopify.com/collections'];
const crawler = new CheerioCrawler({
    requestHandler: async ({ request, enqueueLinks, parseWithCheerio }) => {
        const $ = await parseWithCheerio();
        console.log(`Processing ${request.loadedUrl}`);
        const urls = $('a.collection-block-item, a.pagination__next, .product-item > a')
            .toArray()
            .map((it) => it.attribs['href']);
        await Promise.all(urls?.map((url) => enqueueLinks({ urls })));
    },
    maxRequestRetries: 0,
    maxConcurrency: 1,
});
await crawler.run(startUrls);
Package version
3.7.3
Node.js version
v21.6.1
Operating system
Linux
Apify platform
- [ ] Tick me if you encountered this issue on the Apify platform
I have tested this on the next release
No response
Other context
No response
I know that the reproduction example is a very bad use of the method. I made it this way to make the code crash deterministically.
Same here, not sure where to look for the problem... it seems like a missing try/catch, an improperly handled rejection, or a missing error listener somewhere in the library code.
Hey @janbuchar, why is this a bad use of the method? I'm running into a similar error.
It calls enqueueLinks with the complete set of URLs once for every URL in that set. This is completely unnecessary and slow. The correct approach would be to call enqueueLinks once, with a selector.
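For reference, here is a minimal sketch of that recommended pattern, assuming the same demo site as the repro; crawlee's context-aware enqueueLinks extracts and deduplicates the matching links in a single call, so no Promise.all is needed:

```javascript
// Sketch: one enqueueLinks call with a selector, instead of
// one call per discovered URL as in the crashing repro above.
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, enqueueLinks }) => {
        console.log(`Processing ${request.loadedUrl}`);
        // A single call; crawlee resolves hrefs matched by the
        // selector and enqueues them, skipping duplicates.
        await enqueueLinks({
            selector: 'a.collection-block-item, a.pagination__next, .product-item > a',
        });
    },
    maxConcurrency: 1,
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']);
```

This keeps the request queue's lock handling on a single code path per page, which is also the usage the docs show.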
Even though it's bad usage, it uncovered a bug that's worth fixing, which is why I put the sample here.