crawlee icon indicating copy to clipboard operation
crawlee copied to clipboard

Multiple calls to enqueueLinks with Promise.all result in a crash

Open janbuchar opened this issue 1 year ago • 7 comments

Which package is this bug report for? If unsure which one to select, leave blank

None

Issue description

Executing the snippet below results in the following error:

Error: Lock file is already being held
    at /home/janbuchar/Projekty/Apify/enqueue-links-crash-repro/node_modules/proper-lockfile/lib/lockfile.js:68:47
    at callback (/home/janbuchar/Projekty/Apify/enqueue-links-crash-repro/node_modules/graceful-fs/polyfills.js:306:20)
    at FSReqCallback.oncomplete (node:fs:205:5)
    at FSReqCallback.callbackTrampoline (node:internal/async_hooks:130:17)

Also, it's strange that the traceback doesn't lead to user code.

Code sample

import { CheerioCrawler, PlaywrightCrawler } from 'crawlee';

const startUrls = ['https://warehouse-theme-metal.myshopify.com/collections'];

const crawler = new CheerioCrawler({
    requestHandler: async ({request, enqueueLinks, parseWithCheerio}) => {
        const $ = await parseWithCheerio()
        console.log(`Processing ${request.loadedUrl}`)
        const urls = $('a.collection-block-item, a.pagination__next, .product-item > a').toArray().map(it => it.attribs['href'])
        await Promise.all(urls?.map(url => enqueueLinks({urls})))
    },
    maxRequestRetries: 0,
    maxConcurrency: 1,
});

await crawler.run(startUrls);

Package version

3.7.3

Node.js version

v21.6.1

Operating system

Linux

Apify platform

  • [ ] Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

janbuchar avatar Feb 07 '24 15:02 janbuchar

I know that the reproduction example is a very bad use of the method. I made it this way to make the code crash deterministically.

janbuchar avatar Feb 07 '24 16:02 janbuchar

Same here, not sure where to look for the problem... seems like a missing try/catch or a improperly handled rejection, or a missing error listener somewhere in the library code.

Somebi avatar Jul 02 '24 17:07 Somebi

Hey @janbuchar, why is this a bad use of the method? I'm running into a similar error.

Jack-Lewis1 avatar Jul 15 '24 17:07 Jack-Lewis1

Hey @janbuchar, why is this a bad use of the method? I'm running into a similar error.

It calls enqueueLinks with the complete set of URLs for each URL in the set. This is completely unnecessary and slow. The correct approach would be calling enqueueLinks once, with a selector.

Even though it's bad usage, it uncovered a bug that's worth fixing, which is why I put the sample here.

janbuchar avatar Jul 16 '24 08:07 janbuchar