Can't use for loops with multiple cluster queues in puppeteer-cluster?

Open sbr2567 opened this issue 5 years ago • 4 comments

What is the correct way to use for loops inside a cluster queue function?

I have an array of 4656 URLs that takes around 77 minutes to visit one by one with a single puppeteer instance. For each URL visited, the page is evaluated and a for loop collects all elements with a particular class.

I want to cut that time down substantially, so puppeteer-cluster certainly grabbed my interest.

In my attempt to use it, I first prepared the 4656 URLs so that they could be visited by 8 concurrent puppeteer instances. I did this by dividing the 4656 URLs into 8 arrays of 582 URLs each.

const concurrentGroupLength = Math.round(urlArray.length / 8);
const concurrentGroup1 = urlArray.slice(0, concurrentGroupLength);
const concurrentGroup2 = urlArray.slice(concurrentGroupLength, concurrentGroupLength * 2);
const concurrentGroup3 = urlArray.slice(concurrentGroupLength * 2, concurrentGroupLength * 3);
const concurrentGroup4 = urlArray.slice(concurrentGroupLength * 3, concurrentGroupLength * 4);
const concurrentGroup5 = urlArray.slice(concurrentGroupLength * 4, concurrentGroupLength * 5);
const concurrentGroup6 = urlArray.slice(concurrentGroupLength * 5, concurrentGroupLength * 6);
const concurrentGroup7 = urlArray.slice(concurrentGroupLength * 6, concurrentGroupLength * 7);
const concurrentGroup8 = urlArray.slice(concurrentGroupLength * 7, concurrentGroupLength * 8 + 1);
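As a side note, the eight hand-written slices above can be folded into one helper. This is only a sketch: splitIntoGroups is a hypothetical name, not part of puppeteer-cluster, and it uses Math.ceil instead of Math.round. With Math.round and the "+ 1" on the last slice, the split works for 4656 URLs but can drop trailing URLs for other lengths (for example 4659).

```javascript
// Hypothetical helper (not part of puppeteer-cluster): split `items` into
// at most `groupCount` consecutive groups of near-equal size.
function splitIntoGroups(items, groupCount) {
  // Math.ceil guarantees that groupSize * groupCount covers every item.
  const groupSize = Math.ceil(items.length / groupCount);
  const groups = [];
  for (let start = 0; start < items.length; start += groupSize) {
    groups.push(items.slice(start, start + groupSize));
  }
  // Note: fewer than groupCount groups may be returned for short arrays.
  return groups;
}
```

With a 4656-element urlArray, splitIntoGroups(urlArray, 8) gives 8 arrays of 582 URLs each, matching concurrentGroup1 through concurrentGroup8.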

Next I launched puppeteer-cluster, set a maxConcurrency of 8, and queued one task for each of the 8 URL arrays.

const results = [];

(async () => {
        const cluster = await Cluster.launch({
            concurrency: Cluster.CONCURRENCY_CONTEXT,
            maxConcurrency: 8,
            timeout: 960000,
        });

        const getPageData = async ({page, data: urls}) => {
            for (i = 0; i < urls.length; i++) {
                await page.goto(urls[i], {waitUntil: 'networkidle2'});

                let titleArray = await page.evaluate(() => {
                    let pageTitles = [];
                    for (j = 0; j < document.getElementsByClassName('canvas')[0].getElementsByClassName('genre scanme').length; j++) {
                        pageTitles.push(document.getElementsByClassName('canvas')[0].getElementsByClassName('genre scanme')[j].innerText);
                    }
                    return pageTitles;
                });

                results.push({titles: titleArray});
                console.log(results)
            }
        };

        cluster.queue(concurrentGroup1, getPageData);
        cluster.queue(concurrentGroup2, getPageData);
        cluster.queue(concurrentGroup3, getPageData);
        cluster.queue(concurrentGroup4, getPageData);
        cluster.queue(concurrentGroup5, getPageData);
        cluster.queue(concurrentGroup6, getPageData);
        cluster.queue(concurrentGroup7, getPageData);
        cluster.queue(concurrentGroup8, getPageData);

        cluster.on('taskerror', (err, data, willRetry) => {
            if (willRetry) {
                console.warn(`Encountered an error while crawling ${data}. ${err.message}\nThis job will be retried`);
            } else {
                console.error(`Failed to crawl ${data}: ${err.message}`);
            }
        });

        await cluster.idle();
        await cluster.close();
    })();

However, the for loop does not behave correctly when the function runs concurrently. The loop runs past the bound set by i < urls.length (urls.length is always 582), and urls[i] is undefined for every iteration beyond 582. The number of excess iterations happened to equal the number of queued tasks running.

The errors occur because await page.goto(urls[i], {waitUntil: 'networkidle2'}); is then called with urls[i] equal to undefined, so there is no page to visit.
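A likely explanation, offered as an assumption rather than a confirmed diagnosis: for (i = 0; ...) never declares i, so in non-strict code it becomes a single global that every concurrently running task function reads and increments. Each await gives the other tasks a chance to move i, which would let one task index past the end of its own array. The sketch below reproduces that effect without puppeteer. sleep stands in for page.goto, the shared counter is declared explicitly at the top level so the sketch also runs in strict mode, and all names are hypothetical.

```javascript
// Stand-in for an async step such as page.goto (hypothetical helper).
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// One counter shared by every task, mimicking the implicit global
// created by `for (i = 0; ...)` in non-strict code.
let i;

async function sharedCounterTask(urls, visited) {
  for (i = 0; i < urls.length; i++) {
    await sleep(1); // other tasks may change `i` while this one waits
    visited.push(urls[i]); // can read past the end of `urls`
  }
}

// Same loop with a block-scoped counter: each call gets its own `k`.
async function localCounterTask(urls, visited) {
  for (let k = 0; k < urls.length; k++) {
    await sleep(1);
    visited.push(urls[k]);
  }
}

// Run two tasks concurrently over separate arrays and collect what they visit.
async function runDemo(task) {
  const visited = [];
  await Promise.all([
    task(['a0', 'a1', 'a2'], visited),
    task(['b0', 'b1', 'b2'], visited),
  ]);
  return visited;
}
```

Calling runDemo(sharedCounterTask) yields fewer than six entries, one of them undefined, while runDemo(localCounterTask) visits all six URLs. If this is the cause here, declaring the counters with let (for (let i = 0; ...) and for (let j = 0; ...)) should be enough to stop the loop overrunning.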

Please refer to the video I made below illustrating this issue. The logs you see in the video are referring to the console.log(results) you can find in the script.

https://www.youtube.com/watch?v=G0pHypc7SBs&feature=youtu.be

Thank you.

sbr2567, Aug 01 '20 06:08

Did this ever get resolved?

JKelly423, Jun 29 '22 18:06

I decided to split up the tasks, and used a single puppeteer instance to handle everything outside of the for loop.

I reformatted my cluster.task() to exclude a loop, and it works perfectly. I'd recommend using a single instance to scrape links, then evaluate the links uniformly using cluster.task().

Hope this helps, I can provide more info if needed :)

JKelly423, Jun 29 '22 20:06
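The loop-free approach described in the comment above might look roughly like the following. It is a sketch, not JKelly423's actual code: crawlAll is a hypothetical wrapper, the class names are taken from the original post, and cluster is assumed to be a puppeteer-cluster instance created elsewhere. Each URL is queued as its own job, so the task function handles one page and needs no loop counter.

```javascript
// Sketch: one queued job per URL, no loop inside the task function.
// `cluster` is expected to be a puppeteer-cluster Cluster instance, e.g.:
//   const { Cluster } = require('puppeteer-cluster');
//   const cluster = await Cluster.launch({
//     concurrency: Cluster.CONCURRENCY_CONTEXT,
//     maxConcurrency: 8,
//   });
//   const results = await crawlAll(cluster, urlArray);
//   await cluster.close();
async function crawlAll(cluster, urls) {
  const results = [];

  // The task receives a single URL as `data`.
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url, { waitUntil: 'networkidle2' });
    // Runs in the browser: collect the text of the matching elements.
    const titles = await page.evaluate(() => {
      const canvas = document.getElementsByClassName('canvas')[0];
      return Array.from(
        canvas.getElementsByClassName('genre scanme'),
        (el) => el.innerText
      );
    });
    results.push({ url, titles });
  });

  // Queue every URL individually; the cluster schedules them concurrently.
  for (const url of urls) {
    cluster.queue(url);
  }

  await cluster.idle();
  return results;
}
```

Results arrive in completion order rather than queue order, which is why each entry also records its url.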

I got the error "JSHandles can be evaluated only in the context they were created" when running a for loop inside the queue function:

const reviews=await page.$x(list_selector)
for await (review of reviews){
    const [rating] =await review.$x(rating_selector)
    const rating_class=await page.evaluate(el => el.className, rating)
}

But if I remove rating_class and use await page.evaluate(() => {document.querySelector('div')}) instead, the script works fine.

maxwill-max, Aug 27 '22 13:08

It triggers at random within the first few pages.

maxwill-max, Aug 27 '22 13:08