
SessionPool generates more sessions than needed and also does not respect the "maxUsageCount" constraint.

Open yellott opened this issue 2 years ago • 8 comments

Which package is this bug report for? If unsure which one to select, leave blank

None

Issue description

  1. Create a basic PlaywrightCrawler/PuppeteerCrawler/HttpCrawler (I've tried all of these, at least).
  2. Set sessionPoolOptions.sessionOptions.maxUsageCount to 1. It can actually be any value; the problem is just easiest to see with 1.
  3. Log the session ID in the default request handler.
  4. Run the crawler with 10 start URLs.
  5. Compare the logged session IDs with the contents of the SDK_SESSION_POOL_STATE.json file.
  6. There will be 15 sessions in the file: the usageCount of 10 of them equals 0, while the usageCount of the other 5 equals 8.
  7. The console logs show that only 5 sessions were in use and that each was used twice.

Repro is here.

Code sample

import { PlaywrightCrawler } from 'crawlee';

function* getUrls() {
    for (let i = 0; i < 10; i++) {
        yield `http://localhost:3000/root?key=${i.toString()}`;
    }
}

const sessionStats: Record<string, { urls: string[]; count: number }> = {};

async function runCrawlerWithStartUrls() {
    const playwrightCrawler = new PlaywrightCrawler({
        sessionPoolOptions: {
            sessionOptions: {
                // Every session should be retired after a single use.
                maxUsageCount: 1,
            },
        },
        // Record which URLs each session actually handled.
        requestHandler: async ({ session, request }) => {
            if (session) {
                const data = sessionStats[session.id] ?? { urls: [], count: 0 };
                data.count += 1;
                data.urls.push(request.url);
                sessionStats[session.id] = data;
            }
        },
    });

    await playwrightCrawler.run(Array.from(getUrls()));

    console.log('Sessions', JSON.stringify(sessionStats, null, 4));
    console.log('Sessions used count', Object.keys(sessionStats).length);
}

await runCrawlerWithStartUrls();
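With maxUsageCount set to 1, one would expect 10 distinct sessions, one per URL. Instead, the output looks roughly like this (illustrative: the actual session IDs and URL groupings will differ, but the counts match what is described above):

Sessions {
    "session_xxxxxxxxxx": {
        "urls": [
            "http://localhost:3000/root?key=0",
            "http://localhost:3000/root?key=5"
        ],
        "count": 2
    },
    ...
}
Sessions used count 5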

And this is one of the sessions in SDK_SESSION_POOL_STATE.json. As you can see, usageCount is greater than maxUsageCount and errorScore is greater than maxErrorScore.

[screenshot of a session entry from SDK_SESSION_POOL_STATE.json]
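For illustration, such an entry has roughly this shape. The field names follow crawlee's Session state; the id, errorScore, and timestamp values here are made up, while usageCount and maxUsageCount reflect the numbers reported above:

{
    "id": "session_xxxxxxxxxx",
    "cookieJar": { "cookies": [] },
    "userData": {},
    "errorScore": 3.5,
    "maxErrorScore": 3,
    "errorScoreDecrement": 0.5,
    "usageCount": 8,
    "maxUsageCount": 1,
    "createdAt": "2023-03-19T18:00:00.000Z",
    "expiresAt": "2023-03-19T18:50:00.000Z"
}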

The approach below at least respects maxUsageCount, but it still generates twice as many sessions as needed.

import { PlaywrightCrawler } from 'crawlee';

function* getUrls() {
    for (let i = 0; i < 10; i++) {
        yield `http://localhost:3000/root?key=${i.toString()}`;
    }
}

const urlGenerator = getUrls();

const sessionStats: Record<string, { urls: string[]; count: number }> = {};

async function runCrawlerWithAddRequests() {
    const playwrightCrawler = new PlaywrightCrawler({
        sessionPoolOptions: {
            sessionOptions: {
                maxUsageCount: 1,
            },
        },

        requestHandler: async ({ session, request, crawler }) => {
            if (session) {
                const data = sessionStats[session.id] ?? { urls: [], count: 0 };
                data.count += 1;
                data.urls.push(request.url);
                sessionStats[session.id] = data;
            }

            // Feed the next URL back into the queue from inside the handler.
            const next = urlGenerator.next();
            if (!next.done) {
                // NB: addRequests() returns a promise; see the comments below about awaiting it.
                crawler.addRequests([next.value]);
            }
        },
    });

    await playwrightCrawler.run(['http://localhost:3000/root']);

    console.log('Sessions', JSON.stringify(sessionStats, null, 4));
    console.log('Sessions used count', Object.keys(sessionStats).length);
}

await runCrawlerWithAddRequests();

Package version

3.3.0

Node.js version

v18.13.0

Operating system

Ubuntu 22.04.1 LTS on WSL 2

Apify platform

  • [ ] Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

3.3.1-beta.10

Other context

No response

yellott · Mar 19 '23 18:03

Thanks for the report. The first test looks a bit weird. Where do the errors even happen? If the page could not load from localhost, your code in the request handler would not run.

metalwarrior665 · Mar 20 '23 10:03

Note that you are not awaiting the crawler.addRequests([next.value]) call; that's another problem in the repro.

B4nan · Mar 20 '23 10:03

Thanks for the report. The first test looks a bit weird. Where do the errors even happen? If the page could not load from localhost, your code in the request handler would not run.

There is a local server in the repo. This can be tested against any other server as well. There are no errors in the debug logs; that is the most interesting part.

yellott · Mar 20 '23 10:03

Note that you are not awaiting the crawler.addRequests([next.value]) call; that's another problem in the repro.

I've tried awaiting it as well, but it does not change the result.

yellott · Mar 20 '23 10:03

@B4nan I've updated the repo to await the crawler.addRequests calls. The data in SDK_SESSION_POOL_STATE.json still differs from the manually collected session usage stats.

yellott · Mar 20 '23 10:03

I can reproduce this. I use maxUsageCount: 1 as a workaround to spread requests across proxies more uniformly, because I don't like the "round robin" strategy used by Crawlee: it only switches a session's proxy once it sees errors, meaning it hammers one proxy until it fails, which is definitely not what I want.

Even though this helps with spreading, I can see in the logs that more than one request is made per session.
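For reference, a minimal sketch of that workaround (the proxy URLs and target URL here are placeholders):

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Placeholder proxy list; substitute real proxy URLs.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://proxy-1:8000', 'http://proxy-2:8000'],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    sessionPoolOptions: {
        sessionOptions: {
            // Retire every session after a single use, so each request
            // gets a fresh session and, with it, a rotated proxy.
            maxUsageCount: 1,
        },
    },
    requestHandler: async ({ request }) => {
        console.log(`Handled ${request.url}`);
    },
});

await crawler.run(['http://localhost:3000/root']);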

zopieux · Apr 07 '24 20:04