SessionPool generates more sessions than needed and also does not respect the "maxUsageCount" constraint.
Which package is this bug report for? If unsure which one to select, leave blank
None
Issue description
- Create a basic PlaywrightCrawler/PuppeteerCrawler/HttpCrawler (I have tried all of these).
- Set sessionPoolOptions.sessionOptions.maxUsageCount to 1. It can actually be any value; 1 just makes the problem easier to see.
- Log sessionId in the default request handler.
- Run the crawler with 10 start URLs.
- Compare the sessionId logs with the content of the SDK_SESSION_POOL_STATE.json file.
- There will be 15 sessions in the file: the usageCount of 10 of them is 0, while the usageCount of the other 5 sessions is 8.
- In the console it is logged that 5 sessions were in use and each was used twice.
Repro is here.
Code sample
import { PlaywrightCrawler } from 'crawlee';

function* getUrls() {
    for (let i = 0; i < 10; i++) {
        yield `http://localhost:3000/root?key=${i.toString()}`;
    }
}

const sessionStats: Record<string, { urls: string[]; count: number }> = {};

async function runCrawlerWithStartUrls() {
    const playwrightCrawler = new PlaywrightCrawler({
        sessionPoolOptions: {
            sessionOptions: {
                maxUsageCount: 1,
            },
        },
        requestHandler: async ({ session, request }) => {
            if (session) {
                const data = sessionStats[session.id] ?? { urls: [], count: 0 };
                data.count += 1;
                data.urls.push(request.url);
                sessionStats[session.id] = data;
            }
        },
    });

    await playwrightCrawler.run(Array.from(getUrls()));

    console.log('Sessions', JSON.stringify(sessionStats, null, 4));
    console.log('Sessions used count', Object.keys(sessionStats).length);
}

await runCrawlerWithStartUrls();
And this is one of the sessions in SDK_SESSION_POOL_STATE.json. As you can see, usageCount is greater than maxUsageCount and errorScore is greater than maxErrorScore.

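To make the comparison easier to reproduce, a small script along these lines can dump the persisted pool state next to the stats collected in the request handler. This is only a sketch: it assumes the default local storage location (./storage/key_value_stores/default/SDK_SESSION_POOL_STATE.json, which changes if CRAWLEE_STORAGE_DIR is set) and that the persisted state exposes a sessions array with id, usageCount, maxUsageCount and errorScore fields, as in the file shown above.

import { readFile } from 'node:fs/promises';

// Assumed default local storage path; adjust if CRAWLEE_STORAGE_DIR is configured.
const statePath = './storage/key_value_stores/default/SDK_SESSION_POOL_STATE.json';

const state = JSON.parse(await readFile(statePath, 'utf-8'));

// Print each persisted session and flag the ones that exceed their own maxUsageCount.
for (const session of state.sessions) {
    const { id, usageCount, maxUsageCount, errorScore } = session;
    const flag = usageCount > maxUsageCount ? ' <-- exceeds maxUsageCount' : '';
    console.log(`${id}: usageCount=${usageCount}, maxUsageCount=${maxUsageCount}, errorScore=${errorScore}${flag}`);
}
console.log('Total sessions persisted:', state.sessions.length);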
This way it at least respects maxUsageCount, but it still generates twice as many sessions as needed.
import { PlaywrightCrawler } from 'crawlee';

function* getUrls() {
    for (let i = 0; i < 10; i++) {
        yield `http://localhost:3000/root?key=${i.toString()}`;
    }
}

const urlGenerator = getUrls();
const sessionStats: Record<string, { urls: string[]; count: number }> = {};

async function runCrawlerWithAddRequests() {
    const playwrightCrawler = new PlaywrightCrawler({
        sessionPoolOptions: {
            sessionOptions: {
                maxUsageCount: 1,
            },
        },
        requestHandler: async ({ session, request, crawler }) => {
            if (session) {
                const data = sessionStats[session.id] ?? { urls: [], count: 0 };
                data.count += 1;
                data.urls.push(request.url);
                sessionStats[session.id] = data;
            }
            const next = urlGenerator.next();
            if (!next.done) {
                crawler.addRequests([next.value]);
            }
        },
    });

    await playwrightCrawler.run(['http://localhost:3000/root']);

    console.log('Sessions', JSON.stringify(sessionStats, null, 4));
    console.log('Sessions used count', Object.keys(sessionStats).length);
}

await runCrawlerWithAddRequests();
Package version
3.3.0
Node.js version
v18.13.0
Operating system
Ubuntu 22.04.1 LTS on WSL 2
Apify platform
- [ ] Tick me if you encountered this issue on the Apify platform
I have tested this on the next release
3.3.1-beta.10
Other context
No response
Thanks for the report. The first test looks a bit weird. Where do the errors even happen? If the page did not load from localhost, your code in the request handler would not run.
Note that you are not awaiting the crawler.addRequests([next.value]) call; that's another problem in the repro.
Thanks for the report. The first test looks a bit weird. Where do the errors even happen? If the page did not load from localhost, your code in the request handler would not run.
There is a local server in the repo :) This can be tested against any other target as well. There are no errors in the debug logs, and that is the most interesting part.
Note that you are not awaiting the crawler.addRequests([next.value]) call; that's another problem in the repro.
I've tried awaiting it as well, but this does not change the result.
@B4nan I've updated the repo to await the crawler.addRequests calls. The data in SDK_SESSION_POOL_STATE.json still differs from the manually collected session usage stats.
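For reference, the only change in the updated repro is awaiting the enqueue call; a sketch of the relevant part of the request handler (urlGenerator being the generator from the second code sample above):

const next = urlGenerator.next();
if (!next.done) {
    // Awaiting makes sure the request is enqueued before the handler resolves,
    // but the extra sessions still show up in SDK_SESSION_POOL_STATE.json.
    await crawler.addRequests([next.value]);
}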
I can reproduce this. I use maxUsageCount: 1 as a workaround to spread the requests across proxies more uniformly, because I don't like the "round robin" strategy used by Crawlee: it only switches the proxy for a session once it sees errors, meaning it hammers one proxy until it fails, which is definitely not what I want.
Even though this helps with spreading, I can still see more than one request being made per session in the logs.
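For context, a minimal sketch of that workaround, assuming a static proxy list supplied through ProxyConfiguration (the proxy URLs and target URL here are placeholders):

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Placeholder proxy list; in practice this comes from your own proxy provider.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://proxy-1:8000', 'http://proxy-2:8000'],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    sessionPoolOptions: {
        sessionOptions: {
            // Retire each session after a single request so every request
            // gets a fresh session, and therefore a newly picked proxy.
            maxUsageCount: 1,
        },
    },
    requestHandler: async ({ request, proxyInfo }) => {
        console.log(request.url, proxyInfo?.url);
    },
});

await crawler.run(['http://localhost:3000/root']);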