Proxy changes for same session
Which package is this bug report for? If unsure which one to select, leave blank
@crawlee/browser (BrowserCrawler)
Issue description
According to the documentation, proxies and sessions are bound together so that a session is never reused with a different IP address, which could get it blocked. The documentation gives a similar example:
new PlaywrightCrawler({
    // ...
    useSessionPool: true,
    sessionPoolOptions: {
        sessionOptions: {
            maxUsageCount: 7,
            // ...
        },
    },
    // proxyList() returns 250 proxies: same IP, different ports
    proxyConfiguration: new ProxyConfiguration({ proxyUrls: proxyList() }),
    // ...
})
But if I check the proxy and session in the router, the session ID does not match the proxy's session ID:
log.info(session?.id)
log.info(proxyInfo?.port)
log.info(proxyInfo?.sessionId)
This outputs something like:
INFO PlaywrightCrawler: session_AlZoomLhQU
INFO PlaywrightCrawler: 10209
INFO PlaywrightCrawler: session_Dnha2MhDeX
....
INFO PlaywrightCrawler: session_AlZoomLhQU
INFO PlaywrightCrawler: 10208
INFO PlaywrightCrawler: session_6jOviCJSHt
...
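The manual comparison above can be wrapped in a small helper that flags a mismatch automatically. This is only a sketch: `SessionLike`, `ProxyInfoLike`, and `isProxyBoundToSession` are hypothetical names that model just the fields logged above, not Crawlee's own types.

```typescript
// Hypothetical minimal shapes: only the fields logged above are modeled.
interface SessionLike {
    id: string;
}

interface ProxyInfoLike {
    sessionId?: string;
    port?: number | string;
}

// Returns true only when the proxy was issued for the session handling the request.
function isProxyBoundToSession(session?: SessionLike, proxyInfo?: ProxyInfoLike): boolean {
    if (!session || !proxyInfo || !proxyInfo.sessionId) return false;
    return session.id === proxyInfo.sessionId;
}

// Values taken from the first log excerpt above.
const mismatched = isProxyBoundToSession(
    { id: 'session_AlZoomLhQU' },
    { sessionId: 'session_Dnha2MhDeX', port: 10209 },
);

// What a correct pairing would look like.
const matched = isProxyBoundToSession(
    { id: 'session_AlZoomLhQU' },
    { sessionId: 'session_AlZoomLhQU', port: 10209 },
);

console.log(mismatched, matched);
```

Inside a request handler, this could be called with the `session` and `proxyInfo` values from the crawling context to log a warning whenever the pairing is broken.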
The problem seems to be that the proxy is loaded before the page context is enhanced, and enhancing the context can change the session.
A workaround that works locally is to load the proxy only after the session has been loaded again. This can be done by moving the code block below the last mentioned line.
After the change the output looks like this:
INFO PlaywrightCrawler: session_zBwqeH4a7N
INFO PlaywrightCrawler: 10204
INFO PlaywrightCrawler: session_zBwqeH4a7N
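The ordering problem described above can be illustrated with a toy model. Nothing here is Crawlee code: `ToyContext`, `resolveProxyFirst`, and `resolveProxyLast` are hypothetical names, and the "enhance" step is just a stand-in for anything that may swap the session on the context.

```typescript
// Toy model only: none of these names are Crawlee internals.
interface ToyContext {
    sessionId: string;
}

// Stand-in for a step that may replace the session on the context.
type Enhance = (ctx: ToyContext) => void;

interface Pairing {
    sessionId: string;
    proxySessionId: string;
}

// Proxy resolved BEFORE the enhancement step: it can stay tied to a stale session.
function resolveProxyFirst(ctx: ToyContext, enhance: Enhance): Pairing {
    const proxySessionId = ctx.sessionId;
    enhance(ctx);
    return { sessionId: ctx.sessionId, proxySessionId };
}

// Proxy resolved AFTER the enhancement step: it follows the final session.
function resolveProxyLast(ctx: ToyContext, enhance: Enhance): Pairing {
    enhance(ctx);
    return { sessionId: ctx.sessionId, proxySessionId: ctx.sessionId };
}

const swapSession: Enhance = (ctx) => {
    ctx.sessionId = 'session_B';
};

const before = resolveProxyFirst({ sessionId: 'session_A' }, swapSession);
const after = resolveProxyLast({ sessionId: 'session_A' }, swapSession);

console.log(before.sessionId === before.proxySessionId); // false
console.log(after.sessionId === after.proxySessionId); // true
```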
I'm sorry for not providing a PR for this; I don't know whether the change has other implications, and it's not easy for me to add an adequate test quickly.
Related to https://discord.com/channels/801163717915574323/1243449005820874763
Code sample
No response
Package version
latest
Node.js version
20
Operating system
macOS
Apify platform
- [ ] Tick me if you encountered this issue on the Apify platform
I have tested this on the next release
No response
Other context
No response
Thank you @harm-matthias-harms for bringing this up.
Indeed, there is an issue with the way we're handling sessions in the browser crawlers. A running browser instance can be reused for multiple requests, but for technical reasons it always has only one proxy URL / session tied to it.
We'll try to straighten this up in upcoming patches - in the meantime, you can get the expected behavior by switching the launchContext.useIncognitoPages crawler constructor parameter to true. Note that this tells Crawlee to use a new browser instance for each request, so it can worsen the performance of your crawlers. The actual numbers depend on your use case though.
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        useIncognitoPages: true, // Use one browser per request, fixes the session pairing issues
    },
    requestHandler: async ({ enqueueLinks, session, proxyInfo }) => {
        // ...
    },
});
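To make the explanation above concrete, here is a toy model of why browser reuse breaks the pairing and why one browser per request restores it. It is a sketch under simplified assumptions: `ToyBrowser`, `launch`, and `crawl` are hypothetical and do not reflect how Crawlee manages browsers internally.

```typescript
// Toy model only: these names are hypothetical, not Crawlee APIs.
interface ToyBrowser {
    // The proxy session is fixed at launch and never changes afterwards.
    proxySessionId: string;
}

interface RequestPairing {
    sessionId: string;
    proxySessionId: string;
}

function launch(proxySessionId: string): ToyBrowser {
    return { proxySessionId };
}

function crawl(sessionIds: string[], oneBrowserPerRequest: boolean): RequestPairing[] {
    let shared: ToyBrowser | undefined;
    return sessionIds.map((sessionId) => {
        let browser: ToyBrowser;
        if (oneBrowserPerRequest) {
            // Fresh browser, launched with the proxy of the current session.
            browser = launch(sessionId);
        } else {
            // Reuse the first browser, which keeps the proxy it was launched with.
            if (!shared) shared = launch(sessionId);
            browser = shared;
        }
        return { sessionId, proxySessionId: browser.proxySessionId };
    });
}

function countMismatches(pairs: RequestPairing[]): number {
    return pairs.filter((p) => p.sessionId !== p.proxySessionId).length;
}

const sessionIds = ['s1', 's2', 's3'];
console.log(countMismatches(crawl(sessionIds, false))); // 2
console.log(countMismatches(crawl(sessionIds, true))); // 0
```

The trade-off mirrors the one mentioned above: the per-request variant never mismatches, but it pays the launch cost on every request.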
Apologies for a ping on such an ancient ticket - I just want to mention we still keep track of this issue. Making this part of Crawlee right is one of our main goals for the upcoming 4.0 release.
I'll close this ticket in favor of #2310, which is about the same issue; I provide more context in the comments there.
Thank you! (and sorry for the ping again :))