
Proxy changes for same session

Open harm-matthias-harms opened this issue 1 year ago • 1 comments

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/browser (BrowserCrawler)

Issue description

According to the documentation, proxies and sessions are bound together so that a session does not get blocked for suddenly appearing with a different IP address. The documentation gives an example similar to my setup:

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // ...
    useSessionPool: true,
    sessionPoolOptions: {
        sessionOptions: {
            maxUsageCount: 7,
            // ...
        },
    },
    // proxyList() returns a list of 250 proxies: same IP, different ports
    proxyConfiguration: new ProxyConfiguration({ proxyUrls: proxyList() }),
    // ...
});

But if I log the session and the proxy info in the router, the session's ID does not match the proxy's session ID:

log.info(session?.id)
log.info(proxyInfo?.port)
log.info(proxyInfo?.sessionId)

This outputs something like:

INFO  PlaywrightCrawler: session_AlZoomLhQU
INFO  PlaywrightCrawler: 10209
INFO  PlaywrightCrawler: session_Dnha2MhDeX
...
INFO  PlaywrightCrawler: session_AlZoomLhQU
INFO  PlaywrightCrawler: 10208
INFO  PlaywrightCrawler: session_6jOviCJSHt
...

The problem seems to be that the proxy is loaded before the page context is enhanced, and enhancing the page context can change the session.

A workaround that works locally is to load the proxy only after the session has been loaded again. This can be done by moving the code block below the last mentioned line.
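To illustrate the ordering problem, here is a toy model. It is not Crawlee's actual code: `Ctx`, `loadProxy`, `enhanceContext` and `run` are hypothetical names I made up for this sketch, and the only assumption is that the proxy gets bound to whichever session is current at the moment it is loaded.

```typescript
// Toy model only: none of these names come from Crawlee.
interface Ctx {
    sessionId: string;
    proxySessionId?: string;
}

// Stand-in for "load the proxy": binds the proxy to the session that is current right now.
const loadProxy = (ctx: Ctx): void => {
    ctx.proxySessionId = ctx.sessionId;
};

// Stand-in for "enhance the page context": the session may be replaced here.
const enhanceContext = (ctx: Ctx, newSessionId: string): void => {
    ctx.sessionId = newSessionId;
};

function run(order: 'proxy-first' | 'session-first'): Ctx {
    const ctx: Ctx = { sessionId: 'session_A' };
    if (order === 'proxy-first') {
        loadProxy(ctx); // proxy is bound to session_A
        enhanceContext(ctx, 'session_B'); // session changes afterwards
    } else {
        enhanceContext(ctx, 'session_B'); // session is settled first
        loadProxy(ctx); // proxy is bound to session_B
    }
    return ctx;
}

const proxyFirst = run('proxy-first');
const sessionFirst = run('session-first');
console.log(proxyFirst.sessionId === proxyFirst.proxySessionId); // false
console.log(sessionFirst.sessionId === sessionFirst.proxySessionId); // true
```

In this model only the order of the two steps differs, which is why moving the proxy loading after the session handling is enough to make the IDs line up.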

After the change the output looks like this:

INFO  PlaywrightCrawler: session_zBwqeH4a7N
INFO  PlaywrightCrawler: 10204
INFO  PlaywrightCrawler: session_zBwqeH4a7N
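For anyone who wants to check their own crawler, the comparison I did by eye can be written as a small helper. This is only a sketch: `SessionLike`, `ProxyInfoLike` and `isPaired` are hypothetical names, and the interfaces model just the fields I logged above (`session.id`, `proxyInfo.port`, `proxyInfo.sessionId`).

```typescript
// Hypothetical shapes: only the fields logged above are modeled.
interface SessionLike {
    id: string;
}
interface ProxyInfoLike {
    sessionId?: string;
    port?: number | string;
}

// Returns true only when the proxy was picked for the same session that handles the request.
function isPaired(session?: SessionLike, proxyInfo?: ProxyInfoLike): boolean {
    return session !== undefined
        && proxyInfo?.sessionId !== undefined
        && session.id === proxyInfo.sessionId;
}

// Values copied from the two log outputs in this issue.
const beforeChange = isPaired(
    { id: 'session_AlZoomLhQU' },
    { sessionId: 'session_Dnha2MhDeX', port: 10209 },
);
const afterChange = isPaired(
    { id: 'session_zBwqeH4a7N' },
    { sessionId: 'session_zBwqeH4a7N', port: 10204 },
);
console.log(beforeChange, afterChange); // false true
```

Inside a request handler this could be called as `isPaired(session, proxyInfo)` and logged as a warning whenever it returns false.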

I'm sorry for not providing a PR for this: I don't know whether the change has other implications, and it's not easy for me to add an adequate test quickly.

Related to https://discord.com/channels/801163717915574323/1243449005820874763

Code sample

No response

Package version

latest

Node.js version

20

Operating system

macOS

Apify platform

  • [ ] Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

harm-matthias-harms avatar May 27 '24 09:05 harm-matthias-harms

Thank you @harm-matthias-harms for bringing this up.

Indeed, there is an issue with the way we're handling sessions in the browser crawlers. A running browser instance can be reused for multiple requests, but for technical reasons it always has only one proxy URL / session tied to it.

We'll try to straighten this up in upcoming patches - in the meantime, you can get the expected behavior by switching the launchContext.useIncognitoPages crawler constructor parameter to true. Note that this tells Crawlee to use a new browser instance for each request, so it can worsen the performance of your crawlers. The actual numbers depend on your use case though.

const crawler = new PlaywrightCrawler({
    launchContext: {
        useIncognitoPages: true, // Use one browser per request, fixes the session pairing issues
    },
    requestHandler: async ({ enqueueLinks, session, proxyInfo }) => {
        // ...
    },
});
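The trade-off can be pictured with a toy model. This is a rough sketch and not how Crawlee's browser pool is implemented: `Browser`, `makePool` and `countMismatches` are hypothetical names, and the only assumption is the one stated above, that a browser keeps the proxy / session it was launched with.

```typescript
// Toy model only: hypothetical names, not Crawlee's browser pool.
interface Browser {
    proxySessionId: string;
}

// Returns a function that hands out the browser serving a request made with the given session.
function makePool(onePerRequest: boolean): (sessionId: string) => Browser {
    let shared: Browser | undefined;
    return (sessionId: string): Browser => {
        if (onePerRequest) {
            // A fresh browser is launched with the proxy of the current session.
            return { proxySessionId: sessionId };
        }
        // A reused browser keeps the proxy it was launched with.
        if (shared === undefined) {
            shared = { proxySessionId: sessionId };
        }
        return shared;
    };
}

const sessionIds = ['session_A', 'session_B', 'session_C'];

// Counts requests whose session differs from the session tied to the browser's proxy.
function countMismatches(browserFor: (sessionId: string) => Browser): number {
    return sessionIds.filter((id) => browserFor(id).proxySessionId !== id).length;
}

const reusedMismatches = countMismatches(makePool(false));
const perRequestMismatches = countMismatches(makePool(true));
console.log(reusedMismatches, perRequestMismatches); // 2 0
```

The mismatches disappear in the one-browser-per-request case, at the price of launching a browser for every request, which is where the performance cost comes from.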

barjin avatar Jun 12 '24 12:06 barjin

Apologies for a ping on such an ancient ticket - I just want to mention we still keep track of this issue. Making this part of Crawlee right is one of our main goals for the upcoming 4.0 release.

I'll close this ticket in favor of #2310, which is about the same issue; I provide more context in the comments there.

Thank you! (and sorry for the ping again :))

barjin avatar Apr 03 '25 09:04 barjin