`useIncognitoPages` doesn't rotate fingerprints

Open: mnmkng opened this issue 1 year ago • 1 comment

Which package is this bug report for? If unsure which one to select, leave blank

None

Issue description

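If you run the code with incognito pages, you will always get the same browser fingerprint (the same user agent on every request). If you comment out the incognito pages option and uncomment the one-page-per-browser option (maxOpenPagesPerBrowser: 1), you will get different user agents.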

Code sample

import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

const proxyConfiguration = await Actor.createProxyConfiguration();

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    browserPoolOptions: {
        useFingerprints: true,
        // maxOpenPagesPerBrowser: 1,
    },
    launchContext: {
        useIncognitoPages: true,
    },
    preNavigationHooks: [
        async ({ page }) => {
            // Print the headers of the first request so the user agent can be compared across pages.
            page.once('request', async (req) => {
                try {
                    const headers = await req.allHeaders();
                    console.dir(headers);
                } catch (e) {
                    console.log('req inspection failed');
                }
            });
        },
    ],
    requestHandler: async ({ page, log }) => {
        // Log the response body (the outgoing IP address reported by ipify).
        const text = await page.innerText('pre');
        log.info(text);
    },
});


await crawler.run([
    'https://api.ipify.org?format=json&a',
    'https://api.ipify.org?format=json&b',
    'https://api.ipify.org?format=json&c',
    'https://api.ipify.org?format=json&d',
    'https://api.ipify.org?format=json&e',
    'https://api.ipify.org?format=json&f',
]);

Package version

3.7.2

Node.js version

18

Operating system

MacOS

Apify platform

  • [ ] Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

mnmkng, Jan 30 '24 12:01

Seems like a sign of a much larger underlying issue:

New sessions / fingerprints / proxyUrls are generated only on a browser launch.

The following snippet doesn't rotate the fingerprints correctly: all requests are done with a single session. This is because useIncognitoPages was written with Playwright contexts in mind. We relied on the "newPage() creates a separate environment" invariant, so all the pages/contexts are launched in one browser, and that browser never gets a new fingerprint.

sessionPoolOptions: {
    sessionOptions: {
        maxUsageCount: 1,
    },
},
launchContext: {
   useIncognitoPages: true,
},

The following snippet rotates the fingerprints correctly:

sessionPoolOptions: {
    sessionOptions: {
        maxUsageCount: 1,
    },
},
launchContext: {
   useIncognitoPages: false,
},

This works because an "expired" session throws away the whole browser instance, so the next page has to launch a whole new browser (see the parallel with maxOpenPagesPerBrowser, which does the same thing). This is crazy expensive, though: launching and closing a context 100 times in one browser takes ~3.9 seconds, while launching and closing a browser 100 times takes 40 seconds.
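
To make the behaviour described above concrete, here is a minimal toy model in plain JavaScript. It is not crawlee's or browser-pool's actual code: launchBrowser() and crawl() are hypothetical helpers that only assume what the comment states, namely that a fingerprint is picked once per browser launch and shared by every page opened in that browser. The last lines just redo the arithmetic on the timings quoted above.

```javascript
// Toy model only (NOT crawlee's implementation): a fingerprint is chosen
// once per browser launch, and every page in that browser inherits it.
let launches = 0;

// Hypothetical helper: each launched browser gets one distinct fingerprint.
function launchBrowser() {
    launches += 1;
    return { fingerprint: `fingerprint-${launches}` };
}

// Returns the fingerprint seen by each of `pageCount` pages.
// `pagesPerBrowser` mimics maxOpenPagesPerBrowser (or a retired session):
// once a browser has served that many pages, a new browser is launched.
function crawl(pageCount, pagesPerBrowser) {
    const seen = [];
    let browser = null;
    let served = 0;
    for (let i = 0; i < pageCount; i++) {
        if (browser === null || served >= pagesPerBrowser) {
            browser = launchBrowser();
            served = 0;
        }
        seen.push(browser.fingerprint);
        served += 1;
    }
    return seen;
}

// Incognito pages: six pages, all contexts of the same browser.
const shared = crawl(6, Infinity);
// One page per browser: every page forces a fresh launch.
const rotated = crawl(6, 1);

console.log(new Set(shared).size);  // 1 distinct fingerprint
console.log(new Set(rotated).size); // 6 distinct fingerprints

// Per-iteration cost from the timings quoted above (100 iterations each).
const msPerContext = 3900 / 100;  // 39 ms per context
const msPerBrowser = 40000 / 100; // 400 ms per browser
console.log((msPerBrowser / msPerContext).toFixed(1)); // "10.3"
```

In this model, rotation only happens when a new browser is launched, which is roughly ten times more expensive per page than opening a context. Rotating per context instead would need the fingerprint to be assigned when the context is created, which is the part that is missing today.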

The entire browser-pool and session rotation logic is quite convoluted and worth a total rewrite.

barjin, Feb 14 '24 16:02