puppeteer-cluster

Using blocker extensions in each browser instance

Open z4lem opened this issue 4 years ago • 7 comments

Hi,

I see that this repo is no longer maintained, but maybe someone has an idea about my issue:

I want to crawl URLs from a list while using a specific extension (a custom image blocker). I followed the example on crawling the Alexa list. For testing purposes I set up a list with 4 URLs and defined the cluster with the following options:

puppeteerOptions: {
  headless: false,
  args: [
    '--disable-extensions-except=image_blocker/',
    '--load-extension=image_blocker/',
  ],
},
concurrency: Cluster.CONCURRENCY_CONTEXT,
maxConcurrency: 1,
timeout: 60000,
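For reference, the options above could be assembled as in the following minimal sketch. `extensionArgs` and `ClusterOptionsSketch` are hypothetical names introduced here; the interface is a local stand-in rather than the library's own type, and the `concurrency` field is left out because its value (`Cluster.CONCURRENCY_CONTEXT`) comes from the puppeteer-cluster import.

```typescript
// Hypothetical helper: builds the two Chrome flags used in this issue
// to load a single unpacked extension from `extensionDir`.
function extensionArgs(extensionDir: string): string[] {
  return [
    `--disable-extensions-except=${extensionDir}`,
    `--load-extension=${extensionDir}`,
  ];
}

// Local stand-in for the subset of cluster options discussed here.
// This is NOT the type exported by puppeteer-cluster.
interface ClusterOptionsSketch {
  puppeteerOptions: { headless: boolean; args: string[] };
  maxConcurrency: number;
  timeout: number;
}

const clusterOptions: ClusterOptionsSketch = {
  puppeteerOptions: {
    headless: false,
    args: extensionArgs("image_blocker/"),
  },
  maxConcurrency: 1,
  timeout: 60000,
};

console.log(clusterOptions.puppeteerOptions.args);
```

Keeping the two flags in one helper makes sure both always point at the same extension directory.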

While debugging I observe the following situation:

  1. On the first queue call, the first browser instance is launched with no target URL. I can see that this browser instance has indeed loaded my custom extension.
  2. After all four jobs are queued, a second browser is launched, targeting the first URL from the list. This one has not loaded my custom extension.
  3. After the job is processed, this instance is closed and a new one is launched, also without the extension.
  4. After the last job is finished, the very first instance (using the extension) is closed and the cluster finishes.

I also tried this procedure with CONCURRENCY_BROWSER instead, with the options passed through the perBrowserOptions array, and with a vanilla puppeteer instance with the adblocker loaded. In all three cases I saw the same behaviour.

I am trying to figure out why the cluster launches a browser instance before I have even declared and queued any jobs. More importantly, why is the extension used only by this first browser instance?

Maybe someone has an idea?

z4lem avatar Jun 03 '21 13:06 z4lem

see perBrowserOptions

it might be what you need

LordRampantHump avatar Jul 23 '21 17:07 LordRampantHump

> see perBrowserOptions
>
> it might be what you need

As mentioned in my post, the perBrowserOptions wasn't helpful.

z4lem avatar Jul 27 '21 14:07 z4lem

~@z4lem did you ever figure this out? I also cannot load an extension.~

Setting concurrency to CONCURRENCY_PAGE and using perBrowserOptions worked for me.

squirrelsquirrel78 avatar Nov 18 '21 20:11 squirrelsquirrel78

> ~@z4lem did you ever figure this out? I also cannot load an extension.~
>
> Setting concurrency to CONCURRENCY_PAGE and using perBrowserOptions worked for me.

Thanks for your reply. Unfortunately this doesn't work for me: setting concurrency to CONCURRENCY_PAGE results in only one browser instance being run, with each page opened in its own tab. In that case it's indeed enough to set the extension using puppeteerOptions. You don't need perBrowserOptions, because there's only one browser.

In my case, each page must be opened in a separate browser instance, which is done using CONCURRENCY_BROWSER, and each instance should use my extension. However, in this case only the first browser instance uses the extension; the following ones do not :(

z4lem avatar Dec 06 '21 16:12 z4lem

> Setting concurrency to CONCURRENCY_PAGE and using perBrowserOptions worked for me.

This concurrency option launches only one browser instance, so how exactly is this working for you then?

z4lem avatar Jan 14 '22 13:01 z4lem

Ah apologies, in my use case I was able to modify the code to only use one browser instance.

squirrelsquirrel78 avatar Jan 18 '22 21:01 squirrelsquirrel78

You can try this concurrency implementation:

import * as puppeteer from "puppeteer";

import { debugGenerator, timeoutExecute } from "puppeteer-cluster/dist/util";
import ConcurrencyImplementation, {
  WorkerInstance,
} from "puppeteer-cluster/dist/concurrency/ConcurrencyImplementation";
const debug = debugGenerator("BrowserConcurrency");

const BROWSER_TIMEOUT = 5000;

export default class Browser extends ConcurrencyImplementation {
  public async init() {}
  public async close() {}

  public async workerInstance(
    perBrowserOptions: puppeteer.LaunchOptions | undefined
  ): Promise<WorkerInstance> {
    const options = perBrowserOptions || this.options;
    let chrome: puppeteer.Browser;
    let page: puppeteer.Page;
    let context: any; // puppeteer typings are old...

    return {
      jobInstance: async () => {
        await timeoutExecute(
          BROWSER_TIMEOUT,
          (async () => {
            chrome = await this.puppeteer.launch(options);
            context = chrome.defaultBrowserContext();
            page = await context.newPage();
          })()
        );

        return {
          resources: {
            page,
          },

          close: async () => {
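            // Close the whole browser: the default browser context cannot
            // be closed on its own, and each job launched its own browser.
            await timeoutExecute(BROWSER_TIMEOUT, chrome.close());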
          },
        };
      },

      close: async () => {
        await chrome.close();
      },

      repair: async () => {
        debug("Starting repair");
        try {
          // will probably fail, but just in case the repair was not necessary
          await chrome.close();
        } catch (e) {}

        // just relaunch as there is only one page per browser
      },
    };
  }
}
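To plug the class above into a crawl, it can be passed as the `concurrency` option instead of one of the built-in constants. The sketch below shows one way to wire that up. It uses local stand-in interfaces (`TaskArgs`, `ClusterLike`, `ClusterLauncher`) rather than the library's own typings so that it stays self-contained, and `crawlWithExtension` is a hypothetical helper name.

```typescript
// Minimal local stand-ins for the parts of the puppeteer-cluster API used here.
// They are illustrative types, not the library's own typings.
interface TaskArgs {
  page: { goto(url: string): Promise<unknown> };
  data: string;
}

interface ClusterLike {
  task(fn: (args: TaskArgs) => Promise<void>): Promise<void>;
  queue(url: string): Promise<void>;
  idle(): Promise<void>;
  close(): Promise<void>;
}

interface ClusterLauncher {
  launch(options: Record<string, unknown>): Promise<ClusterLike>;
}

// Hypothetical helper: crawls `urls` using whichever concurrency
// implementation is passed in, with the extension flags from this issue.
async function crawlWithExtension(
  launcher: ClusterLauncher,
  concurrency: unknown,
  extensionDir: string,
  urls: string[],
): Promise<void> {
  const cluster = await launcher.launch({
    concurrency,
    maxConcurrency: 1,
    timeout: 60000,
    puppeteerOptions: {
      headless: false,
      args: [
        `--disable-extensions-except=${extensionDir}`,
        `--load-extension=${extensionDir}`,
      ],
    },
  });

  // Each job only navigates to its URL; real crawling logic would go here.
  await cluster.task(async ({ page, data }) => {
    await page.goto(data);
  });

  for (const url of urls) {
    await cluster.queue(url);
  }

  // Wait for all queued jobs, then shut the cluster down.
  await cluster.idle();
  await cluster.close();
}
```

With the real library this would be called roughly as `crawlWithExtension(Cluster, Browser, "image_blocker/", urls)`, where `Cluster` is imported from puppeteer-cluster and `Browser` is the class above. The stand-in types are deliberately loose, so a cast may be needed to satisfy the compiler.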

Distil62 avatar May 12 '22 15:05 Distil62
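A possible explanation for why the implementation above helps, offered as a toy model rather than the library's actual code: it assumes the built-in browser concurrency launches one browser per worker up front and then opens a fresh incognito context for every job, and that extensions loaded with `--load-extension` are not active in incognito contexts. That would match the behaviour described at the top of this issue: an early window with the extension, followed by job windows without it. All names below (`BrowserModel`, `builtInLikeJobs`, `launchPerJob`) are hypothetical.

```typescript
// Toy model of a browser context: only tracks whether the extension is active.
interface ContextModel {
  incognito: boolean;
  extensionActive: boolean;
}

class BrowserModel {
  private readonly extensionLoaded: boolean;

  constructor(extensionLoaded: boolean) {
    this.extensionLoaded = extensionLoaded;
  }

  // The default context sees whatever extensions the browser was launched with.
  defaultContext(): ContextModel {
    return { incognito: false, extensionActive: this.extensionLoaded };
  }

  // Assumption: extensions are not active in incognito contexts.
  incognitoContext(): ContextModel {
    return { incognito: true, extensionActive: false };
  }
}

// Strategy 1 (assumed built-in behaviour): one browser for the worker,
// a new incognito context per job.
function builtInLikeJobs(jobs: number): boolean[] {
  const browser = new BrowserModel(true);
  return Array.from({ length: jobs }, () => browser.incognitoContext().extensionActive);
}

// Strategy 2 (the implementation above): a new browser per job,
// using that browser's default context.
function launchPerJob(jobs: number): boolean[] {
  return Array.from({ length: jobs }, () => new BrowserModel(true).defaultContext().extensionActive);
}

console.log(builtInLikeJobs(4)); // extension inactive for every job
console.log(launchPerJob(4)); // extension active for every job
```

If those assumptions hold, moving the launch into `jobInstance` and using `defaultBrowserContext()` is what makes the extension available to every job.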