puppeteer-cluster
Using blocker extensions in each browser instance
Hej,
I see that this repo is no longer maintained, but maybe someone has an idea about my issue:
I want to crawl URLs from a list while using a specific extension (a custom image blocker). I followed the example on crawling the Alexa list. For testing purposes I set up a list of 4 URLs and defined the cluster with the following options:
puppeteerOptions: {
  headless: false,
  args: [
    '--disable-extensions-except=image_blocker/',
    '--load-extension=image_blocker/',
  ],
},
concurrency: Cluster.CONCURRENCY_CONTEXT,
maxConcurrency: 1,
timeout: 60000,
While debugging, I observe the following:
- On the first queue call, a first browser instance is launched with no target URL. This instance has loaded my custom extension.
- After all four jobs are queued, a second browser is launched for the first URL in the list; it has not loaded my custom extension.
- After the job is processed, this instance is closed and a new one is launched, also without the extension.
- After the last job finishes, the very first instance (the one with the extension) is closed and the cluster finishes.
 
I tried the same procedure with CONCURRENCY_BROWSER instead, passing the options via the perBrowserOptions array, and also with a vanilla puppeteer instance with the adblocker loaded. In all three cases I saw the same behaviour.
I am trying to figure out why the cluster launches a browser instance before I have even declared and queued any jobs. And, more importantly, why are the extensions used only by this first browser instance?
Maybe someone has an idea?
see perBrowserOptions
it might be what you need
As mentioned in my post, the perBrowserOptions wasn't helpful.
~@z4lem did you ever figure this out? I also cannot load an extension.~
Setting concurrency to CONCURRENCY_PAGE and using perBrowserOptions worked for me.
Thanks for your reply. Unfortunately this doesn't work for me:
Setting concurrency to CONCURRENCY_PAGE results in only one browser instance being run, where each page is opened in its own tab.
For this case it's indeed enough to set the extension using the puppeteerOptions.
You don't need to use perBrowserOptions, because there's only one browser.
In my case, each page must be opened in a separate browser instance, which is done using CONCURRENCY_BROWSER. Each of these instances should use my extension. However, only the first browser instance uses the extension; the following ones do not :(
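For reference, the two option shapes discussed here (one shared puppeteerOptions object versus a perBrowserOptions array) can be sketched as below. This only illustrates how the objects might be built. It assumes that perBrowserOptions expects one launch-options entry per worker, so that its length matches maxConcurrency. The helper names and the LaunchOptionsSketch type are made up for this example, and resolving the extension directory to an absolute path is an extra precaution, not something this thread shows to be necessary. As reported above, passing the flags this way did not by itself make the later browser instances load the extension.

```typescript
import * as path from "node:path";

// Hypothetical, reduced stand-in for puppeteer's launch options type.
interface LaunchOptionsSketch {
  headless: boolean;
  args: string[];
}

// Builds launch options that ask Chrome to load a single unpacked extension.
function extensionLaunchOptions(extensionDir: string): LaunchOptionsSketch {
  // Resolve against the current working directory to get an absolute path.
  const absoluteDir = path.resolve(extensionDir);
  return {
    headless: false, // same setting as in the original post
    args: [
      `--disable-extensions-except=${absoluteDir}`,
      `--load-extension=${absoluteDir}`,
    ],
  };
}

// Builds one independent options object per worker
// (assumption: the array length must equal maxConcurrency).
function buildPerBrowserOptions(
  extensionDir: string,
  maxConcurrency: number
): LaunchOptionsSketch[] {
  return Array.from({ length: maxConcurrency }, () =>
    extensionLaunchOptions(extensionDir)
  );
}
```

With CONCURRENCY_PAGE, the result of extensionLaunchOptions("image_blocker/") would go into puppeteerOptions; with CONCURRENCY_BROWSER, the result of buildPerBrowserOptions("image_blocker/", maxConcurrency) would go into perBrowserOptions.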
"Setting concurrency to CONCURRENCY_PAGE and using perBrowserOptions worked for me."
This concurrency option launches only one browser instance, so how exactly is this working for you then?
Ah apologies, in my use case I was able to modify the code to only use one browser instance.
You can try this concurrency implementation:
import * as puppeteer from "puppeteer";
import { debugGenerator, timeoutExecute } from "puppeteer-cluster/dist/util";
import ConcurrencyImplementation, {
  WorkerInstance,
} from "puppeteer-cluster/dist/concurrency/ConcurrencyImplementation";
const debug = debugGenerator("BrowserConcurrency");
const BROWSER_TIMEOUT = 5000;
export default class Browser extends ConcurrencyImplementation {
  public async init() {}
  public async close() {}
  public async workerInstance(
    perBrowserOptions: puppeteer.LaunchOptions | undefined
  ): Promise<WorkerInstance> {
    const options = perBrowserOptions || this.options;
    let chrome: puppeteer.Browser;
    let page: puppeteer.Page;
    let context: any; // puppeteer typings are old...
    return {
      jobInstance: async () => {
        await timeoutExecute(
          BROWSER_TIMEOUT,
          (async () => {
            chrome = await this.puppeteer.launch(options);
            context = chrome.defaultBrowserContext();
            page = await context.newPage();
          })()
        );
        return {
          resources: {
            page,
          },
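          close: async () => {
            // The default browser context cannot be closed, so close the
            // whole browser; the next job launches a fresh one.
            await timeoutExecute(BROWSER_TIMEOUT, chrome.close());
          },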
        };
      },
      close: async () => {
        await chrome.close();
      },
      repair: async () => {
        debug("Starting repair");
        try {
          // will probably fail, but just in case the repair was not necessary
          await chrome.close();
        } catch (e) {}
        // no relaunch here: the next jobInstance() call launches a fresh browser
      },
    };
  }
}
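The implementation above launches the browser inside jobInstance and opens the page in the default browser context, so each job runs in a freshly launched browser that received the launch options, including the extension flags. The lifecycle can be illustrated without any dependencies. The sketch below uses a fake launcher instead of puppeteer; FakeBrowser, Launcher, createWorker and runJobs are all hypothetical names that exist only for this illustration.

```typescript
// Hypothetical stand-in for a puppeteer Browser; it only tracks open/closed state.
interface FakeBrowser {
  id: number;
  closed: boolean;
  close(): Promise<void>;
}

type Launcher = () => Promise<FakeBrowser>;

// Mirrors the shape of the worker above: launch per job, close per job.
function createWorker(launch: Launcher) {
  let browser: FakeBrowser | undefined;
  return {
    jobInstance: async () => {
      // A fresh browser for every job, launched with the configured options.
      const current = await launch();
      browser = current;
      return {
        resources: { browser: current },
        close: async () => {
          await current.close();
        },
      };
    },
    close: async () => {
      // Guard against a worker that never ran a job or already closed its browser.
      if (browser && !browser.closed) {
        await browser.close();
      }
    },
  };
}

// Runs jobCount jobs sequentially and reports which browsers were used.
async function runJobs(
  jobCount: number
): Promise<{ launchedIds: number[]; stillOpen: number }> {
  const launched: FakeBrowser[] = [];
  const launch: Launcher = async () => {
    const browser: FakeBrowser = {
      id: launched.length + 1,
      closed: false,
      close: async () => {
        browser.closed = true;
      },
    };
    launched.push(browser);
    return browser;
  };

  const worker = createWorker(launch);
  const launchedIds: number[] = [];
  for (let i = 0; i < jobCount; i++) {
    const job = await worker.jobInstance();
    launchedIds.push(job.resources.browser.id);
    await job.close();
  }
  await worker.close();

  return {
    launchedIds,
    stillOpen: launched.filter((b) => !b.closed).length,
  };
}
```

To try the real class, it would presumably be passed as the concurrency option of Cluster.launch in place of one of the Cluster.CONCURRENCY_* constants. This assumes the installed puppeteer-cluster version accepts a custom ConcurrencyImplementation class there.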