browsertrix-behaviors

Use addLink in behaviors to crawl additional pages without scope limitation

Open cmillet2127 opened this issue 1 year ago • 0 comments

I am trying to crawl subpages of a main page, selecting them with an XPath expression.

I can't navigate to the additional pages with window.location.href, because that throws "Execution context was destroyed", so I am trying to use ctx.Lib.addLink instead. After reading the browsertrix-crawler code, it seems that the addLink callback is not set in my case. It also seems that, when addLink is set, it is restricted by the scopeType.

URL to crawl: https://group.bnpparibas/toutes-actualites/communique-de-presse

Behavior used to crawl the additional pages (the first 8 articles):

` class BnpCommuniquesdePresseBehavior {

static id = "BnpCommuniquesdePresse";

static init() {
	return {
		state: { links: 0 },
		opts: {}
	};
}

static isMatch() {
	return window.location.href === "https://group.bnpparibas/toutes-actualites/communique-de-presse";
}

async *run(ctx) {
	const { getState, awaitLoad, sleep, xpathNodes, addLink } = ctx.Lib;
	
	yield getState(ctx, "BnpCommuniquesdePresseBehavior starting...");
	
	const aTags = Array.from(xpathNodes("//main//div//div//div//div//div//ul/li[position() <= 8]/article/a"));

	if (aTags && aTags.length) {
		yield getState(ctx, aTags.length + " hrefs found");
		for (const aTag of aTags) {
			await addLink(aTag.href);
			yield getState(ctx, "Add a link to crawl: " + aTag.href, "links");
		}
	} else {
		yield getState(ctx, "no link found");
	}
	yield getState(ctx, "BnpCommuniquesdePresseBehavior done");
}

} `
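
The link-collection step in the behavior above could be made slightly more defensive. What follows is only a minimal sketch: `collectLinks` and `queueLinks` are hypothetical helpers (not part of browsertrix-behaviors), and `addLink` is simply passed in as a parameter, so the sketch also covers the case described above where the callback is not set.

```javascript
// Hypothetical helper, not part of browsertrix-behaviors.
// Resolves hrefs against a base URL, drops fragments and duplicates,
// and keeps at most `limit` http(s) URLs.
function collectLinks(hrefs, baseUrl, limit = 8) {
  const seen = new Set();
  for (const href of hrefs) {
    if (seen.size >= limit) break;
    let url;
    try {
      url = new URL(href, baseUrl);
    } catch {
      continue; // skip values that cannot be parsed as URLs
    }
    if (url.protocol !== "http:" && url.protocol !== "https:") continue;
    url.hash = ""; // treat /a and /a#top as the same page
    seen.add(url.href);
  }
  return [...seen];
}

// Hypothetical wrapper: reports the problem instead of throwing
// when no addLink callback has been provided.
async function queueLinks(urls, addLink) {
  if (typeof addLink !== "function") {
    return { queued: 0, reason: "addLink is not available" };
  }
  let queued = 0;
  for (const url of urls) {
    await addLink(url);
    queued++;
  }
  return { queued, reason: null };
}
```

Inside `run(ctx)`, this might be used as `queueLinks(collectLinks(aTags.map((a) => a.href), window.location.href), ctx.Lib.addLink)`, with the returned `reason` logged through `getState` so that a missing callback shows up in the behavior log.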

The docker command line:

` docker run -p 6080:6080 -p 9223:9223 -v c:\tmp\crawls\:/crawls/ -v c:\tmp\custom-behaviors\:/custom-behaviors/ -it webrecorder/browsertrix-crawler:latest crawl --url https://group.bnpparibas/toutes-actualites/communique-de-presse --generateWACZ final-to-warc --text --wait-until domcontentloaded --screenshot thumbnail,view,fullPage --scopeType page --customBehaviors /custom-behaviors/ --pageLimit 10 --screencastPort 9223 --profile "/crawls/profiles/group.bnpparibas.tar.gz" --behaviors siteSpecific `
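
To illustrate the suspected scope restriction: the command uses `--scopeType page`, so if links passed to addLink go through the scope check, any article URL other than the seed page would be dropped. The sketch below is a simplified model of that idea only. `isInScope` and its rules are assumptions made for illustration, not browsertrix-crawler's actual code, and the example URLs are placeholders.

```javascript
// Simplified, hypothetical model of a scope check.
// NOT browsertrix-crawler's implementation; the rules are assumptions.
function isInScope(candidate, seed, scopeType) {
  const url = new URL(candidate);
  const seedUrl = new URL(seed);
  url.hash = "";
  seedUrl.hash = "";
  switch (scopeType) {
    case "page":
      // only the seed page itself
      return url.href === seedUrl.href;
    case "prefix": {
      // anything under the seed's directory path
      const dir = seedUrl.pathname.slice(0, seedUrl.pathname.lastIndexOf("/") + 1);
      return url.href.startsWith(seedUrl.origin + dir);
    }
    case "host":
      // anything on the same host
      return url.host === seedUrl.host;
    case "any":
      return true;
    default:
      throw new Error("unknown scope type: " + scopeType);
  }
}

// Placeholder URLs for illustration only.
const seed = "https://example.com/news/press-releases";
const article = "https://example.com/press-release/some-article";

for (const scopeType of ["page", "prefix", "host", "any"]) {
  console.log(scopeType, isInScope(article, seed, scopeType));
}
```

Under this model, the article link is rejected for the "page" and "prefix" types and accepted for "host" and "any", which matches the behavior suspected in this issue when addLink is combined with `--scopeType page`.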

cmillet2127, May 08 '24 13:05