browsertrix-behaviors
Use addLink in behaviors to crawl additional pages without scope limitation
I am trying to crawl subpages linked from a main page, selecting the links with an XPath expression.
I can't navigate with window.location.href to reach the additional pages, because that throws "Execution context was destroyed", so I am trying to use ctx.Lib.addLink instead. After reading the browsertrix-crawler code, it seems the addLink callback is not set in my case. It also seems that, when addLink is set, the links it adds are still restricted by the scopeType.
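To make the scope concern concrete, here is a simplified sketch of how I understand a scope check to behave. It is an assumption on my part and not the crawler's actual code: `inScope` is a hypothetical helper, the rules for each scope type are approximations, and the article URL is made up for illustration.

```javascript
// Hypothetical, simplified scope check (NOT browsertrix-crawler's implementation).
// It only illustrates why a link queued from a behavior could be dropped
// when the crawl runs with a narrow scope type.
function inScope(seedUrl, candidateUrl, scopeType) {
  const seed = new URL(seedUrl);
  const candidate = new URL(candidateUrl, seedUrl);

  // Ignore fragments when comparing URLs.
  seed.hash = "";
  candidate.hash = "";

  switch (scopeType) {
    case "page":
      // Only the seed page itself is in scope.
      return candidate.href === seed.href;
    case "prefix": {
      // Anything under the seed's "directory" is in scope.
      const lastSlash = seed.pathname.lastIndexOf("/");
      const dir = seed.origin + seed.pathname.slice(0, lastSlash + 1);
      return candidate.href.startsWith(dir);
    }
    case "host":
      return candidate.host === seed.host;
    case "any":
      return true;
    default:
      return false;
  }
}

const seed = "https://group.bnpparibas/toutes-actualites/communique-de-presse";
// Made-up article URL, for illustration only.
const article = "https://group.bnpparibas/communique-de-presse/example-article";

console.log(inScope(seed, article, "page"));
console.log(inScope(seed, article, "host"));
```

If the real check works roughly like this, a link passed to addLink would be rejected under `--scopeType page` even though the behavior selected it explicitly.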
URL to crawl: https://group.bnpparibas/toutes-actualites/communique-de-presse
Behavior to crawl additional pages (the first 8 articles):
```js
class BnpCommuniquesdePresseBehavior {
  static id = "BnpCommuniquesdePresse";

  static init() {
    return {
      state: { links: 0 },
      opts: {}
    };
  }

  static isMatch() {
    return window.location.href === "https://group.bnpparibas/toutes-actualites/communique-de-presse";
  }

  async *run(ctx) {
    const { getState, awaitLoad, sleep, xpathNodes, addLink } = ctx.Lib;
    yield getState(ctx, "BnpCommuniquesdePresseBehavior starting...");
    const aTags = Array.from(xpathNodes("//main//div//div//div//div//div//ul/li[position() <= 8]/article/a"));
    if (aTags && aTags.length) {
      yield getState(ctx, aTags.length + " hrefs found");
      for await (const aTag of aTags) {
        await addLink(aTag.href);
        yield getState(ctx, "Add a link to crawl: " + aTag.href, "links");
      }
    }
    else
      yield getState(ctx, "no link found");
    yield getState(ctx, "BnpCommuniquesdePresseBehavior done");
  }
}
```
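Since addLink may not be set, one way to tell that case apart from an empty result is to guard the call. The sketch below is only an assumption about how that could look: `queueLinks` is a hypothetical helper, not part of browsertrix-behaviors, and it takes the hrefs and the addLink callback as plain arguments so it can be tried outside the browser.

```javascript
// Hypothetical helper (not part of browsertrix-behaviors): queue each unique
// href through addLink, and report when the callback is missing.
async function queueLinks(hrefs, addLink) {
  if (typeof addLink !== "function") {
    // Callback was never set, so nothing can be queued.
    return { queued: 0, skipped: hrefs.length, callbackMissing: true };
  }

  // Drop empty values and duplicates before queueing.
  const unique = [...new Set(hrefs.filter(Boolean))];
  for (const href of unique) {
    await addLink(href);
  }

  return {
    queued: unique.length,
    skipped: hrefs.length - unique.length,
    callbackMissing: false
  };
}
```

Inside run(), this could be called as `await queueLinks(aTags.map((a) => a.href), addLink)`, with the returned counts reported through getState.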
The Docker command line:
docker run -p 6080:6080 -p 9223:9223 -v c:\tmp\crawls\:/crawls/ -v c:\tmp\custom-behaviors\:/custom-behaviors/ -it webrecorder/browsertrix-crawler:latest crawl --url https://group.bnpparibas/toutes-actualites/communique-de-presse --generateWACZ final-to-warc --text --wait-until domcontentloaded --screenshot thumbnail,view,fullPage --scopeType page --customBehaviors /custom-behaviors/ --pageLimit 10 --screencastPort 9223 --profile "/crawls/profiles/group.bnpparibas.tar.gz" --behaviors siteSpecific