headless-chrome-crawler
subdomain crawl with "allowedDomains" parameter crawls top domain, too
For the domain "test.domain.com", result.response.url also includes URLs from "domain.com". I tried it both with the plain subdomain name and with a regexp. I don't understand why; shouldn't the "allowedDomains" parameter prevent URLs from other domains from being crawled?
const HCCrawler = require('headless-chrome-crawler');

(async () => {
  const crawler = await HCCrawler.launch({
    headless: true,
    args: [
      '--ignore-certificate-errors',
      '--no-sandbox',
    ],
    allowedDomains: [domain], // e.g. 'test.domain.com'
    maxDepth: 8,
    customCrawl: async (page, crawl) => {
      const result = await crawl();
      result.content = await page.content();
      return result;
    },
    onSuccess: result => {
      const values = [
        result.response.url,
      ];
    },
  });
  await crawler.queue(url);
  await crawler.onIdle();
  await crawler.close().then(() => connection.end());
  console.log('Scan completed.');
})();