headless-chrome-crawler
subdomain crawl with "allowedDomains" parameter crawls top domain, too
For the domain "test.domain.com", result.response.url also includes URLs from "domain.com". I tried it both with the plain subdomain name and with a regexp. I don't understand why; shouldn't the "allowedDomains" parameter prevent URLs from other domains from being crawled?
const HCCrawler = require('headless-chrome-crawler');

(async () => {
  const crawler = await HCCrawler.launch({
    headless: true,
    args: [
      '--ignore-certificate-errors',
      '--no-sandbox',
    ],
    allowedDomains: [domain], // e.g. 'test.domain.com'
    maxDepth: 8,
    customCrawl: async (page, crawl) => {
      const result = await crawl();
      result.content = await page.content();
      return result;
    },
    onSuccess: result => {
      const values = [
        result.response.url,
      ];
    },
  });
  await crawler.queue(url);
  await crawler.onIdle();
  await crawler.close().then(() => connection.end());
  console.log('Scan completed.');
})();