Enqueue strategy check after redirects is not working with adaptive crawler

Open B4nan opened this issue 1 year ago • 3 comments

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/playwright (PlaywrightCrawler)

Issue description

Use enqueueLinks() without any parameters in the request handler on https://crawlee.dev/. At some point the crawler escapes the domain and starts scraping everything.

https://console.apify.com/actors/PFaajt3k6oOp1YRAU/runs/0SfY5Ocr1dgQjhSIS#log

Code sample

import { PlaywrightCrawler } from 'crawlee';
import { Actor } from 'apify';

await Actor.init();

const crawler = new PlaywrightCrawler({
    proxyConfiguration: await Actor.createProxyConfiguration(),
});
crawler.router.addDefaultHandler(async (ctx) => {
    // Extract the title and first-level headings from the rendered page.
    const $ = await ctx.parseWithCheerio();
    const title = $('html title').text();
    const h1 = $('body h1').text();
    const proxy = ctx.proxyInfo?.username;
    ctx.log.info(`processing ${ctx.request.url}`, { title, h1, proxy });
    await ctx.pushData({ url: ctx.request.url, title, h1 });
    // No options passed, so the default enqueue strategy should apply.
    await ctx.enqueueLinks();
});
await crawler.run(['https://crawlee.dev/']);
await Actor.exit();

Package version

3.10.3 beta

Node.js version

20

Operating system

No response

Apify platform

  • [X] Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

B4nan avatar Jun 07 '24 11:06 B4nan

Thanks for the report! Do you know whether any page in the Crawlee docs redirects elsewhere, or is the actual enqueueStrategy check failing (and not the post-redirect check)?

janbuchar avatar Jun 07 '24 11:06 janbuchar

Looking at the storage, it feels like it's not about redirects; the "edit this page" links are in there too.

image

A few more links here. I don't think they come from a redirect either.

image

B4nan avatar Jun 07 '24 11:06 B4nan

It almost feels like the adaptive enqueueLinks is not checking the strategies at all. Maybe it's not about the post-redirect check at all.

B4nan avatar Jun 07 '24 11:06 B4nan
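
One way to narrow down the hypothesis above is to take the URLs found in the request queue storage and list those whose hostname differs from the start URL's hostname. The sketch below is only an illustration: `findOffHostnameUrls` is a hypothetical helper, not part of Crawlee, and the sample URLs are made up rather than copied from the run. It assumes the intended behaviour of a parameterless enqueueLinks() is to stay on the same hostname.

```javascript
// Hypothetical helper (not a Crawlee API): returns the URLs whose hostname
// differs from the hostname of the start URL.
function findOffHostnameUrls(startUrl, urls) {
    const expectedHostname = new URL(startUrl).hostname;
    return urls.filter((url) => {
        try {
            return new URL(url).hostname !== expectedHostname;
        } catch {
            // Unparseable URLs are reported too, so they are not silently skipped.
            return true;
        }
    });
}

// Made-up sample data standing in for URLs read from the request queue.
const enqueuedUrls = [
    'https://crawlee.dev/docs/quick-start',
    'https://github.com/apify/crawlee/edit/master/README.md',
    'https://crawlee.dev/api/core',
];

console.log(findOffHostnameUrls('https://crawlee.dev/', enqueuedUrls));
```

If the reported list contains links that were present on the page itself (such as "edit this page" links), that would point at the strategy filter being skipped rather than at the post-redirect check.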