
PlaywrightCrawler doesn't have gotoOptions

Open phughesion-h3 opened this issue 1 month ago • 2 comments

In the JavaScript version, the PuppeteerCrawler has gotoOptions, which I believe lets you choose which waitUntil state navigation should wait for. https://crawlee.dev/js/api/puppeteer-crawler#PuppeteerGoToOptions

The PlaywrightCrawler just uses the default page.goto, which defaults to "load". https://github.com/apify/crawlee-python/blob/9d4ae6439c301abe7439281a5786b8f166d67623/src/crawlee/crawlers/_playwright/_playwright_crawler.py#L300C1-L301C1

Some sites take ages to load, and I would like my request_handler to run after "domcontentloaded", since I don't need to wait for the full page to load to get what I need. As it is now, my request_handler is never called because the site has an issue that prevents it from ever finishing loading.

I don't just want to increase the timeout; I want to be able to specify which options _navigate should use when calling goto.

phughesion-h3 · Nov 24 '25 18:11
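For reference, the behaviour requested above can be sketched at the plain Playwright level, outside of Crawlee. This is only an illustration of what wait_until="domcontentloaded" means for page.goto; the helper name goto_dom_ready, the timeout_ms parameter, and the placeholder URL are assumptions made for this example, and it is not a Crawlee workaround.

```python
import asyncio


# Assumption: `page` is a Playwright `Page` from the async API. Any object with
# a compatible async `goto(url, wait_until=..., timeout=...)` method also works.
async def goto_dom_ready(page, url: str, timeout_ms: float = 30_000):
    """Navigate and return once DOMContentLoaded fires, not the full load event."""
    return await page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)


async def main() -> None:
    # Imported lazily so the helper above can be read and reused on its own.
    from playwright.async_api import async_playwright

    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        # Placeholder URL; a page that never fires "load" would still return here.
        await goto_dom_ready(page, "https://example.com")
        print(await page.title())
        await browser.close()
```

Running asyncio.run(main()) requires Playwright and a Chromium build to be installed. With the default wait_until="load", the same call would block until the load event or the timeout, which is the situation described in the issue.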

Hello @phughesion-h3 and thanks for using Crawlee for Python 🙂 In the JS version, the PuppeteerGoToOptions interface (or PlaywrightGoToOptions) is passed to the pre-navigation hooks which are allowed to modify it and thus configure how page.goto is going to behave.

As you wrote, this functionality is currently missing from the Python version - we will fix that.

One open question - configuring how page.goto is going to behave via modifying an argument to pre-navigation hooks is not optimal in terms of user discoverability - is there any better approach? @vdusek @Pijukatel @Mantisus

janbuchar · Nov 25 '25 10:11

> configuring how page.goto is going to behave via modifying an argument to pre-navigation hooks is not optimal in terms of user discoverability - is there any better approach?

Perhaps set it as one of the input parameters of PlaywrightCrawler and pass it to PlaywrightPreNavCrawlingContext so that the user can change the option parameters for specific URLs in pre_navigation_hook.

UPD: This should probably wait until PR #1474 is completed, so that the navigation options do not conflict with request_handler_timeout.

Mantisus · Nov 25 '25 14:11
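The approach suggested in the last comment (crawler-level defaults that a pre-navigation hook can override per URL) could look roughly like the sketch below. Everything in it is hypothetical: GotoOptions, PreNavContext, SketchCrawler, and resolve_goto_kwargs are stand-ins invented for illustration and are not part of the Crawlee API. It only models how the options would flow, without launching a browser.

```python
import asyncio
from dataclasses import dataclass, replace
from typing import Awaitable, Callable, List, Literal, Optional

WaitUntil = Literal["commit", "domcontentloaded", "load", "networkidle"]


@dataclass(frozen=True)
class GotoOptions:
    """Hypothetical options object; frozen so crawler defaults are never mutated."""
    wait_until: WaitUntil = "load"
    timeout_ms: Optional[float] = None


@dataclass
class PreNavContext:
    """Hypothetical stand-in for PlaywrightPreNavCrawlingContext."""
    url: str
    goto_options: GotoOptions


Hook = Callable[[PreNavContext], Awaitable[None]]


class SketchCrawler:
    """Hypothetical crawler skeleton; models only the option plumbing."""

    def __init__(self, goto_options: Optional[GotoOptions] = None) -> None:
        self._defaults = goto_options or GotoOptions()
        self._hooks: List[Hook] = []

    def pre_navigation_hook(self, hook: Hook) -> Hook:
        self._hooks.append(hook)
        return hook

    async def resolve_goto_kwargs(self, url: str) -> dict:
        # Each request starts from the crawler-level defaults...
        ctx = PreNavContext(url=url, goto_options=self._defaults)
        # ...and hooks may swap in different options for specific URLs.
        for hook in self._hooks:
            await hook(ctx)
        kwargs: dict = {"wait_until": ctx.goto_options.wait_until}
        if ctx.goto_options.timeout_ms is not None:
            kwargs["timeout"] = ctx.goto_options.timeout_ms
        return kwargs  # these would be passed on as page.goto(url, **kwargs)


crawler = SketchCrawler(goto_options=GotoOptions(wait_until="domcontentloaded"))


@crawler.pre_navigation_hook
async def relax_slow_site(ctx: PreNavContext) -> None:
    # Placeholder host name; override the defaults only for this site.
    if "slow.example" in ctx.url:
        ctx.goto_options = replace(ctx.goto_options, wait_until="commit")
```

Putting the defaults on the constructor keeps the option discoverable, while the hook still allows per-request changes. How such options should interact with request_handler_timeout is left open here, as in the thread.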