
feat(Request): add `requestHandlerTimeoutSecs` and `requestTimeoutSecs` options

Open · mstephen19 opened this pull request 2 years ago • 11 comments

closes #1485

Adds `requestHandlerTimeoutSecs` and `requestTimeoutSecs` options to a `Request` to allow for request-specific timeouts.

If these optional parameters aren't provided in `RequestOptions`, the crawler falls back to the timeouts provided in its own configuration.
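The fallback described above could be sketched as a small resolver. The option names mirror the PR; the function itself is only illustrative, not the actual implementation:

```typescript
// Illustrative sketch of the fallback: per-request timeouts win,
// otherwise the crawler-level configuration applies.
interface RequestTimeouts {
    requestHandlerTimeoutSecs?: number;
    requestTimeoutSecs?: number;
}

function resolveTimeouts(
    request: RequestTimeouts,
    crawlerDefaults: Required<RequestTimeouts>,
): Required<RequestTimeouts> {
    return {
        requestHandlerTimeoutSecs:
            request.requestHandlerTimeoutSecs ?? crawlerDefaults.requestHandlerTimeoutSecs,
        requestTimeoutSecs:
            request.requestTimeoutSecs ?? crawlerDefaults.requestTimeoutSecs,
    };
}
```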

mstephen19 avatar Sep 21 '22 22:09 mstephen19

This is not done yet. Going to work on a better solution than reassigning these values all over the place - it gets messy. These new request options need to be handled in BasicCrawler, HttpCrawler, and BrowserCrawler. Work in progress.

mstephen19 avatar Sep 22 '22 09:09 mstephen19

If we want to persist the information (e.g. to make this work on the platform even after migration), we should use the same approach as with `skipNavigation` - it gets stored in `userData.__crawlee`.
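A rough sketch of what storing the value under `userData.__crawlee` could look like. This is a toy `Request`-like class, not crawlee's actual implementation; the point is that the timeout survives serialization the same way `skipNavigation` does:

```typescript
// Toy model: the timeout lives inside userData.__crawlee, so it is
// persisted with the rest of the request state (e.g. across migrations).
interface CrawleeInternals {
    requestHandlerTimeoutSecs?: number;
    [key: string]: unknown;
}

class RequestSketch {
    constructor(public userData: { __crawlee?: CrawleeInternals } = {}) {}

    get requestHandlerTimeoutSecs(): number | undefined {
        return this.userData.__crawlee?.requestHandlerTimeoutSecs;
    }

    set requestHandlerTimeoutSecs(secs: number | undefined) {
        this.userData.__crawlee ??= {};
        this.userData.__crawlee.requestHandlerTimeoutSecs = secs;
    }
}
```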

These new request options need to be handled in BasicCrawler, HttpCrawler, and BrowserCrawler

Isn't this all about the basic crawler, since the rest build on top of it? I can imagine some protected method there (so we can override it in extension classes) to calculate the timeout for a given request instance, like `crawler.getRequestTimeout(req)`.
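The suggestion might look something like this. The class shapes and the `getRequestTimeout` name come from the comment above; everything else is hypothetical, and the method is left public here only so the sketch is easy to exercise:

```typescript
// Hypothetical sketch: the base crawler exposes one overridable hook
// that computes the effective timeout for a given request.
interface ReqLike {
    userData?: { __crawlee?: { requestHandlerTimeoutSecs?: number } };
}

class BasicCrawlerSketch {
    constructor(protected requestHandlerTimeoutSecs = 60) {}

    // would be `protected` in the real class; subclasses override it
    getRequestTimeout(req: ReqLike): number {
        return req.userData?.__crawlee?.requestHandlerTimeoutSecs
            ?? this.requestHandlerTimeoutSecs;
    }
}

class BrowserCrawlerSketch extends BasicCrawlerSketch {
    // e.g. a browser crawler could account for navigation time here
    override getRequestTimeout(req: ReqLike): number {
        return super.getRequestTimeout(req);
    }
}
```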

B4nan avatar Sep 23 '22 06:09 B4nan

@B4nan What about making it look like this:

router.addHandler('label-a', async (ctx) => {
    ctx.log.info('...');
}, { timeoutSecs: 5 });

szmarczak avatar Sep 23 '22 07:09 szmarczak

@B4nan The problem (in my opinion) with Szymon's proposal is that it would only allow the user to specify a label-specific `requestHandlerTimeoutSecs`. If the options are right on `RequestOptions`, the user can control both `requestHandlerTimeoutSecs` AND `navigationTimeoutSecs` for specific requests. It allows for more control.

mstephen19 avatar Sep 23 '22 07:09 mstephen19

I would rather support something like this, which feels more natural (and having an options parameter after a callback is weird DX):

router.addHandler('label-a', async ({ request, log }) => {
    request.timeoutSecs = 5;
    log.info('...');
});

(and this means those options would be part of RequestOptions too, I can see the value in the nav timeout)

But we can support both ways in the end if it makes sense.

Keep in mind that a request is labeled, so the above is also a "label-specific timeout"; I don't see a difference from Szymon's version in this regard.

B4nan avatar Sep 23 '22 07:09 B4nan

It's different because you could do something like this:

await crawler.run([{ url: 'foo', label: 'TEST', timeoutSecs: 5 }, { url: 'foo', label: 'TEST', timeoutSecs: 10 }])

Same label, but different `requestHandlerTimeoutSecs`, since the timeout is configured on the request object instead of the handler.

mstephen19 avatar Sep 23 '22 07:09 mstephen19

Yes, and one of them would have priority - two ways to do the same thing.

(note that the proposal is about `router.addHandler`, not `crawler.run`, but I guess we are aligned on that)

B4nan avatar Sep 23 '22 07:09 B4nan

Yeah, not talking specifically about addHandler or run, just about how these should live directly on the Request object. I'll take a look at what we did for `skipNavigation` and refactor the changes in this PR. Will also rename the `requestTimeoutSecs` option to `navigationTimeoutSecs` to match the crawler options.

mstephen19 avatar Sep 23 '22 07:09 mstephen19

Right, I would definitely do the request options first; then we can think about the router API. The initial goal here, to me, was to allow modifying this e.g. via router middleware - there you can label requests as well as modify the timeouts, before the correct request handler is picked and executed.
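The middleware idea can be modeled with a toy router (this is a conceptual sketch, not crawlee's actual Router API): middlewares run first and may relabel the request or adjust its timeout, and only afterwards is the matching handler picked.

```typescript
// Conceptual sketch of "middleware runs before handler selection".
interface Ctx {
    request: { url: string; label?: string; timeoutSecs?: number };
}
type Handler = (ctx: Ctx) => void;

class RouterSketch {
    private middlewares: Handler[] = [];
    private handlers = new Map<string, Handler>();

    use(mw: Handler): void { this.middlewares.push(mw); }
    addHandler(label: string, handler: Handler): void { this.handlers.set(label, handler); }

    route(ctx: Ctx): void {
        // middlewares may change the label and/or the timeout...
        for (const mw of this.middlewares) mw(ctx);
        // ...and only then is the handler looked up by label
        this.handlers.get(ctx.request.label ?? 'default')?.(ctx);
    }
}
```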

B4nan avatar Sep 23 '22 08:09 B4nan

This feature requires changes not only in the `Request` object, but also in BasicCrawler, HttpCrawler, and BrowserCrawler. BasicCrawler has `requestHandlerTimeoutSecs`, which is used differently in HttpCrawler (which has its own `userRequestHandlerTimeoutSecs`) than in BasicCrawler. It's a tightly coupled OOP mess with state values and overrides all over the place, and many things need to change in many places to make this work smoothly in all crawlers while also preserving the previous behavior. As someone not completely familiar with this codebase, I could easily forget something and unknowingly break stuff. @szmarczak would you feel comfortable taking a crack at it?

mstephen19 avatar Sep 23 '22 09:09 mstephen19

You lost me at OOP mess :D

B4nan avatar Sep 23 '22 13:09 B4nan