feat(Request): add `requestHandlerTimeoutSecs` and `requestTimeoutSecs` options
closes #1485
Adds `requestHandlerTimeoutSecs` and `requestTimeoutSecs` options to a request to allow for request-specific timeouts. If these optional parameters aren't provided in `RequestOptions`, the crawler will fall back to the timeouts provided in its configuration.
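For illustration only (the API in this PR is still in flux), the per-request values would take precedence, with the crawler-wide options acting as the fallback; roughly along these lines, using `CheerioCrawler` as an example:

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // crawler-wide default, used whenever a request doesn't override it
    requestHandlerTimeoutSecs: 60,
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
});

await crawler.run([
    // falls back to the crawler-wide 60s handler timeout
    { url: 'https://example.com/fast' },
    // request-specific overrides (the options proposed in this PR)
    { url: 'https://example.com/slow', requestHandlerTimeoutSecs: 180, requestTimeoutSecs: 30 },
]);
```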
This is not done yet. I'm going to work on a better solution than reassigning these values all over the place - it gets messy. These new request options need to be handled in `BasicCrawler`, `HttpCrawler`, and `BrowserCrawler`. Work in progress.
If we want to persist the information (e.g. to make this work on the platform even after migration), we should use the same approach as with `skipNavigation` - it gets stored in `userData.__crawlee`.
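A minimal sketch of that pattern, assuming the same getter/setter approach as `skipNavigation` (the real `Request` class and the internal `__crawlee` shape differ in detail):

```ts
// Sketch only - the real crawlee Request class has many more members.
interface CrawleeInternalState {
    skipNavigation?: boolean;
    requestHandlerTimeoutSecs?: number;
}

class RequestSketch {
    userData: { __crawlee?: CrawleeInternalState } & Record<string, unknown> = {};

    // Stored in userData.__crawlee, so the value survives request queue
    // serialization (e.g. a platform migration), same as skipNavigation.
    get requestHandlerTimeoutSecs(): number | undefined {
        return this.userData.__crawlee?.requestHandlerTimeoutSecs;
    }

    set requestHandlerTimeoutSecs(value: number | undefined) {
        (this.userData.__crawlee ??= {}).requestHandlerTimeoutSecs = value;
    }
}
```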
> These new request options need to be handled in `BasicCrawler`, `HttpCrawler`, and `BrowserCrawler`
Isn't this all about the basic crawler? The rest builds on top of it. I can imagine some protected method there (so we can override it in extension classes) to calculate the timeout for a given request instance, like `crawler.getRequestTimeout(req)`.
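A rough sketch of such a hook; `getRequestTimeout` and the surrounding names are illustrative only, not existing crawlee API:

```ts
// Illustrative only - the real BasicCrawler has far more state and options.
interface TimeoutAwareRequest {
    requestHandlerTimeoutSecs?: number;
}

class BasicCrawlerSketch {
    constructor(protected requestHandlerTimeoutMillis = 60_000) {}

    /**
     * Resolves the handler timeout for a single request: a per-request value
     * wins, otherwise the crawler-wide configuration applies. Extension
     * classes (HttpCrawler, BrowserCrawler) could override this.
     */
    protected getRequestTimeout(request: TimeoutAwareRequest): number {
        return request.requestHandlerTimeoutSecs !== undefined
            ? request.requestHandlerTimeoutSecs * 1000
            : this.requestHandlerTimeoutMillis;
    }
}
```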
@B4nan What about making it look like this:
router.addHandler('label-a', async (ctx) => {
    ctx.log.info('...');
}, { timeoutSecs: 5 });
@B4nan The problem (in my opinion) with the idea Szymon is proposing is that it would only allow the user to specify a label-specific `requestHandlerTimeoutSecs`. If the options are right within `RequestOptions`, we can allow the user to control `requestHandlerTimeoutSecs` AND `navigationTimeoutSecs` for specific requests. It allows for more control.
I would rather support something like this; it feels more natural (and having an options parameter after a callback is weird DX):
router.addHandler('label-a', async ({ request, log }) => {
    request.timeoutSecs = 5;
    log.info('...');
});
(and this means those options would be part of `RequestOptions` too, I can see the value in the nav timeout)
But we can support both ways in the end if it makes sense.
Keep in mind that a request is labeled, so the above is also a "label-specific timeout"; I don't see a difference from Szymon's version in this regard.
It's different because you could do something like this:
await crawler.run([
    { url: 'foo', label: 'TEST', timeoutSecs: 5 },
    { url: 'foo', label: 'TEST', timeoutSecs: 10 },
]);
Same label, but different `requestHandlerTimeoutSecs`, since the timeout is configured on the request object instead of the handler.
Yes, and one of them would have priority; that's two ways to do the same thing.
(note that the proposal is about `router.addHandler`, not `crawler.run`, but I guess we are aligned on that)
Yeah, I'm not talking specifically about `addHandler` or `run`, but just about how these should be directly on the `Request` object. I'll take a look at what we did for `skipNavigation` and refactor the changes in this PR. I'll also rename the `requestTimeoutSecs` option to `navigationTimeoutSecs` to match the crawler options.
Right, I would definitely do the request options first, then we can think about the router API. The initial goal here, to me, was to allow modifying this e.g. via router middleware - there you can label stuff as well as modify the timeouts before the correct request handler is picked and executed.
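For example, assuming a `router.use()`-style middleware hook (shown purely as an illustration of the idea, using the per-request field proposed in this PR):

```ts
import { createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

// Middleware runs before the matched handler, so it can adjust
// per-request timeouts based on the label or URL.
router.use(async ({ request }) => {
    if (request.label === 'HEAVY') {
        // field proposed in this PR, not yet part of the Request class
        request.requestHandlerTimeoutSecs = 180;
    }
});

router.addHandler('HEAVY', async ({ request, log }) => {
    log.info(`Handling ${request.url} with an extended timeout`);
});
```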
This feature requires not only changes in the `Request` object, but also in `BasicCrawler`, `HttpCrawler`, and `BrowserCrawler`. `BasicCrawler` has `requestHandlerTimeoutSecs`, which is used differently in both `HttpCrawler` (this crawler has its own `userRequestHandlerTimeoutSecs`) and `BasicCrawler`. It's a tightly coupled OOP mess with state values and overrides all over the place, and many things need to be changed in many places to make this work smoothly in all crawlers while also maintaining previous behaviors.
As someone not completely familiar with this codebase, I could easily forget something and unknowingly break stuff. @szmarczak would you feel comfortable taking a crack at it?
You lost me at OOP mess :D