Mysterious timeout hard-kills `CheerioCrawler` script
As found in one of the community Actors:
This doesn't seem to be related to the `request timed out after 30 seconds` error above. Instead, it looks as if the `tryCancel()` call from the `@apify/timeout` package is reading the wrong data, causing unexpected behavior.
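For context, `@apify/timeout` wraps a handler in a timeout and expects the handler to call `tryCancel()` after each `await`, so a fired timeout can abort execution between async steps. A rough sketch of that usage pattern based on the package's documented API (the `navigate`/`processRequest` steps are hypothetical placeholders, not crawlee's actual internals):

```ts
import { addTimeoutToPromise, tryCancel } from '@apify/timeout';

// Hypothetical async steps standing in for crawlee's navigation and
// request-processing phases.
const navigate = async () => { /* fetch the page */ };
const processRequest = async () => { /* run the user's requestHandler */ };

async function handler() {
    await navigate();
    tryCancel(); // throws if the surrounding timeout has already fired
    await processRequest();
    tryCancel();
}

// Wrap the handler in a 30-second timeout. The timeout state lives in an
// AsyncLocalStorage, which is what tryCancel() reads between the awaits.
await addTimeoutToPromise(
    () => handler(),
    30_000,
    'handler timed out after 30 seconds.',
);
```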
Happening with `crawlee` 3.9.1 and the `@apify/timeout` package.
This seems to be caused by the latest changes to the `AutoscaledPool` after all (see the PR). By calling `addRequests()` in a request handler, we now notify the `AutoscaledPool`, which runs another task (`BasicCrawler` navigation + request processing), all of it on top of the original `requestHandler`'s stack:
If the original `requestHandler`'s timeout runs out while the nested request is still being processed, that most likely causes the issue above.
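A minimal sketch of the problematic pattern, assuming a plain `CheerioCrawler` on crawlee 3.9.1 (the URLs and handler body are placeholders, not the community Actor's code):

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandlerTimeoutSecs: 30,
    async requestHandler({ request, $, crawler }) {
        // ... extract data from the page ...

        // Enqueueing from inside the handler notifies the AutoscaledPool,
        // which (on 3.9.1) can start processing the new request right here,
        // on top of this handler's stack.
        await crawler.addRequests([
            { url: 'https://example.com/next' }, // hypothetical follow-up URL
        ]);

        // If this handler's 30s timeout fires while that nested task is
        // still running, tryCancel() ends up reading the wrong timeout state
        // and the process is hard-killed.
    },
});

await crawler.run(['https://example.com/start']); // hypothetical entry URL
```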
Needless to say, the pre-3.9.1 version's stack doesn't have this nesting:
Both traces were captured at the place where the exception above is thrown.
See https://github.com/apify/crawlee/pull/2425#issuecomment-2061268072 for more AsyncLocalStorage experiments.
Is there some workaround or a version that I can pin to avoid this? The Expedia reviews scraper is hitting this quite consistently.
What Crawlee version are you using? This should be mitigated in the current latest (3.9.2).
Yeah, I'm on 3.9.1, as mentioned above.
It seems that the fix didn't help; the issue persists in `crawlee` 3.9.2.
I didn't realize that 3.9.2 was supposed to fix this. Will upgrade, thanks.
Also, this is probably worth a Slack post, since it can affect any crawler on 3.9.1, right?
Closing this - mitigated by #2425. The issue is not present in the latest Crawlee versions anymore.