Mysterious timeout hard-kills `CheerioCrawler` script
As found in one of the community Actors:
This doesn't seem to be related to the `request timed out after 30 seconds` error above. Instead, it looks as if the `tryCancel()` call from the `@apify/timeout` package is reading the wrong data, causing unexpected behavior.
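For context, `@apify/timeout` wraps a handler in a timeout and expects the handler to call `tryCancel()` after each `await`, so a fired timeout can abort execution between async steps. A rough sketch of that usage pattern based on the package's documented API (the `navigate`/`processRequest` steps are hypothetical placeholders, not crawlee's actual internals):

```ts
import { addTimeoutToPromise, tryCancel } from '@apify/timeout';

// Hypothetical async steps standing in for crawlee's navigation and
// request-processing phases.
const navigate = async () => { /* fetch the page */ };
const processRequest = async () => { /* run the user's requestHandler */ };

async function handler() {
    await navigate();
    tryCancel(); // throws if the surrounding timeout has already fired
    await processRequest();
    tryCancel();
}

// Wrap the handler in a 30-second timeout. The timeout state lives in an
// AsyncLocalStorage, which is what tryCancel() reads between the awaits.
await addTimeoutToPromise(
    () => handler(),
    30_000,
    'handler timed out after 30 seconds.',
);
```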
Happening with `crawlee` 3.9.1 and the `@apify/timeout` package.
This seems to be caused by the latest changes to the `AutoscaledPool` after all (see the PR). By calling `addRequests()` in a request handler, we now notify the `AutoscaledPool`, which runs another task (`BasicCrawler` navigation + request processing), all of it on top of the original `requestHandler`'s stack:
If the original `requestHandler`'s timeout runs out while the nested request is still being processed, that most likely causes the issue above.
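A minimal sketch of the problematic pattern, assuming a plain `CheerioCrawler` on crawlee 3.9.1 (the URLs and handler body are placeholders, not the community Actor's code):

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandlerTimeoutSecs: 30,
    async requestHandler({ request, $, crawler }) {
        // ... extract data from the page ...

        // Enqueueing from inside the handler notifies the AutoscaledPool,
        // which (on 3.9.1) can start processing the new request right here,
        // on top of this handler's stack.
        await crawler.addRequests([
            { url: 'https://example.com/next' }, // hypothetical follow-up URL
        ]);

        // If this handler's 30s timeout fires while that nested task is
        // still running, tryCancel() ends up reading the wrong timeout state
        // and the process is hard-killed.
    },
});

await crawler.run(['https://example.com/start']); // hypothetical entry URL
```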
Needless to say, the pre-3.9.1 version's stack doesn't have this nesting:
Both traces were captured at the place where the exception above is thrown.
See https://github.com/apify/crawlee/pull/2425#issuecomment-2061268072 for more AsyncLocalStorage experiments.
Is there some workaround or a version that I can pin to avoid this? The Expedia reviews scraper is hitting this quite consistently.
What Crawlee version are you using? This should be mitigated in the current latest (3.9.2).
Yeah, I'm on 3.9.1, as mentioned above.
It seems that the fix didn't help; the issue persists in `crawlee` 3.9.2.
I didn't realize that 3.9.2 was supposed to fix this. Will upgrade, thanks.
Also, this is probably worth a Slack post, since it can affect any crawler on 3.9.1, right?
Closing this - mitigated by #2425. The issue is not present in the latest Crawlee versions anymore.