
`FileDownload` waits indefinitely on unconsumed stream

Open · barjin opened this issue 1 month ago · 1 comment

Due to the design of Crawlee request handlers, the user-supplied request handler can return before the response stream is consumed. Because of this, the crawler waits until the stream is fully read before considering the request processed (link).

If the user decides not to consume the stream at all, the crawler will hang indefinitely.

import { FileDownload } from '@crawlee/http';

const crawler = new FileDownload({
    streamHandler: ({ request }) => {
        // The handler returns without ever touching the response stream,
        // so the crawler waits on it forever.
        console.log(`Downloading: ${request.url}`);
    },
});

await crawler.run(['https://crawlee.dev/img/crawlee-light.svg']);
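For contrast, a handler that does drain the stream lets the crawler finish. A minimal sketch, assuming the streamHandler context exposes the response body as a `stream` property (the exact context shape is not shown in this issue):

import { FileDownload } from '@crawlee/http';
import { createWriteStream } from 'node:fs';
import { pipeline } from 'node:stream/promises';

const crawler = new FileDownload({
    streamHandler: async ({ stream, request }) => {
        // Fully consuming the stream is what lets the crawler
        // mark the request as processed.
        await pipeline(stream, createWriteStream('crawlee-light.svg'));
        console.log(`Downloaded: ${request.url}`);
    },
});

await crawler.run(['https://crawlee.dev/img/crawlee-light.svg']);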

barjin · Nov 27 '25, 11:11

Nitpick - in v4, we don't have a streamHandler anymore - it's all done in requestHandler.

Maybe we should, upon returning from the request handler, check if the stream has been touched at all. If not, we can either destroy it or throw a critical error, whichever feels better.
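A minimal sketch of that check, assuming the stream handed to the handler is a Node `Readable` (whose `readableDidRead` flag stays false until the first read); the hook name is hypothetical:

import { Readable } from 'node:stream';

// Hypothetical post-handler hook, not actual Crawlee code.
function onHandlerReturned(stream: Readable): void {
    // readableDidRead flips to true after the first successful read.
    if (stream.readableDidRead) return; // consumption started; keep waiting for it

    // Untouched: either release the connection quietly...
    stream.destroy();
    // ...or make the mistake loud instead:
    // throw new CriticalError('Request handler returned without reading the response stream.');
}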

If we see that somebody started using it, we should probably wait until it's read (there can be a timeout for consuming the whole thing, or we can have a timer that "resets" each time somebody reads from the stream).
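The resetting-timer variant could look roughly like this. Patching the public read() method should catch both paused-mode reads and the calls Node makes internally in flowing mode; the wrapper name and timeout are assumptions:

import { Readable } from 'node:stream';

// Hypothetical wrapper: aborts the stream if no read happens for idleMs.
function withReadIdleTimeout(stream: Readable, idleMs: number): Readable {
    let timer: NodeJS.Timeout | undefined;
    const arm = () => {
        clearTimeout(timer);
        timer = setTimeout(() => {
            stream.destroy(new Error(`Stream idle for ${idleMs} ms while being consumed`));
        }, idleMs);
    };

    const originalRead = stream.read.bind(stream);
    stream.read = (size?: number) => {
        arm(); // reset the idle timer on every read
        return originalRead(size);
    };

    stream.once('close', () => clearTimeout(timer));
    arm();
    return stream;
}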

All of this demands either some Proxy shenanigans (completely warranted IMO), or wrapping the stream in a custom class.
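For the Proxy route, a sketch of the kind of instrumentation meant here; which property accesses count as "touching" the stream is a judgment call, and all names are hypothetical:

import { Readable } from 'node:stream';

// Treat these accesses as the start of consumption.
const CONSUMING = new Set<string | symbol>(['read', 'pipe', 'resume', Symbol.asyncIterator]);

// Hypothetical helper: hands out a proxied stream plus a "was it touched?" probe.
function instrument(stream: Readable): { stream: Readable; wasTouched: () => boolean } {
    let touched = false;
    const proxied = new Proxy(stream, {
        get(target, prop, receiver) {
            if (CONSUMING.has(prop)) touched = true;
            const value = Reflect.get(target, prop, receiver);
            // Rebind methods to the real stream so its internals see the right `this`.
            return typeof value === 'function' ? value.bind(target) : value;
        },
    });
    return { stream: proxied, wasTouched: () => touched };
}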

One last concern: do we expose the Response along with the body stream to the request handler in other crawler types? And what about sendRequest? It would sting if we could leak streams through that as well.

janbuchar · Nov 27 '25, 12:11