`FileDownload` waits indefinitely on unconsumed stream
Due to the design of Crawlee request handlers, the user-supplied request handler can return before the response stream is consumed. Because of this, we wait until the stream is fully read before considering the request processed (link).
If the user decides not to consume the stream at all, the crawler will hang indefinitely:

```ts
import { FileDownload } from '@crawlee/http';

const crawler = new FileDownload({
    streamHandler: ({ request }) => {
        // The response stream is never touched here, so the crawler hangs.
        console.log(`Downloading: ${request.url}`);
    },
});

await crawler.run(['https://crawlee.dev/img/crawlee-light.svg']);
```
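For contrast, a handler that drains the stream lets the crawler finish normally. This is a sketch that assumes the `streamHandler` context exposes the response body under the name `stream`:

```ts
import { FileDownload } from '@crawlee/http';

const crawler = new FileDownload({
    // Assumption: the handler context exposes the body stream as `stream`.
    streamHandler: async ({ request, stream }) => {
        console.log(`Downloading: ${request.url}`);
        // Drain the stream so the crawler can consider the request processed.
        for await (const chunk of stream) {
            // e.g. write chunks to disk; dropped here for brevity
        }
    },
});

await crawler.run(['https://crawlee.dev/img/crawlee-light.svg']);
```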
Nitpick: in v4 we don't have a `streamHandler` anymore; it's all done in `requestHandler`.
Maybe we should check, once the request handler returns, whether the stream has been touched at all. If not, we can either destroy it or throw a critical error, whichever feels better.
If we see that somebody has started consuming it, we should probably wait until it's fully read (with a timeout for consuming the whole thing, or a timer that resets each time somebody reads from the stream).
All of this demands either some Proxy shenanigans (completely warranted IMO), or wrapping the stream in a custom class.
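A minimal sketch of the wrapper idea. Everything here is hypothetical, not Crawlee API: the names `monitorStream` and `waitForConsumption`, the 30 s default, and the set of "read-ish" properties. It detects "touched" via a Proxy and uses a capped wait; a fancier version would reset the timer on every read instead:

```ts
import { Readable } from 'node:stream';
import { finished } from 'node:stream/promises';

// Sketch only: wrap the body stream so we can tell, after the request
// handler returns, whether the user ever touched it.
function monitorStream(inner: Readable) {
    let touched = false;
    const proxied = new Proxy(inner, {
        get(target, prop, receiver) {
            // Coarse heuristic: accessing any read-ish API counts as "touched".
            if (prop === 'read' || prop === 'pipe' || prop === 'on'
                || prop === 'once' || prop === Symbol.asyncIterator) {
                touched = true;
            }
            const value = Reflect.get(target, prop, receiver);
            return typeof value === 'function' ? value.bind(target) : value;
        },
    });
    return { proxied, wasTouched: () => touched };
}

// Called by the crawler right after the user handler resolves.
async function waitForConsumption(
    inner: Readable,
    wasTouched: () => boolean,
    timeoutMs = 30_000,
) {
    if (!wasTouched()) {
        // Nobody ever looked at the stream: destroy it (or throw a
        // critical error) instead of hanging forever.
        inner.destroy();
        return;
    }
    // Somebody started reading: wait for the stream to finish, but cap the
    // wait so a stalled consumer cannot block the crawler indefinitely.
    await finished(inner, { signal: AbortSignal.timeout(timeoutMs) });
}
```

The crawler would hand `proxied` to the user handler and call `waitForConsumption` after it resolves, which covers both the destroy-or-throw and the bounded-wait ideas above.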
One last concern: do we expose the `Response` along with the body stream to the request handler in other crawler types? And what about `sendRequest`? It would sting if we could leak streams because of that.