crawlee feat: add `FileDownload` "crawler"

Adds a new package, @crawlee/file-download, which lifts HttpCrawler's MIME type restrictions and lets users download arbitrary files.

Aside from the regular requestHandler, this crawler introduces streamHandler, which passes a ReadableStream with the downloaded data to the user handler.
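To illustrate the difference between the two handler styles, here is a minimal sketch that uses only Node's standard library. It does not use the crawler from this PR: `fakeDownload`, `readFully` and `countBytes` are hypothetical names, and the in-memory stream merely stands in for a real download.

```typescript
import { Readable } from 'node:stream';

// Stand-in for a download: a readable stream emitting three binary chunks.
// In the PR, the crawler would supply the real response stream instead.
function fakeDownload(): Readable {
    return Readable.from([Buffer.from('ab'), Buffer.from('cd'), Buffer.from('ef')]);
}

// requestHandler-style consumption: wait for the whole body, then act on it.
async function readFully(stream: Readable): Promise<Buffer> {
    const chunks: Buffer[] = [];
    for await (const chunk of stream) {
        chunks.push(Buffer.from(chunk));
    }
    return Buffer.concat(chunks);
}

// streamHandler-style consumption: act on each chunk as it arrives,
// without ever holding the complete file in memory.
async function countBytes(stream: Readable): Promise<number> {
    let total = 0;
    for await (const chunk of stream) {
        total += (chunk as Buffer).length;
    }
    return total;
}
```

The buffered variant is simpler when the file is small; the streamed variant is the one that scales to large files such as videos.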

Apr 29 '24 12:04 barjin

Do we really need a new package here? If there are no additional dependencies, I would just expose the new class in the HTTP crawler package.

Apr 29 '24 13:04 B4nan

Yup, I don't have any strong opinion here - the (very short) discussion was at https://github.com/apify/store-website-content-crawler/issues/242

Apr 29 '24 13:04 barjin

The latest commit adds streamHandler, which is mutually exclusive with requestHandler in the constructor and lets users work with the data stream instead of the fully downloaded data.

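import { createWriteStream } from 'node:fs';
import { pipeline } from 'node:stream';
import { FileDownload } from '@crawlee/file-download';

const crawler = new FileDownload({
    async streamHandler({ stream }) {
        // Pipe the response stream straight to disk instead of buffering it in memory.
        const file = createWriteStream('test.webm');
        pipeline(stream, file, (err) => {
            if (err) {
                console.error('Pipeline failed', err);
            }
        });
    },
});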

This is a direct port from WCC (Website Content Crawler). What do you think: is it worth keeping here, or should we rely on requestHandler only?
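For reference, the mutual exclusivity of the two handlers could be enforced roughly like this. This is only a sketch, not the code from the commit: `FileDownloadSketch`, `SketchOptions` and the context shapes are hypothetical, and the fallback to request mode when no streamHandler is given is an assumption.

```typescript
import { Readable } from 'node:stream';

// Hypothetical option and context shapes, named after the handlers in this PR.
interface SketchOptions {
    requestHandler?: (ctx: { body: Buffer }) => Promise<void>;
    streamHandler?: (ctx: { stream: Readable }) => Promise<void>;
}

// Hypothetical stand-in for the crawler class; only the option check is modeled.
class FileDownloadSketch {
    readonly mode: 'request' | 'stream';

    constructor(options: SketchOptions) {
        const { requestHandler, streamHandler } = options;
        // Reject configurations that provide both handlers at once.
        if (requestHandler && streamHandler) {
            throw new Error('Provide either requestHandler or streamHandler, not both.');
        }
        // Assumption: without a streamHandler the crawler runs in request mode.
        this.mode = streamHandler ? 'stream' : 'request';
    }
}
```

Failing in the constructor surfaces the misconfiguration before any request is made, rather than at the first download.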

May 02 '24 14:05 barjin