feat: add `FileDownload` "crawler"

Open barjin opened this issue 1 year ago • 3 comments

Adds a new package, @crawlee/file-download, which lifts HttpCrawler's MIME type restrictions and lets users download arbitrary files.

Aside from the regular requestHandler, this crawler introduces streamHandler, which passes a ReadableStream with the downloaded data to the user handler.

barjin avatar Apr 29 '24 12:04 barjin

Do we really need a new package here? If there are no additional dependencies, I would just expose the new class in the HTTP crawler package.

B4nan avatar Apr 29 '24 13:04 B4nan

Yup, I don't have any strong opinion here - the (very short) discussion was at https://github.com/apify/store-website-content-crawler/issues/242

barjin avatar Apr 29 '24 13:04 barjin

The latest commit adds streamHandler, which is mutually exclusive with requestHandler in the constructor and lets users work with the data stream instead of the fully downloaded data.

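import { createWriteStream } from 'node:fs';
import { pipeline } from 'node:stream/promises';
import { FileDownload } from '@crawlee/file-download'; // package name as proposed in this PR

const crawler = new FileDownload({
    async streamHandler({ stream }) {
        const file = createWriteStream('test.webm');
        // Await the pipeline so the handler resolves only once the file is fully written;
        // a failed pipeline rejects and surfaces as a handler error.
        await pipeline(stream, file);
    },
});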

This is a direct port from WCC (Website Content Crawler). What do you think, is it worth keeping here, or should we rely on requestHandler only?
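
To make the trade-off behind that question concrete, here is a minimal sketch that uses only Node's built-in streams and no Crawlee code. The helpers `consumeAsStream`, `consumeAsBuffer`, and `fakeDownload` are hypothetical names made up for this illustration. They only approximate the difference between handling the data stream and handling the fully downloaded data, and they assume the buffered path holds the whole body in memory at once.

```javascript
const { Readable, Writable } = require('node:stream');
const { pipeline } = require('node:stream/promises');

// Hypothetical helper: drains the stream chunk by chunk, roughly what a
// streamHandler can do. Only one chunk is held in memory at a time.
async function consumeAsStream(stream) {
    let bytes = 0;
    let peakHeld = 0;
    const sink = new Writable({
        write(chunk, _encoding, callback) {
            bytes += chunk.length;
            peakHeld = Math.max(peakHeld, chunk.length);
            callback();
        },
    });
    await pipeline(stream, sink);
    return { bytes, peakHeld };
}

// Hypothetical helper: collects the whole body before using it, which is
// assumed here to resemble working with the fully downloaded data.
async function consumeAsBuffer(stream) {
    const chunks = [];
    for await (const chunk of stream) {
        chunks.push(chunk);
    }
    const body = Buffer.concat(chunks);
    return { bytes: body.length, peakHeld: body.length };
}

// Stand-in for a download: four chunks of 1 KiB each.
function fakeDownload() {
    const chunks = Array.from({ length: 4 }, () => Buffer.alloc(1024));
    return Readable.from(chunks);
}
```

Both helpers report the same total byte count for `fakeDownload()`, but the streaming one never holds more than a single chunk, which is the main argument for keeping a stream-based handler for large files.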

barjin avatar May 02 '24 14:05 barjin