feat: add `FileDownload` "crawler"
Adds a new package, `@crawlee/file-download`, which lifts `HttpCrawler`'s MIME type restrictions and lets users download arbitrary files.
Aside from the regular `requestHandler`, this crawler introduces a `streamHandler`, which passes a `ReadableStream` with the downloaded data to the user handler.
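For comparison, the buffered variant could look like the sketch below. This assumes the handler context mirrors `HttpCrawler`'s, with `body` holding the fully downloaded payload; the file name and URL are hypothetical:

```ts
import { writeFile } from 'node:fs/promises';

import { FileDownload } from '@crawlee/file-download';

// Minimal sketch: assumes a `body` context property as in HttpCrawler,
// containing the fully buffered response.
const crawler = new FileDownload({
    async requestHandler({ body }) {
        // `body` holds the whole file in memory before the handler runs.
        await writeFile('downloaded.bin', body);
    },
});

await crawler.run(['https://example.com/file.bin']);
```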
Do we really need a new package here? If there are no additional dependencies, I would just expose the new class in the HTTP crawler package.
Yup, I don't have any strong opinion here - the (very short) discussion was at https://github.com/apify/store-website-content-crawler/issues/242
The latest commit adds `streamHandler`, which is mutually exclusive with `requestHandler` in the constructor and lets users work with the data stream instead of the fully downloaded data:
```ts
import { createWriteStream } from 'node:fs';
import { pipeline } from 'node:stream/promises';

import { FileDownload } from '@crawlee/file-download';

const crawler = new FileDownload({
    // `stream` carries the response body; it is never buffered in memory in full.
    async streamHandler({ stream }) {
        const file = createWriteStream('test.webm');
        // Awaiting the pipeline keeps the handler alive until the download
        // has been fully written to disk (errors reject and can be retried).
        await pipeline(stream, file);
    },
});
```
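Running it would then be the standard crawler flow (URL hypothetical):

```ts
await crawler.run(['https://example.com/video.webm']);
```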
This is a direct port from WCC. What do you think: is it worth keeping it here, or should we rely on `requestHandler` only?