
Downloading Files

s0kil opened this issue 4 years ago · 11 comments

Are there any examples of saving multiple files? For example, saving multiple images for each Crawly request. So far I have come across the WriteToFile pipeline, which seems to be intended for saving data into a single file (CSV, JSON, etc.).

s0kil avatar Apr 03 '20 22:04 s0kil

If you do not mind, could you also cover streaming large files to disk?

s0kil avatar Apr 03 '20 22:04 s0kil

https://stackoverflow.com/questions/30267943/elixir-download-a-file-image-from-a-url

Use a custom pipeline to manage the downloading. In your spider, scrape the media URLs and pass them under a nested map key, then pattern match on that key in the pipeline.

https://hexdocs.pm/crawly/basic_concepts.html#custom-item-pipelines

Crawly processes items sequentially, but for long downloads you might want to offload them to a queue or use an async Task.
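As a rough illustration, such a custom pipeline could look like the sketch below. This is not part of Crawly: the module name, the `:image_urls` item key, and the `:dest` option are assumptions for the example, and HTTPoison is used for the HTTP calls.

```elixir
defmodule MyApp.Pipelines.DownloadImages do
  @moduledoc """
  Sketch of a custom item pipeline that downloads every URL found
  under a hypothetical :image_urls key of a scraped item.
  """
  @behaviour Crawly.Pipeline

  def run(item, state, opts \\ [])

  def run(%{image_urls: urls} = item, state, opts) do
    dest = Keyword.get(opts, :dest, "/tmp")

    Enum.each(urls, fn url ->
      # For very large files, consider spawning a supervised Task here
      # so the pipeline (and its worker) is not tied up by the download.
      case HTTPoison.get(url) do
        {:ok, %HTTPoison.Response{status_code: 200, body: body}} ->
          File.write!(Path.join(dest, Path.basename(url)), body)

        _other ->
          :ok
      end
    end)

    {item, state}
  end

  # Items without image URLs pass through untouched.
  def run(item, state, _opts), do: {item, state}
end
```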

Ziinc avatar Apr 04 '20 06:04 Ziinc

@s0kil I think @Ziinc gave a good answer; a pipeline is a good way to go! Otherwise, in my own projects I download media directly from the parse_item callback. Crawly is itself a queue-management system, so technically your worker will just spend a bit more time downloading the image, that's it.
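For the parse_item approach, a minimal sketch (the image selector, storage path, and use of Floki/HTTPoison are assumptions for illustration, not a prescribed Crawly API):

```elixir
# Inside a spider module: download images while parsing the response.
def parse_item(response) do
  {:ok, document} = Floki.parse_document(response.body)

  # Collect image URLs from the page (selector is an assumption).
  urls = Floki.attribute(document, "img", "src")

  Enum.each(urls, fn url ->
    case HTTPoison.get(url) do
      {:ok, %HTTPoison.Response{status_code: 200, body: body}} ->
        File.write!(Path.join("/tmp/images", Path.basename(url)), body)

      _error ->
        :ok
    end
  end)

  %Crawly.ParsedItem{items: [%{image_urls: urls}], requests: []}
end
```

The worker doing this download is busy for its duration, but other Crawly workers keep crawling, as discussed below.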

@Ziinc shall we create a pipeline capable of auto-downloading images? e.g. `{Crawly.Pipelines.DownloadMedia, field: :image, dest: "/folder_name"}`?

oltarasenko avatar Apr 05 '20 17:04 oltarasenko

@oltarasenko Will downloading a large file in parse_item block the spider from continuing to crawl and parse?

s0kil avatar Apr 05 '20 17:04 s0kil

No, it does not block Crawly itself; only the one worker doing the download is busy, and all other workers remain operational. (Compare with Scrapy, where non-reactor-based downloads block the world; Crawly handles this without problems.)

oltarasenko avatar Apr 05 '20 17:04 oltarasenko

Is it too much to ask for an example project, such as https://github.com/oltarasenko/crawly-spider-example, that saves each blog post into its own file?

s0kil avatar Apr 05 '20 17:04 s0kil

@oltarasenko sounds like a good idea. I'll think a bit more about the API and update here. I should have time for it in the coming weeks.

@s0kil I think it would be more appropriate to have a how-to article in the docs. There are inherent issues with having many example repos, such as maintenance and keeping them in sync.

Ziinc avatar Apr 05 '20 18:04 Ziinc

@s0kil could you give some info on how you are working around the downloading of files now?

Ziinc avatar Apr 21 '20 17:04 Ziinc

> shall we create a pipeline capable of auto-downloading images? e.g. `{Crawly.Pipelines.DownloadMedia, field: :image, dest: "/folder_name"}`?

Just a heads-up: I've started working on such a generic pipeline today.
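If that lands, configuration would presumably look something like this in `config.exs`. This is hypothetical: the pipeline name and options are the proposal from this thread, not a released Crawly API; the other two pipelines shown are existing ones.

```elixir
config :crawly,
  pipelines: [
    # Hypothetical pipeline proposed in this thread; not yet part of Crawly.
    {Crawly.Pipelines.DownloadMedia, field: :image, dest: "/tmp/media"},
    # Existing pipelines: encode items as JSON lines and write them to disk.
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
  ]
```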

michaltrzcinka avatar Apr 22 '20 11:04 michaltrzcinka

@Ziinc I could not get it working yet.

s0kil avatar Apr 22 '20 14:04 s0kil

@oltarasenko I will implement a generic supervised task execution process, as mentioned in https://github.com/oltarasenko/crawly/pull/88#issuecomment-626103255, for pipelines to hook into.

Ziinc avatar May 16 '20 14:05 Ziinc