
Vectorized curl_download?

hammer opened this issue 7 years ago • 9 comments

I often have a character vector of URLs that represent data files that I'd like to download to a local directory. I can use purrr::map2 with curl_download pretty easily to grab these files; however, I'm a little sad that I can't just pass vectors to curl_download. I know I should probably use curl_fetch_multi for large vectors, but for the usual case I am getting a few dozen files from a reliable server and don't want to go to the trouble of writing callback handlers for an async API.
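
For reference, that one-by-one pattern looks something like this (a minimal sketch; urls here is a hypothetical vector, and each curl_download call blocks until its file is done):

library(purrr)
urls <- c("https://example.com/a.csv", "https://example.com/b.csv") # hypothetical
dir.create("data", showWarnings = FALSE)
destfiles <- file.path("data", basename(urls))
walk2(urls, destfiles, curl::curl_download) # one sequential, blocking download per call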

hammer avatar Oct 11 '18 20:10 hammer

I have thought about this, but I am not sure how to do proper exception handling that way. What should happen if one of the downloads fails but the others are OK? Should it raise an error and delete all the files? Or just return FALSE for the files that failed?

jeroen avatar Oct 11 '18 20:10 jeroen

It should be possible to design a good API. E.g. for failures you would want to return an error object for that download, not just FALSE.
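
For example, a minimal sketch of that shape (a hypothetical wrapper, not an existing curl API): each element of the result is either the destination path on success or the error object on failure, so nothing gets discarded.

download_all <- function(urls, destfiles) {
  Map(function(url, dest) {
    tryCatch(curl::curl_download(url, dest), error = function(e) e)
  }, urls, destfiles)
}
# callers can then separate successes from failures:
# failed <- vapply(results, inherits, logical(1), what = "error")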

gaborcsardi avatar Oct 11 '18 20:10 gaborcsardi

What about the destfile argument? Should it be a vector of the same length as the input url? Or should the user be able to specify a directory, and curl will automatically guess the filenames and save them to that directory?

jeroen avatar Oct 18 '18 11:10 jeroen

Should it be a vector of the same length as the input url?

I think that's a good start. We can add the directory approach later. That one is tricky, because you need to sanitize the output file names. E.g. /etc/passwd should probably not be allowed.
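
To illustrate, a hedged sketch of one way to sanitize a guessed filename (guess_destfile is hypothetical, not a proposed API): basename() drops all directory components, so a hostile path cannot escape the target directory.

guess_destfile <- function(url, dir) {
  fname <- basename(sub("\\?.*$", "", url)) # strip the query string, keep the last path component
  file.path(dir, fname)
}
guess_destfile("https://example.com/../../etc/passwd", "downloads")
#> "downloads/passwd"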

gaborcsardi avatar Oct 18 '18 11:10 gaborcsardi

And a lot of URLs don't have an obvious filename, e.g. when they are a REST endpoint for some id.

jeroen avatar Oct 18 '18 12:10 jeroen

Yeah, another good reason. I think just a vector of output file names is fine.

gaborcsardi avatar Oct 18 '18 12:10 gaborcsardi

Sorry to reactivate this issue, but I have a question related to this topic (it seems)!

You point out that we can use map and curl_download, but that downloads the files one by one, right? Is there a better way to download several files asynchronously than doing something like this?

test <- c(
  "http://www.prevair.org/donneesmisadispo/public/PREVAIR.analyse.20181205.MAXJ.NO2.public.jpg",
  "http://www.prevair.org/donneesmisadispo/public/PREVAIR.analyse.20181205.MAXJ.NO2.public.nc"
)
walk2(test, basename(test), ~ curl_fetch_multi(.x, done = cb, pool = pool, data = file(.y, open = "wb")))

I'm struggling with the data argument... sorry if it's obvious :s Thanks & happy new year!
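
A minimal callback-only sketch that sidesteps the data argument entirely, writing each body from the done callback instead (it assumes the responses fit in memory, since res$content buffers each whole body as a raw vector):

library(curl)
pool <- new_pool()
invisible(lapply(test, function(u) {
  dest <- basename(u)
  curl_fetch_multi(u,
    done = function(res) writeBin(res$content, dest),
    fail = function(msg) message("failed: ", dest, " (", msg, ")"),
    pool = pool)
}))
multi_run(pool = pool) # blocks until every transfer in the pool has finished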

lemairev avatar Jan 03 '19 18:01 lemairev

@lemairev You could do the parallelism at the process level by creating a cluster and using parLapply/foreach/etc.

clus <- parallel::makeCluster(10) # or whatever number you want
parallel::clusterMap(clus, function(src, dest) {
    curl::curl_fetch_disk(src, dest) # don't forget to check the result
}, srcfiles, destfiles)

This isn't technically asynchronous, but for large numbers of small files it'll still be much faster than downloading sequentially. The number of processes in the cluster isn't so important, since you'll generally be constrained by network bandwidth rather than by memory or CPU. You can kill the cluster afterwards, or keep it around if you know you're going to be using it again.
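
A minimal sketch of that result check and cleanup (reusing the clus, srcfiles, and destfiles from above; curl_fetch_disk() returns a list that includes the HTTP status_code):

results <- parallel::clusterMap(clus, function(src, dest) {
  curl::curl_fetch_disk(src, dest)
}, srcfiles, destfiles)
ok <- vapply(results, function(r) r$status_code == 200, logical(1))
if (!all(ok)) warning("failed downloads: ", paste(destfiles[!ok], collapse = ", "))
parallel::stopCluster(clus) # shut the workers down when finished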

hongooi73 avatar Mar 26 '19 17:03 hongooi73

@Hong-Revo Thank you for the suggestion! I'll try that :) cheers

lemairev avatar Apr 10 '19 09:04 lemairev