Vectorized curl_download?
I often have a character vector of URLs that represent data files that I'd like to download to a local directory. I can use purrr::map2 with curl_download pretty easily to grab these files; however, I'm a little sad that I can't just pass vectors to curl_download. I know I should probably use curl_fetch_multi for large vectors, but in the usual case I am getting a few dozen files from a reliable server and don't want to go to the trouble of writing callback handlers for an async API.
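For concreteness, that workflow looks something like this (a minimal sketch; the URLs and destination paths here are made-up placeholders):

library(curl)
library(purrr)

urls <- c("https://example.com/data/a.csv", "https://example.com/data/b.csv")  # placeholder URLs
destfiles <- basename(urls)           # one local path per URL
map2(urls, destfiles, curl_download)  # downloads sequentially, one call per file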
I have thought about this, but I am not sure how to do proper exception handling that way. What should happen if one of the downloads fails but the others are OK? Should it raise an error and delete all the files? Or just return FALSE for the files that failed?
It should be possible to design a good API. E.g. for failures you would want to return an error object for that download, not just FALSE.
What about the destfile argument? Should it be a vector the same length as the input url vector? Or should the user be able to specify a directory, and curl will automatically guess the filenames and save them to that directory?
Should it be a vector the same length as the input url vector?
I think that's a good start. We can add the directory approach later. That one is tricky, because you need to sanitize the output file names. E.g. /etc/passwd should probably not be allowed.
And a lot of URLs don't have an obvious filename, e.g. when they are a REST endpoint for some oid.
Yeah, another good reason. I think just a vector of output file names is fine.
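To make the design concrete, a user-land wrapper along these lines (purely hypothetical, not part of the curl package) could take equal-length vectors and return an error object for the downloads that fail instead of aborting:

download_all <- function(urls, destfiles) {
  stopifnot(length(urls) == length(destfiles))
  # hypothetical sketch: sequential downloads, one tryCatch per file,
  # so a single failure is captured as a condition object rather than aborting
  Map(function(url, dest) {
    tryCatch(curl::curl_download(url, dest), error = function(e) e)
  }, urls, destfiles)
}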
Sorry to reactivate this issue, but I have a question related to this topic (it seems)!
You mention that we can use map and curl_download, but that will download the files one by one, right?
Is there a better way to download several files asynchronously than doing something like
test <- c("http://www.prevair.org/donneesmisadispo/public/PREVAIR.analyse.20181205.MAXJ.NO2.public.jpg", "http://www.prevair.org/donneesmisadispo/public/PREVAIR.analyse.20181205.MAXJ.NO2.public.nc") walk2(test, basename(test), ~ curl_fetch_multi(.x, done = cb, pool = pool, data = file(.y, open = "wb")))
I'm struggling with the data args... sorry if it's obvious :s
thanks & happy new year
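One way to sidestep the data argument entirely (a sketch, assuming the files are small enough to buffer in memory and reusing the URLs from the question above): let curl_fetch_multi buffer each response and write it out in the done callback, then drive the pool with multi_run().

library(curl)
library(purrr)

test <- c(
  "http://www.prevair.org/donneesmisadispo/public/PREVAIR.analyse.20181205.MAXJ.NO2.public.jpg",
  "http://www.prevair.org/donneesmisadispo/public/PREVAIR.analyse.20181205.MAXJ.NO2.public.nc"
)

pool <- new_pool()
walk2(test, basename(test), function(url, dest) {
  curl_fetch_multi(
    url,
    done = function(res) writeBin(res$content, dest),           # buffered response written to disk
    fail = function(msg) message("failed: ", url, " - ", msg),  # a failure doesn't stop the others
    pool = pool
  )
})
multi_run(pool = pool)  # blocks until all requests in the pool have completed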
@lemairev You could do the parallelism at the process level by creating a cluster and using parLapply/foreach/etc.
clus <- parallel::makeCluster(10) # or whatever number you want
parallel::clusterMap(clus, function(src, dest) {
  curl::curl_fetch_disk(src, dest) # don't forget to check the result
}, srcfiles, destfiles)
This isn't technically asynchronous, but for large numbers of small files it'll still be much faster than downloading sequentially. The number of processes in the cluster isn't so important since you'll generally be constrained by your network bandwidth more than memory or CPU. You can kill the cluster afterwards, or keep it around if you know you're going to be using it again.
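For completeness, shutting the cluster down when you're finished is just:

parallel::stopCluster(clus)  # release the worker processes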
@Hong-Revo Thank you for the suggestion! I'll try that :) cheers