
Asynchronous API requests

Open wlandau opened this issue 2 years ago • 6 comments

Does paws have a way to send API requests asynchronously, particularly for uploading and downloading to/from S3? I have heard curl has async built in.

wlandau avatar Mar 17 '22 16:03 wlandau

It does not. We did look into it a while back and it is kind of tricky in R because you'd need to let curl run in another process behind something like future, then return a future to the user instead of the result. We never got anything like a working example but I think it's probably technically possible.

davidkretch avatar Mar 25 '22 20:03 davidkretch

One alternative might be to run Paws itself in another process using future.
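A rough sketch of that idea (the bucket and key names below are placeholders, not a tested setup): wrap the paws call in a future so it runs in a background R session and the current session gets control back immediately:

```r
library(future)

plan(multisession)

# The download runs in a background R process; this call returns immediately.
f <- future({
  s3 <- paws::s3()
  s3$download_file(
    Bucket = "my-bucket",   # placeholder bucket
    Key = "myfile.csv",     # placeholder key
    Filename = "myfile.csv"
  )
}, seed = TRUE)

# ... do other work in the meantime ...

# value() blocks until the background download finishes (and re-raises any error):
# result <- value(f)
```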

davidkretch avatar Mar 25 '22 20:03 davidkretch

There is also an open PR for downloading files from S3 directly to disk, which I suppose would help when running it in another process.

davidkretch avatar Mar 25 '22 20:03 davidkretch

Hi all,

I have been thinking about this. I think it is possibly a limitation of the current httr package, as it doesn't call curl's async functions, i.e. multi_add and multi_run. The key function we would want to use is curl::curl_fetch_multi. To get it we could extend the current httr package.
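For reference, curl's multi interface mentioned above can also be used on its own; here is a minimal sketch (the URLs are placeholders) queueing two transfers with multi_add and driving them concurrently with multi_run:

```r
library(curl)

results <- list()

# Queue each request on the default pool; callbacks fire as transfers finish.
for (u in c("https://www.google.com", "https://httpbin.org/get")) {
  curl::multi_add(
    curl::new_handle(url = u),
    done = function(res) results[[res$url]] <<- res$status_code,
    fail = function(msg) warning(msg)
  )
}

# Drive all queued transfers; they download in parallel on one event loop.
curl::multi_run()
```

Note multi_run() still blocks until all handles finish, so the asynchrony here is between the transfers themselves; returning control to the R session before completion is the part that needs something like future or promises.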

First step: extend the httr package to include curl::curl_fetch_multi.

library(httr)

# Create new multi
write_multi_disk <- function(path, overwrite = FALSE) {
  if (!overwrite && file.exists(path)) {
    stop("Path exists and overwrite is FALSE", call. = FALSE)
  }
  # write_function is internal to httr, so it needs the ::: accessor
  httr:::request(output = httr:::write_function("write_multi_disk", path = path, file = NULL))
}

# Add a method that calls curl::curl_fetch_multi
request_fetch <- function(x, url, handle) UseMethod("request_fetch")
request_fetch.write_multi_disk <- function(x, url, handle) {
  con <- file(x$path)
  curl::curl_fetch_multi(
    url, fail = failure, data = con, handle = handle
  )
  tryCatch({
    curl::multi_run()
  }, interrupt = function(cnd) {
    curl::multi_cancel(handle)
  })
  resp <- curl::handle_data(handle)
  resp$content <- httr:::path(x$path)
  resp
}

# TODO: better failure function to align with paws error handling
failure <- function(msg) {
  stop(msg)
}

# Testing new method
r <- httr::VERB(
  "GET",
  url = "https://www.google.com",
  config = write_multi_disk("temp.txt", overwrite = TRUE)
)

httr::headers(r)
httr::status_code(r)
httr::content(r, as = "raw")

The big issue I see with this is the error handling; however, this method could be added/developed alongside the current PR #458.

Let me know your thoughts around this @wlandau @davidkretch 😄

DyfanJones avatar Mar 26 '22 11:03 DyfanJones

@davidkretch, thanks for confirming. I thought that might be the case. @DyfanJones, that's a great point. Seems like async would belong in a package like httr. Looks like async is discussed a bit at https://github.com/r-lib/httr2/issues/1.

wlandau avatar Mar 27 '22 15:03 wlandau

Been thinking about this, and I think we can get async S3 downloads using the promises package, similar to what @davidkretch mentioned here:

> One alternative might be to run Paws itself in another process using future.

Here is a basic example

library(paws)
library(promises)

future::plan(future::multisession)

s3 = paws::s3()

s3_async_download = function(Bucket, Key, Filename, svc) {
  # future_promise() returns immediately; the download runs in a worker process
  p <- future_promise(svc$download_file(
    Bucket = Bucket,
    Key = Key,
    Filename = Filename
  ), seed = TRUE)
  # onRejected receives the error condition from the failed download
  then(p, onRejected = function(err) {
    stop(sprintf("Failed to download s3://%s/%s", Bucket, Key))
  })
}

system.time({
  s3$download_file(
    Bucket = "dummy",
    Key = "myfile.csv",
    Filename = "myfile1.csv"
  )
})
#>    user  system elapsed 
#>   0.873   1.348  33.800

system.time({
  s3_async_download(
    Bucket = "dummy",
    Key = "myfile.csv",
    Filename = "myfile2.csv",
    svc = s3
  )
})
#>    user  system elapsed 
#>   0.063   0.005   0.091

Created on 2022-04-20 by the reprex package (v2.0.1)
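Extending that pattern, several downloads could be launched at once and collected with promises::promise_all(); a sketch (the bucket and keys are placeholders):

```r
library(paws)
library(promises)

future::plan(future::multisession)
s3 = paws::s3()

keys <- c("a.csv", "b.csv", "c.csv")  # placeholder keys

# Each call returns a promise immediately; all downloads run concurrently.
downloads <- lapply(keys, function(key) {
  future_promise(s3$download_file(
    Bucket = "dummy",  # placeholder bucket
    Key = key,
    Filename = key
  ), seed = TRUE)
})

# Resolves once every download finishes; rejects on the first failure.
promise_all(.list = downloads) %...>% {
  message("All downloads complete")
} %...!% {
  message("Download failed: ", conditionMessage(.))
}
```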

Seems to be really promising 😉

DyfanJones avatar Apr 20 '22 15:04 DyfanJones