paws icon indicating copy to clipboard operation
paws copied to clipboard

Functionality to copy s3 file directly to local disk without table intermediate

Open LBragg opened this issue 3 years ago • 9 comments

Hi. Thanks for the great package. It's very comprehensive.

I am looking for a function equivalent to boto3's download file (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.download_file) , which can also be done via the AWS cli using 'cp' -- copying a file from an s3 bucket to a local file location.

I may have missed something in the docs, any help would be great. Thanks!

LBragg avatar Apr 03 '21 05:04 LBragg

Thanks for submitting this issue! Unfortunately, we do not currently have a function like that. The only way we currently support downloads is by retrieving the content in binary format and then writing that to a file as shown here: https://github.com/paws-r/paws/blob/main/examples/s3.R#L66 .

I agree that it would be nice to create a function that supports this use case, so we will put this on our to-do list.

adambanker avatar Apr 03 '21 16:04 adambanker

I'm interested in this functionality too, and I'd be happy to help implementing it. Do you have any existing examples of paws arguments that don't correspond to API arguments?

hadley avatar May 19 '21 11:05 hadley

I added the functionality in #408 but it is not available to use yet - but should be soon. That pull request adds the custom download_file function to our sdk generator package - make.paws. The next time we regenerate the paws packages, that function will be added to paws.storage, and therefore available to easily install.

If you would like to use the functionality before then, you can pull the latest version of main, and regenerate the packages using the makefile - make install. However, we were planning on regenerating the packages this week anyway, so it should be available soon without having to do that.

adambanker avatar May 19 '21 12:05 adambanker

#408 doesn't really solve the problem since the data still has to travel through R — an ideal solution would be for libcurl (via curl via httr) to save the data directly to disk. (I have a dim recommendation that this is currently hard/impossible to do correctly in httr, but I'm happy to help resolve that.)

hadley avatar May 19 '21 13:05 hadley

That's a good point. We haven't yet looked into a solution that would bypass R, but that does sound better.

I don't believe we have added an argument to a function that is not an api argument. The closest to that is the wrapper function created in #408. As far as other examples of customizations, there are examples of custom handlers here.

adambanker avatar May 19 '21 14:05 adambanker

Hi all,

I took a stab at a possible solution. The approach I took was to pass httr::write_disk to httr::VERB when a output location is given.

Edit: Updated issue function https://github.com/DyfanJones/paws/blob/0bb0487f34d18bcab8304c4068c08fc30eb0fbf8/paws.common/R/net.R#L97-L140

# Issue an HTTP request.
issue <- function(http_request) {
  method <- http_request$method
  url <- build_url(http_request$url)
  headers <- unlist(http_request$header)
  if (http_request$close) {
    headers["Connection"] <- "close"
  }
  body <- http_request$body
  timeout <- httr::config(connecttimeout = http_request$timeout)
  if (is.null(http_request$timeout)) timeout <- NULL

  if (url == "") {
    stop("no url provided")
  }

  # utilize httr to write to disk
  output <- NULL
  if(!is.null(http_request$output))
    output <- httr::write_disk(http_request$output)

  r <- httr::VERB(
    method,
    url = url,
    config = c(httr::add_headers(.headers=headers), output),
    body = body,
    timeout
  )

  response <- HttpResponse(
    status_code = httr::status_code(r),
    header = httr::headers(r),
    content_length = as.integer(httr::headers(r)$`content-length`),
    body = (if(is.null(http_request$output)) httr::content(r, as = "raw") else raw()) # prevent reading in data when output is not null
  )

  # Decode gzipped response bodies that are not automatically decompressed
  # by httr/curl.
  if (is_compressed(response)) {
    response <- decompress(response)
  }

  return(response)
}

This resulted in modifying s3_download_file to look fairly similar to s3_get_object

https://github.com/DyfanJones/paws/blob/3c84100ed257fe66d6a20c00a54d15136a438f7c/make.paws/R/custom/s3.R#L114-L128

s3_download_file <- function(Bucket, Key, Filename, IfMatch = NULL, IfModifiedSince = NULL, IfNoneMatch = NULL, IfUnmodifiedSince = NULL, Range = NULL, ResponseCacheControl = NULL, ResponseContentDisposition = NULL, ResponseContentEncoding = NULL, ResponseContentLanguage = NULL, ResponseContentType = NULL, ResponseExpires = NULL, VersionId = NULL, SSECustomerAlgorithm = NULL, SSECustomerKey = NULL, SSECustomerKeyMD5 = NULL, RequestPayer = NULL, PartNumber = NULL, ExpectedBucketOwner = NULL) {
  op <- new_operation(
    name = "GetObject",
    http_method = "GET",
    http_path = "/{Bucket}/{Key+}",
    paginator = list()
  )
  input <- .s3$get_object_input(Bucket = Bucket, IfMatch = IfMatch, IfModifiedSince = IfModifiedSince, IfNoneMatch = IfNoneMatch, IfUnmodifiedSince = IfUnmodifiedSince, Key = Key, Range = Range, ResponseCacheControl = ResponseCacheControl, ResponseContentDisposition = ResponseContentDisposition, ResponseContentEncoding = ResponseContentEncoding, ResponseContentLanguage = ResponseContentLanguage, ResponseContentType = ResponseContentType, ResponseExpires = ResponseExpires, VersionId = VersionId, SSECustomerAlgorithm = SSECustomerAlgorithm, SSECustomerKey = SSECustomerKey, SSECustomerKeyMD5 = SSECustomerKeyMD5, RequestPayer = RequestPayer, PartNumber = PartNumber, ExpectedBucketOwner = ExpectedBucketOwner)
  output <- .s3$get_object_output()
  config <- get_config()
  svc <- .s3$service(config)
  request <- new_request(svc, op, input, output, Filename)
  response <- send_request(request)
  return(list())
}

I am not 100% sure if this still returns the content back to R as I don't know the lower level of httr. @hadley would you be able to shed any light on this?

If this is more the right lines I will raise a PR.

DyfanJones avatar Sep 09 '21 14:09 DyfanJones

@DyfanJones I've already forgotten how httr works, but the chances are it'll be much easier with httr2

hadley avatar Sep 09 '21 17:09 hadley

@DyfanJones I've already forgotten how httr works, but the chances are it'll be much easier with httr2

@hadley no worries, from a quick look at httr, it should call curl::curl_fetch_disk (https://github.com/r-lib/httr/blob/80cb625117629105874d58a234f061b458eefbe6/R/write-function.R#L81-L85) when httr::write_disk is pass to httr::VERB.

httr2 looks really promising :) however I will let the package authors (@davidkretch, @adambanker) make the decision to upgrade from httr 😄 .

I believe this method prevents the data going through R and leaves it to libcurl to do the processing. I will check the unit tests and raise a PR 😄

DyfanJones avatar Sep 10 '21 08:09 DyfanJones

PR (#458) has been merged and will be in the next paws release.

DyfanJones avatar Aug 16 '22 16:08 DyfanJones

Closing this ticket due to paws.storage 0.2.0 has been released to the cran.

DyfanJones avatar Feb 08 '23 14:02 DyfanJones