
Stream file directly to s3 bucket without downloading to local disk first

Open · selkamand opened this issue 1 year ago • 5 comments

Thanks for the package. Really impressive work!

Question about paws functionality

I was wondering if it's currently possible to take a URL, for example https://ftp.ncbi.nlm.nih.gov/blast/demo/ieee_talk.pdf, and stream the file directly to an S3 bucket without ever downloading it to the local disk.

What I've tried so far

As far as I can tell, put_object doesn't check whether Body is a URL, and so it cannot find the file if the following is run:

svc$put_object(Body = "https://ftp.ncbi.nlm.nih.gov/blast/demo/ieee_talk.pdf", Bucket = "bucketname", Key = "ieee_talk.pdf")

The next thing I tried was to use a connection object as the Body of put_object, which failed since Body must be a string rather than a connection.
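
Roughly what that attempt looked like (a sketch from memory, not the exact code that was run):

con <- url("https://ftp.ncbi.nlm.nih.gov/blast/demo/ieee_talk.pdf", open = "rb")
svc$put_object(Body = con, Bucket = "bucketname", Key = "ieee_talk.pdf")  # fails: Body can't be a connection
close(con)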

Basically, I'm looking for the paws equivalent of the following (using the AWS CLI):

curl <url> | aws s3 cp - s3://bucketname/keyname

This feature is particularly useful when the files you want to store on S3 are very large, since you avoid transferring each file twice: once to download it locally and once more to upload it to S3.

TL;DR: Is it currently possible to copy a remote file, using its URL, straight into S3 with paws?

selkamand · Aug 08 '22 10:08

Hi @selkamand,

Would something like this solve your issue?

# Fetch the file into memory (as a raw vector) and upload it as the object body.
s3 <- paws::s3()
stream <- curl::curl_fetch_memory("https://ftp.ncbi.nlm.nih.gov/blast/demo/ieee_talk.pdf")
s3$put_object(Body = stream$content, Bucket = "bucketname", Key = "ieee_talk.pdf")

DyfanJones · Aug 08 '22 17:08

Thanks @DyfanJones

Use of curl::curl_fetch_memory does avoid saving to disk, but if I understand correctly it requires that the target file fit within the RAM of the local machine.

This is a limitation I would hope to avoid; something I really should have specified in the original question (my bad!).

Is there an alternative that would support streaming the remote file into the S3 bucket?

selkamand · Aug 10 '22 05:08

If you don't want to do it in one step (i.e. download the file to memory and then upload it), you can use the multipart upload method instead. Something like this:

library(httr2)
library(paws)

Bucket <- "your_bucket"
Key <- "my_file"

# Keep track of the part number and the ETag of every uploaded part.
upload_no <- new.env(parent = emptyenv())
upload_no$i <- 1
upload_no$parts <- list()

s3 <- paws::s3()

# Start the multipart upload and keep hold of its UploadId.
upload_id <- s3$create_multipart_upload(
  Bucket = Bucket, Key = Key
)$UploadId

# Callback: upload each streamed chunk as one part and record its ETag.
s3_upload_part <- function(x) {
  etag <- s3$upload_part(
    Body = x,
    Bucket = Bucket,
    Key = Key,
    PartNumber = upload_no$i,
    UploadId = upload_id
  )$ETag
  upload_no$parts[[upload_no$i]] <- list(ETag = etag, PartNumber = upload_no$i)
  upload_no$i <- upload_no$i + 1
  return(TRUE)  # keep streaming; some httr2 versions stop if the callback doesn't return TRUE
}

tryCatch(
  request("your url") |>
    req_stream(s3_upload_part, buffer_kb = 5 * 1024),
  error = function(e) {
    # Abort the incomplete upload if streaming fails, then re-raise the error.
    s3$abort_multipart_upload(
      Bucket = Bucket,
      Key = Key,
      UploadId = upload_id
    )
    stop(e)
  }
)

# Stitch the uploaded parts together into the final object.
s3$complete_multipart_upload(
  Bucket = Bucket,
  Key = Key,
  UploadId = upload_id,
  MultipartUpload = list(Parts = upload_no$parts)
)

Note: each part of a multipart upload (except the last) must be at least 5 MB, so the buffer can't be smaller than 5 MB (5 * 1024 KB).
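
If you want to double-check the result afterwards (an optional sketch, reusing the Bucket and Key variables from above), head_object reports the size of the finished object:

# Optional: confirm the completed object exists and check its size.
obj <- s3$head_object(Bucket = Bucket, Key = Key)
obj$ContentLength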

DyfanJones · Aug 10 '22 11:08

Note that it isn't clean, but that is what you need to do. If you like, you can raise a feature request with s3fs (an R implementation of s3fs, modelled on the R package fs and using paws under the hood). I believe it would make a nice addition to the s3_file_stream_out method.

DyfanJones · Aug 10 '22 11:08

This functionality has been added to s3fs. So feel free to use that method or the method above :)

For completeness, here is the s3fs code:

remotes::install_github("DyfanJones/s3fs")

library(s3fs)

s3_file_stream_out(
  "https://ftp.ncbi.nlm.nih.gov/blast/demo/ieee_talk.pdf",
  "s3://mybucket/ieee_talk.pdf"
)
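
If s3fs follows the fs-style naming it is modelled on (an assumption here, not checked against the package docs), you can then verify the upload with something like:

# Hedged check: confirm the object now exists in the bucket.
s3_file_exists("s3://mybucket/ieee_talk.pdf")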

DyfanJones · Aug 10 '22 14:08