paws
Stream file directly to s3 bucket without downloading to local disk first
Thanks for the package. Really impressive work!
Question about paws functionality
I was wondering if it's currently possible to take a URL, for example
https://ftp.ncbi.nlm.nih.gov/blast/demo/ieee_talk.pdf
and stream the file directly to an S3 bucket without ever downloading it to the local disk.
What I've tried so far
As far as I can tell, put_object doesn't check whether Body is a URL, so it cannot find the file if the following is run:
svc$put_object(Body = "https://ftp.ncbi.nlm.nih.gov/blast/demo/ieee_talk.pdf", Bucket = "bucketname", Key = "ieee_talk.pdf")
The next thing I tried was to use a connection object as the Body of put_object, which failed since Body must be a string or raw vector, not a connection.
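For illustration, roughly what that attempt looked like (reconstructed, not my exact code):
svc <- paws::s3()
# hypothetical reconstruction of the failed attempt: a connection object is
# rejected because put_object expects a character string or raw vector as Body
con <- url("https://ftp.ncbi.nlm.nih.gov/blast/demo/ieee_talk.pdf")
svc$put_object(Body = con, Bucket = "bucketname", Key = "ieee_talk.pdf")
close(con)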
Basically, I'm looking for the paws equivalent of the following (which uses the AWS CLI):
curl <url> | aws s3 cp - s3://bucketname/keyname
This feature is particularly useful when the files you want to store on S3 are very large, since you avoid transferring them twice: once down to the local machine and once more up to S3.
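For context, the disk-based two-step workaround I'd like to avoid looks roughly like this (bucket and key names are placeholders):
# download to a temporary file, then read it back into memory and upload to S3
tmp <- tempfile(fileext = ".pdf")
download.file("https://ftp.ncbi.nlm.nih.gov/blast/demo/ieee_talk.pdf", tmp, mode = "wb")
svc$put_object(
Body = readBin(tmp, "raw", n = file.size(tmp)),
Bucket = "bucketname",
Key = "ieee_talk.pdf"
)
unlink(tmp)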
TL;DR: Is it currently possible to copy a remote file straight into S3 from its URL using paws?
Hi @selkamand,
Would something like this solve your issue?
s3 = paws::s3()
stream <- curl::curl_fetch_memory("https://ftp.ncbi.nlm.nih.gov/blast/demo/ieee_talk.pdf")
s3$put_object(Body = stream$content, Bucket = "bucketname", Key = "ieee_talk.pdf")
Thanks @DyfanJones
Use of curl::curl_fetch_memory does avoid saving to disk, but if I understand correctly it requires that the target file fit within the RAM of the local machine.
This is a limitation I would hope to avoid, and something I really should have specified in the original question (my bad!).
Is there an alternative that would support streaming the remote file into the S3 bucket?
If you don't want to do it in one step (i.e. download the file to memory and then upload it), you can do it using the multipart upload method. Something like this:
library(httr2)
library(paws)
Bucket = "your_bucket"
Key = "my_file"
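# environment to track the current part number and the ETag of each uploaded part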
upload_no <- new.env(parent = emptyenv())
upload_no$i <- 1
upload_no$parts <- list()
s3 <- paws::s3()
upload_id <- s3$create_multipart_upload(
Bucket = Bucket, Key = Key
)$UploadId
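# callback: upload one streamed chunk as a multipart part and record its ETag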
s3_upload_part <- function(x){
etag <- s3$upload_part(
Body = x,
Bucket = Bucket,
Key = Key,
PartNumber = upload_no$i,
UploadId = upload_id
)$ETag
upload_no$parts[[upload_no$i]] <- list(ETag = etag, PartNumber = upload_no$i)
upload_no$i <- upload_no$i + 1
return(NULL)
}
# stream the URL and upload each chunk; abort the multipart upload if anything fails
tryCatch(
resp <- request("your url") %>%
req_stream(s3_upload_part, buffer_kb = 5 * 1024),
error = function(e){
s3$abort_multipart_upload(
Bucket = Bucket,
Key = Key,
UploadId = upload_id
)
})
s3$complete_multipart_upload(
Bucket = Bucket,
Key = Key,
UploadId = upload_id,
MultipartUpload = list(Parts = upload_no$parts)
)
Note: the buffer can't be less than 5 MB, because S3 multipart uploads require every part except the last to be at least 5 MB (hence buffer_kb = 5 * 1024).
Note it isn't clean, but that is what you need to do. If you like, you can raise a feature request with s3fs (an R implementation of s3fs based on the R package fs and using paws under the hood). I believe it would make a nice addition as an s3_file_stream_out method.
This functionality has now been added to s3fs, so feel free to use that method or the one above :)
For completeness here is the s3fs code:
remotes::install_github("DyfanJones/s3fs")
library(s3fs)
s3_file_stream_out(
"https://ftp.ncbi.nlm.nih.gov/blast/demo/ieee_talk.pdf",
"s3://mybucket/ieee_talk.pdf"
)
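If you want to double-check that the object landed in the bucket, something like this should work (assuming s3fs picks up the same credentials and the bucket name above):
# returns TRUE if the streamed upload succeeded
s3_file_exists("s3://mybucket/ieee_talk.pdf")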