aws-sdk-rust
Multipart upload: read directly from file
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue, please leave a comment
Prerequisites
- [X] I've searched for previous similar issues or discussions and didn't find any solution.
- [X] I searched the Developer Guide for an answer to my question.
Question Description
I am currently developing a program that will (among other things) upload large files to S3.
I found in the docs how to perform normal uploads directly from files (without loading them into RAM), but I have an issue with multipart uploads, which are required for larger files.
If I understand correctly, `Client.put_object()` will not perform a multipart upload and, unlike the SDKs for other languages, this SDK does not provide a high-level API to do one automatically: you have to do it manually using `Client.create_multipart_upload()`, `Client.upload_part()` and `Client.complete_multipart_upload()`.
The issue is that, while for a normal upload I can just call `.body(ByteStream::from_path("some/path"))`, the same does not seem possible for a multipart upload. Since `.body()` must be called for each part, creating a new `ByteStream` from the path each time won't work (only the beginning of the file would be read), and I did not find a proper way to seek in a `ByteStream` (except reading individual bytes in a loop, which would cause useless I/O). The `.take()` function of the `StreamExt` trait sounds interesting, but its result is not usable as a parameter for `.body()`, and simply passing the same `ByteStream` at each iteration will not work, as the Rust compiler considers (likely for good reasons) that the `ByteStream` is moved when it is passed to `.body()`. It is also not clear to me whether `.content_length()` has any effect on stream reading, so the function might not even know when to stop reading. The only workaround I can imagine is to read each whole part from the file into RAM and use `ByteStream::from()`, but that is definitely not a proper way to upload from a file.
So, did I miss something? Is there a proper way to perform a multipart upload reading data directly from a file, is there an issue with the API, or is this a feature left for a future version?
I am not very experienced with Rust, and I am discovering the AWS SDK as well as the third-party libraries it relies on, so maybe I just don't understand how to do things properly; in that case, I would appreciate some help. Otherwise, I hope this SDK can be improved to allow multipart uploads directly from a file.
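For what it's worth, the seek-and-read workaround described above can be sketched with the standard library alone (`read_part` is a hypothetical helper of mine, not an SDK API); each returned buffer could then be passed to `ByteStream::from()`:

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

/// Read one part of at most `part_size` bytes, starting at byte offset
/// `part_index * part_size`. Returns a shorter (possibly empty) buffer
/// at the end of the file.
fn read_part(path: &str, part_index: u64, part_size: u64) -> std::io::Result<Vec<u8>> {
    let mut file = File::open(path)?;
    file.seek(SeekFrom::Start(part_index * part_size))?;
    let mut buf = Vec::with_capacity(part_size as usize);
    // `take` limits the read to exactly one part's worth of bytes.
    file.take(part_size).read_to_end(&mut buf)?;
    Ok(buf)
}
```

This does hold one full part (at least 5 MB for S3) in RAM at a time, which is exactly the trade-off the question points out.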
Platform/OS/Device
GNU/Linux (amd64)
Language Version
Rust 1.59.0
We haven't implemented any high-level APIs for S3 yet, so as you've discovered, this is manual and tedious right now. I will convert this issue into a feature request so that people can upvote it, and in the meantime, I'll see if we can throw together an example.
Hi @Velfi and @jdisanti, out of curiosity, I started with this trivial example (using only one part); once this basic example works, Rust logic can be wired up that calls `upload_part` for each chunk of bytes read from a large file.
```rust
use std::error::Error;
use std::path::Path;

use aws_sdk_s3::types::ByteStream;
use aws_sdk_s3::Client;

pub async fn upload_object_with_multipart(
    client: &Client,
    bucket_name: &str,
    file_path: &str,
    object_key: &str,
) -> Result<(), Box<dyn Error>> {
    // instantiate a multipart upload
    let create_multipart_upload_response = client
        .create_multipart_upload()
        .bucket(bucket_name)
        .key(object_key)
        .send()
        .await?;

    // grab the upload_id that we'll use when uploading parts
    let upload_id = match create_multipart_upload_response.upload_id() {
        Some(upload_id) => upload_id.to_owned(),
        None => panic!("Something went wrong when creating multipart upload."),
    };

    // upload a file in parts
    // TODO: after this trivial example works, demonstrate this with actual
    // multiple parts, not just one part (single byte stream)
    let body = match ByteStream::from_path(Path::new(file_path)).await {
        Ok(byte_stream) => byte_stream,
        Err(e) => panic!("{}", e),
    };
    let _upload_part_response = client
        .upload_part()
        .bucket(bucket_name)
        .key(object_key)
        .upload_id(&upload_id)
        .part_number(1)
        .body(body)
        .send()
        .await?;

    let list_parts_response = client
        .list_parts()
        .bucket(bucket_name)
        .key(object_key)
        .upload_id(&upload_id)
        .send()
        .await?;
    let parts_list = list_parts_response.parts();

    // complete the multipart upload
    client
        .complete_multipart_upload()
        .bucket(bucket_name)
        .key(object_key)
        .upload_id(upload_id)
        // .body(parts_list) // ? .body() does not exist on
        // fluent_builders::CompleteMultipartUpload, even though a body is
        // required according to the AWS S3 CompleteMultipartUpload documentation
        .send()
        .await?;

    println!("Uploaded file in parts: {}", file_path);
    Ok(())
}
```
Is there a way to set the POST body on the S3 client's `complete_multipart_upload` builder?
If not, is adding one necessary pre-work before an example with the AWS S3 Rust SDK becomes actionable?
Without the ability to set the HTTP POST body on the underlying HTTP call, the AWS S3 API will respond with an error like:
Error: ServiceError { err: CompleteMultipartUploadError { kind: Unhandled(Error { code: Some("InvalidRequest"), message: Some("You must specify at least one part"),
Source: https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html#API_CompleteMultipartUpload_RequestSyntax
If producing an example is workable, I'd be happy to help do so. Please let me know if I'm missing some obvious method that I should be using instead.
P.S. By searching for this error message on Stack Overflow, it is easy to find that it caused some confusion for other language SDKs as well (for example PHP: https://stackoverflow.com/a/31413193/18257455); however, those developers were able to resolve it by passing an array of parts to a `MultipartUpload` parameter available on PHP's `completeMultipartUpload` builder chain.
In this Rust SDK, there does not (at a glance) appear to be a way for an SDK user to construct the `CompletedMultipartUpload` object, which looks to be the equivalent and "should" be settable to ultimately satisfy the AWS S3 API requirement (the `POST` request body for a `CompleteMultipartUpload` action must be set as described in the link above).
If `.multipart_upload(_)` is the method on the `complete_multipart_upload` builder that lets us set the list of parts required to send this HTTP POST request to the AWS S3 API, then the question remains:
How do we construct a `CompletedMultipartUpload` (note the *ed*) struct? (It is marked as non-exhaustive, so we must rely on a builder pattern and cannot use struct literal syntax to instantiate it?)
Thanks!
I believe this is the method you are missing: https://docs.rs/aws-sdk-s3/0.10.1/aws_sdk_s3/client/fluent_builders/struct.CompleteMultipartUpload.html#method.multipart_upload
It will be serialized into the body.
@rcoh Thanks! I took another look at the method and now see how the SDK's fluent builders are structured. While this is a bit verbose, I at least have a working multipart example with one part:
(See https://github.com/awslabs/aws-sdk-rust/issues/494#issuecomment-1108928236 for the section of code that does the upload.)
```rust
// just showing the parts relevant to serializing the HTTP POST body

// get the list of uploaded parts
let parts_list = list_parts_response.parts().unwrap();

// construct a Vec of CompletedParts (required to build CompletedMultipartUpload)
let completed_parts: Vec<aws_sdk_s3::model::CompletedPart> = parts_list
    .iter()
    .map(|part| {
        aws_sdk_s3::model::CompletedPart::builder()
            .e_tag(part.e_tag.as_ref().unwrap())
            .part_number(part.part_number)
            .build()
    })
    .collect();

// construct a CompletedMultipartUpload (required so that the part list is
// correctly serialized into the HTTP POST body)
let completed_multipart_upload = aws_sdk_s3::model::CompletedMultipartUpload::builder()
    .set_parts(Some(completed_parts))
    .build();

// complete the multipart upload
client
    .complete_multipart_upload()
    .bucket(bucket_name)
    .key(object_key)
    .upload_id(upload_id)
    .multipart_upload(completed_multipart_upload)
    .send()
    .await?;
```
S3 client version: 0.10.1
Region: us-east-1
Bucket: test-bucket-for-multipart-1919
Object Key: abcdef123123
File uploaded: ./crates/s3/create_object/tests/testfile.txt
Uploaded file in parts: ./crates/s3/create_object/tests/testfile.txt
Next, this example can be cleaned up to demonstrate the Rust way of reading bytes from a file incrementally and what that could look like. (My trivial multipart example only uses one "part".)
If it's OK, I'd like to take a stab at that tonight or tomorrow.
Nice! Yeah, this looks good so far. Note that the eventual home for this is https://github.com/awsdocs/aws-doc-sdk-examples/tree/main/rust_dev_preview
The examples in this repo are built from there at release/codegen time.
The multi-part will be a bit tricky until we add support for https://github.com/awslabs/aws-sdk-rust/pull/519 — if you were interested in making a PR for that we'd gladly accept it.
@rcoh Thank you for that info! Just out of curiosity, do you know someone who could explain the relationship between `aws_sdk_s3` and `aws_smithy_http`?
Context
Objective: Implement an example of an AWS S3 Rust SDK multipart upload (potentially in a simpler way than what I see in #519?). (Edit: I see now that the file-seeking logic in #519 is what an SDK consumer would need to write if the maintainers' goal is to add this `from_file_chunk` method onto `ByteStream`.)
Actions: I have an example that uses the AWS S3 SDK's `ByteStream` to read from a file and `StreamExt`'s `try_next()` to call S3's `upload_part` operation for each part, and I think things are okay except that the default buffer size is 4096 bytes, while the minimum part size the S3 API accepts is 5 MB.
Significant detail:
When inspecting the source code of various crates, I noticed that this `ByteStream` shows up in two different AWS crates:
- https://docs.rs/aws-sdk-s3/latest/aws_sdk_s3/types/struct.ByteStream.html
- https://docs.rs/aws-smithy-http/0.41.0/aws_smithy_http/byte_stream/struct.ByteStream.html
Why is this significant:
Only one of these two otherwise-identical `ByteStream` structs has a `read_from` method, which reveals an `FsBuilder`: the type that lets me override the `ByteStream`'s default buffer size of 4096 bytes. (I hope to use this to specify a size greater than the 5 MB minimum that the AWS S3 UploadPart API requires.) And the `ByteStream` struct that has this `read_from` method is not the one exposed by `aws_sdk_s3`.
Hopefully that makes sense; I just started looking at this. I know the AWS Rust SDK is only in preview, but I'm excited to see it get used more!
Thanks!
Oh, whoops. You've answered this question in #519: I see that `aws-smithy-http` is the source of truth and that `aws-sdk-s3` is generated from it. I also see that the author of #519 added a `from_file_chunk` method to the `ByteStream` (just in `aws-sdk-s3`).
The `ByteStream` struct is the same; it's just re-exported for convenience. `aws_smithy_http` is the crate that contains the core HTTP abstractions the SDK is built on. SDK crates use it to actually send and receive HTTP requests, and re-export some of its types for convenience.
I have a PR in progress that should make this much easier. It allows for setting a byte offset when creating a `ByteStream` from a file and also includes a test that demonstrates how you could dynamically divide a file into a given number of chunks.
> I have a PR in progress that should make this much easier. It allows for setting a byte offset when creating a `ByteStream` from a file and also includes a test that demonstrates how you could dynamically divide a file into a given number of chunks.
Great! I think this is exactly what is needed.
It'd be nice if the multipart upload took an `AsyncRead` instead of a file.
An `AsyncRead` would work with raw files and would also allow, for example, compressing data from a file as a lazy stream while uploading to S3.