aws-sdk-rust
Multipart upload: read directly from file
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue, please leave a comment
Prerequisites
- [X] I've searched for previous similar issues or discussions and didn't find any solution.
- [X] I searched the Developer Guide for an answer to my question.
Question Description
I am currently developing a program that will (among other things) upload large files to S3.
I found in the docs how to perform normal uploads directly from files (without loading them into RAM), but I have an issue with multipart uploads, which are required for larger files.
If I understand correctly, `Client.put_object()` will not perform a multipart upload and, unlike the SDKs for other languages, this SDK does not provide a high-level API to do one automatically: you have to do it manually using `Client.create_multipart_upload()`, `Client.upload_part()` and `Client.complete_multipart_upload()`.
The issue is that, while for a normal upload I can just call `.body(ByteStream::from_path("some/path"))`, the same does not seem possible for a multipart upload. Since `.body()` must be called for each part, creating a new `ByteStream` from the path each time won't work (only the beginning of the file would be read), and I did not find a proper way to seek in a `ByteStream` (except reading individual bytes in a loop, which would cause useless I/O). The `.take()` function of the `StreamExt` trait sounds interesting, but its result is not usable as a parameter for `.body()`, and simply passing the same `ByteStream` at each iteration will not work, as the Rust compiler considers (likely for good reasons) that the `ByteStream` is moved when it is passed to `.body()`. It is also not clear to me whether `.content_length()` has any effect on stream reading, so the function might not even know when to stop reading. The only workaround I can imagine is to read each whole part from the file into RAM and use `ByteStream::from()`, but that is definitely not a proper way to upload from a file.
So, did I miss something? Is there a proper way to perform a multipart upload reading data directly from a file, is there an issue with the API, or is this a feature left for a future version?
I am not very experienced with Rust, and I am discovering the AWS SDK as well as the third-party libraries it relies on, so maybe I just don't understand how to do things properly; in that case, I would appreciate some help. Otherwise, I hope this SDK can be improved to allow multipart uploads directly from a file.
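For what it's worth, the seek-and-read workaround described above can be sketched with the standard library alone (`read_part` is a hypothetical helper of mine, not an SDK API); each returned buffer could then be passed to `ByteStream::from()`:

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

/// Read one part of at most `part_size` bytes, starting at byte offset
/// `part_index * part_size`. Returns a shorter (possibly empty) buffer
/// at the end of the file.
fn read_part(path: &str, part_index: u64, part_size: u64) -> std::io::Result<Vec<u8>> {
    let mut file = File::open(path)?;
    file.seek(SeekFrom::Start(part_index * part_size))?;
    let mut buf = Vec::with_capacity(part_size as usize);
    // `take` limits the read to exactly one part's worth of bytes.
    file.take(part_size).read_to_end(&mut buf)?;
    Ok(buf)
}
```

This does hold one full part (at least 5 MB for S3) in RAM at a time, which is exactly the trade-off the question points out.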
Platform/OS/Device
GNU/Linux (amd64)
Language Version
Rust 1.59.0
We haven't implemented any high-level APIs for S3 yet, so as you've discovered, this is manual and tedious right now. I will convert this issue into a feature request so that people can upvote it, and in the meantime, I'll see if we can throw together an example.
Hi @Velfi and @jdisanti, out of curiosity, I started with this trivial example (using only one part); once this basic example works, Rust logic can be wired up that calls `upload_part` for each chunk of bytes read from a large file.
```rust
use std::error::Error;
use std::path::Path;

use aws_sdk_s3::types::ByteStream;
use aws_sdk_s3::Client;

pub async fn upload_object_with_multipart(
    client: &Client,
    bucket_name: &str,
    file_path: &str,
    object_key: &str,
) -> Result<(), Box<dyn Error>> {
    // instantiate a multipart upload
    let create_multipart_upload_response = client
        .create_multipart_upload()
        .bucket(bucket_name)
        .key(object_key)
        .send()
        .await?;

    // grab the upload_id that we'll use when uploading parts
    let upload_id = match create_multipart_upload_response.upload_id() {
        Some(upload_id) => upload_id.to_owned(),
        None => panic!("Something went wrong when creating multipart upload."),
    };

    // upload a file in parts
    // TODO: after this trivial example works, demonstrate this with actual
    // multiple parts, not just one part (single byte stream)
    let body = match ByteStream::from_path(Path::new(file_path)).await {
        Ok(byte_stream) => byte_stream,
        Err(e) => panic!("{}", e),
    };
    let _upload_part_response = client
        .upload_part()
        .bucket(bucket_name)
        .key(object_key)
        .upload_id(&upload_id)
        .part_number(1)
        .body(body)
        .send()
        .await?;

    let list_parts_response = client
        .list_parts()
        .bucket(bucket_name)
        .key(object_key)
        .upload_id(&upload_id)
        .send()
        .await?;
    let parts_list = list_parts_response.parts();

    // complete the multipart upload
    client
        .complete_multipart_upload()
        .bucket(bucket_name)
        .key(object_key)
        .upload_id(upload_id)
        // .body(parts_list) // ? .body() does not exist on
        // fluent_builders::CompleteMultipartUpload, even though a body is
        // required according to the AWS S3 CompleteMultipartUpload documentation
        .send()
        .await?;

    println!("Uploaded file in parts: {}", file_path);
    Ok(())
}
```
Is there a way to set the POST body on the S3 client's `complete_multipart_upload` builder?
If not, is adding one necessary pre-work before an example with the AWS S3 Rust SDK becomes actionable?
Without the ability to set the HTTP POST body on the underlying HTTP call, the AWS S3 API will respond with an error like:
Error: ServiceError { err: CompleteMultipartUploadError { kind: Unhandled(Error { code: Some("InvalidRequest"), message: Some("You must specify at least one part"),
Source: https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html#API_CompleteMultipartUpload_RequestSyntax
If producing an example is workable, I'd be happy to help do so. Please let me know if I'm missing some obvious method that I should be using instead.
P.S. By searching for this error message on Stack Overflow, it is easy to find that it caused some confusion for other language SDKs as well (for example PHP: https://stackoverflow.com/a/31413193/18257455); however, those developers were able to resolve it by passing an array of parts to a `MultipartUpload` parameter available on PHP's `completeMultipartUpload` builder chain.
In this Rust SDK, there does not (at a glance) appear to be a way for an SDK user to construct the `CompletedMultipartUpload` object, which looks to be the equivalent and "should" be settable to ultimately satisfy the AWS S3 API requirement (the `POST` request body for a `CompleteMultipartUpload` action must be set as described in the link above).
If `.multipart_upload(_)` is the method on the `complete_multipart_upload` builder that lets us set the list of parts required to send this HTTP POST request to the AWS S3 API, then the question remains:
How do we construct a `CompletedMultipartUpload` (note the *ed*) struct? (It is marked as non-exhaustive, so we must rely on a builder pattern and cannot use struct literal syntax to instantiate it?)
Thanks!
I believe this is the method you are missing: https://docs.rs/aws-sdk-s3/0.10.1/aws_sdk_s3/client/fluent_builders/struct.CompleteMultipartUpload.html#method.multipart_upload
It will be serialized into the body.
@rcoh Thanks! I took another look at the method and now see how the SDK's fluent builders are structured. While this is a bit verbose, I at least have a working multipart example with one part:
(See https://github.com/awslabs/aws-sdk-rust/issues/494#issuecomment-1108928236 for the section of code that does the upload.)
```rust
// just showing the parts relevant to serializing the HTTP POST body

// get the list of uploaded parts
let parts_list = list_parts_response.parts().unwrap();

// construct a Vec of CompletedParts (required to build CompletedMultipartUpload)
let completed_parts: Vec<aws_sdk_s3::model::CompletedPart> = parts_list
    .iter()
    .map(|part| {
        aws_sdk_s3::model::CompletedPart::builder()
            .e_tag(part.e_tag.as_ref().unwrap())
            .part_number(part.part_number)
            .build()
    })
    .collect();

// construct a CompletedMultipartUpload (required so that the part list is
// correctly serialized into the HTTP POST body)
let completed_multipart_upload = aws_sdk_s3::model::CompletedMultipartUpload::builder()
    .set_parts(Some(completed_parts))
    .build();

// complete the multipart upload
client
    .complete_multipart_upload()
    .bucket(bucket_name)
    .key(object_key)
    .upload_id(upload_id)
    .multipart_upload(completed_multipart_upload)
    .send()
    .await?;
```
S3 client version: 0.10.1
Region: us-east-1
Bucket: test-bucket-for-multipart-1919
Object Key: abcdef123123
File uploaded: ./crates/s3/create_object/tests/testfile.txt
Uploaded file in parts: ./crates/s3/create_object/tests/testfile.txt
Next, this example can be cleaned up to demonstrate the Rust way of reading bytes from a file incrementally and what that could look like. (My trivial multipart example only uses one "part".)
If it's OK, I'd like to take a stab at that tonight or tomorrow.
Nice! Yeah, this looks good so far. Note that the eventual home for this is https://github.com/awsdocs/aws-doc-sdk-examples/tree/main/rust_dev_preview
The examples in this repo are built from there at release/codegen time.
The multi-part will be a bit tricky until we add support for https://github.com/awslabs/aws-sdk-rust/pull/519 — if you were interested in making a PR for that we'd gladly accept it.
@rcoh Thank you for that info! Just out of curiosity, do you know someone who could explain the relationship between `aws_sdk_s3` and `aws_smithy_http`?
Context
Objective: Implement an example of an AWS S3 Rust SDK multipart upload (potentially in a simpler way than what I see in #519?). (Edit: I see now that the file-seeking logic in #519 is what an SDK consumer would need to write if the maintainers' goal is to add this `from_file_chunk` method onto `ByteStream`.)
Actions: I have an example that uses the AWS S3 SDK's `ByteStream` to read from a file and `StreamExt`'s `try_next()` to call S3's `upload_part` operation for each part, and I think things are okay except that the default buffer size is 4096 bytes, while the minimum part size the S3 API accepts is 5 MB.
Significant detail:
When inspecting the source code of various crates, I noticed that this `ByteStream` shows up in two different AWS crates:
- https://docs.rs/aws-sdk-s3/latest/aws_sdk_s3/types/struct.ByteStream.html
- https://docs.rs/aws-smithy-http/0.41.0/aws_smithy_http/byte_stream/struct.ByteStream.html
Why is this significant:
Only one of these two otherwise-identical `ByteStream` structs has a `read_from` method, which reveals an `FsBuilder`: the type that lets me override the `ByteStream`'s default buffer size of 4096 bytes. (I hope to use this to specify a size greater than the 5 MB minimum that the AWS S3 UploadPart API requires.) And the `ByteStream` struct that has this `read_from` method is not the one exposed by `aws_sdk_s3`.
Hopefully that makes sense; I just started looking at this. I know the AWS Rust SDK is only in preview, but I'm excited to see it get used more!
Thanks!
Oh, whoops. You've answered this question in #519: I see that `aws-smithy-http` is the source of truth and that `aws-sdk-s3` is generated from it. I also see that the author of #519 added a `from_file_chunk` method to the `ByteStream` (just in `aws-sdk-s3`).
The `ByteStream` struct is the same; it's just re-exported for convenience. `aws_smithy_http` is the crate that contains the core HTTP abstractions the SDK is built on. SDK crates use it to actually send and receive HTTP requests, and re-export some of its types for convenience.
I have a PR in progress that should make this much easier. It allows for setting a byte offset when creating a `ByteStream` from a file and also includes a test that demonstrates how you could dynamically divide a file into a given number of chunks.
> I have a PR in progress that should make this much easier. It allows for setting a byte offset when creating a `ByteStream` from a file and also includes a test that demonstrates how you could dynamically divide a file into a given number of chunks.
Great! I think this is exactly what is needed.
It'd be nice if the multipart upload took an `AsyncRead` instead of a file.
An `AsyncRead` would work with raw files and would also allow, for example, compressing data from a file as a lazy stream while uploading to S3.