aws-sdk-ruby

S3 put_object should accept a block to facilitate chunked writes

Open ezekg opened this issue 1 year ago • 7 comments

Describe the feature

After using get_object's chunked read, I assumed put_object similarly supported chunked writing:

client.put_object(bucket: blob.bucket, key: blob.key) do |buffer|
  while chunk = blob.read(16 * 256)
    buffer << chunk
  end
end

For reference, get_object supports this:

client.get_object(bucket: blob.bucket, key: blob.key) do |chunk|
  buffer << chunk
end

But this isn't currently supported and results in an empty object, since the block is ignored.

Use Case

I want to write an IO to S3 while maintaining a low memory footprint and being explicit about how much I read for each chunk. I do not want to rely on S3 internals to choose how large my chunks should be.

Proposed Solution

Similarly to get_object, allow put_object to accept a block, yielding the internal request body.

Other Information

No response

Acknowledgements

  • [X] I may be able to implement this feature request
  • [ ] This feature might incur a breaking change

SDK version used

1.113.0

Environment details (OS name and version, etc.)

Linux 5.15.153.1-microsoft-standard-WSL2 #1 SMP Fri Mar 29 23:14:13 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

ezekg avatar Nov 14 '24 22:11 ezekg

Hi, have you looked at https://github.com/aws/aws-sdk-ruby/blob/version-3/gems/aws-sdk-s3/lib/aws-sdk-s3/customizations/object.rb#L385

mullermp avatar Nov 14 '24 23:11 mullermp

This issue has not received a response in 1 week. If you still think there is a problem, please leave a comment to prevent the issue from automatically closing.

github-actions[bot] avatar Nov 25 '24 00:11 github-actions[bot]

> Hi, have you looked at https://github.com/aws/aws-sdk-ruby/blob/version-3/gems/aws-sdk-s3/lib/aws-sdk-s3/customizations/object.rb#L385

Thanks for this. I wasn't aware of that method. I'm curious if we could make put_object with a block delegate to upload_stream?

ezekg avatar Nov 25 '24 00:11 ezekg
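For readers following along, here is a minimal sketch of how upload_stream could cover the chunked-write use case from the original request. It assumes `object` behaves like Aws::S3::Object, whose upload_stream yields a writable stream to its block; the helper name `upload_in_chunks` and the `chunk_size` parameter are hypothetical and not part of the SDK. Note that the read size here only controls how much is pulled from the source per iteration, not how S3 splits the upload.

```ruby
require "stringio"

# Copies `source` into the stream yielded by `object.upload_stream`,
# reading at most `chunk_size` bytes from the source at a time.
# Assumption: `object` behaves like Aws::S3::Object (upload_stream
# yields a writable stream). `upload_in_chunks` is a hypothetical helper.
def upload_in_chunks(object, source, chunk_size: 16 * 256)
  bytes = 0
  object.upload_stream do |write_stream|
    # IO#read(n) returns nil at end of input, which ends the loop.
    while (chunk = source.read(chunk_size))
      write_stream << chunk
      bytes += chunk.bytesize
    end
  end
  bytes
end

# With the real SDK (needs the aws-sdk-s3 gem and credentials; the
# bucket, key, and file names below are placeholders):
#   object = Aws::S3::Resource.new.bucket("my-bucket").object("my-key")
#   File.open("big.bin", "rb") { |f| upload_in_chunks(object, f) }
```

The helper returns the number of bytes handed to the write stream, which can be handy for logging or sanity checks.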

I'm not sure if that's possible without a breaking change within the major version. The block is already reserved to be a response target here: https://github.com/aws/aws-sdk-ruby/blob/version-3/gems/aws-sdk-core/lib/seahorse/client/request.rb#L70. There would be no way to tell whether a block is for reading or writing, and the API would be inconsistent.

mullermp avatar Nov 25 '24 18:11 mullermp

I believe you can also pass an IO as the body for put_object and it will be read. I'll leave this as an open feature request but I think the interface would have to be different.

mullermp avatar Nov 25 '24 18:11 mullermp
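The suggestion of passing an IO as the body might look like the sketch below. It assumes `client` behaves like Aws::S3::Client, whose put_object accepts an IO-like object for :body; the helper name `put_file` and its `path:` parameter are hypothetical.

```ruby
# Uploads the file at `path` by handing the open File to put_object,
# so the caller never loads the whole file into a String.
# Assumption: `client` behaves like Aws::S3::Client (put_object accepts
# an IO-like :body). `put_file` is a hypothetical helper.
def put_file(client, bucket:, key:, path:)
  File.open(path, "rb") do |file|
    client.put_object(bucket: bucket, key: key, body: file)
  end
end

# With the real SDK (needs the aws-sdk-s3 gem and credentials; the
# names below are placeholders):
#   put_file(Aws::S3::Client.new, bucket: "my-bucket", key: "my-key", path: "big.bin")
```

The file is closed when the block exits, so the upload has to finish inside the block, which is the case for a synchronous put_object call.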

> I'm not sure if that's possible without a breaking change within the major version. The block is already reserved to be a response target here: https://github.com/aws/aws-sdk-ruby/blob/version-3/gems/aws-sdk-core/lib/seahorse/client/request.rb#L70. There would be no way to tell whether a block is for reading or writing, and the API would be inconsistent.

The put_object method does not currently take a block or pass it along to send_request, so I don't think introducing a block that is used for streaming writes would be a breaking change. I do understand that the internals of put_object would need to be refactored, but I don't see any apparent breaking changes for the public put_object API.

I am currently passing an IO body that streams the data as required (well, as much as I can from the outside). I just thought the block interface would be a nicer and clearer DX, since it'd align well with assumptions from using get_object.

ezekg avatar Nov 25 '24 19:11 ezekg

This could be done by checking the streaming input modeling on the operation. However, this could make for an inconsistent API, where some operations use the block for streaming requests and others for responses. Additionally, writing from the block would be very complex: Net::HTTP body writing would have to yield to the block, and I believe that would be inefficient. Our current build request would need to differentiate block types. Currently the IO body is passed to Net::HTTP's body stream, which uses IO.copy_stream (written in C), so the stream is already read in chunks. I can leave this open as a feature request to consider.

mullermp avatar Nov 25 '24 19:11 mullermp
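To illustrate the chunked reading described above, and how a caller could cap the read size from the outside, here is a small sketch using only the Ruby standard library. `CappedReader` is a hypothetical wrapper, not an SDK class. Whether the SDK accepts such a wrapper as a put_object body is an assumption that has not been verified here; a real body may also need methods such as size or rewind.

```ruby
require "stringio"

# Wraps an IO so that each read returns at most `max_chunk` bytes,
# no matter how many bytes the consumer asks for.
# `CappedReader` is a hypothetical helper for illustration only.
class CappedReader
  attr_reader :read_sizes

  def initialize(io, max_chunk)
    @io = io
    @max_chunk = max_chunk
    @read_sizes = []
  end

  # For sources that are not real IO objects and lack readpartial,
  # IO.copy_stream calls read(length, outbuf), writes the contents of
  # outbuf to the destination, and stops when read returns nil.
  def read(length = nil, outbuf = +"")
    # A bare read (no length) keeps its usual "read everything" meaning.
    return @io.read(nil, outbuf) if length.nil?

    result = @io.read([length, @max_chunk].min, outbuf)
    @read_sizes << result.bytesize if result
    result
  end
end

source = CappedReader.new(StringIO.new("a" * 10_000), 4096)
sink = StringIO.new
IO.copy_stream(source, sink)
# source.read_sizes now lists the size of each chunk, none above 4096.
```

The consumer still decides how often to call read; the wrapper only puts an upper bound on how much each call returns.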