aws-sdk-java-v2

Add `InputStreamSubscriber.transferTo(OutputStreamPublisher)` optimization

Open fanyang01 opened this issue 1 year ago • 2 comments

Describe the feature

TL;DR: Make InputStreamSubscriber override InputStream#transferTo(OutputStream out) with a specialized implementation that detects whether out is an OutputStreamPublisher and, if so, passes the ByteBuffers to it directly instead of copying them.

Currently, InputStreamSubscriber inherits transferTo(OutputStream out) from java.io.InputStream, which copies data through a single internal buffer that it reuses on every iteration. That works with ordinary synchronous output streams, such as System.out or ByteArrayOutputStream, but not with asynchronous ones such as OutputStreamPublisher, which is used by BlockingOutputStreamAsyncRequestBody. The reason is that OutputStreamPublisher#write(byte[]) does not copy the buffer; it just passes it to the subscriber via ByteBuffer#wrap(byte[]). The buffer may therefore be overwritten by a subsequent InputStream#read(byte[] buffer) before the subscriber has read it, which corrupts the data the subscriber sees.
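The aliasing described above can be reproduced without the SDK. The following is a minimal sketch, not the SDK's code: RetainingSink is a hypothetical stand-in for a publisher-style stream that keeps the chunks it receives for a consumer that reads them later, and the copy loop uses a deliberately tiny reused buffer to mimic what the inherited transferTo does.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class BufferReuseDemo {

    /** Hypothetical sink (not an SDK class): retains chunks for a consumer that reads them later. */
    static final class RetainingSink extends OutputStream {
        final List<ByteBuffer> chunks = new ArrayList<>();
        private final boolean defensiveCopy;

        RetainingSink(boolean defensiveCopy) {
            this.defensiveCopy = defensiveCopy;
        }

        @Override
        public void write(int b) {
            write(new byte[] {(byte) b}, 0, 1);
        }

        @Override
        public void write(byte[] b, int off, int len) {
            if (defensiveCopy) {
                byte[] copy = new byte[len];
                System.arraycopy(b, off, copy, 0, len);
                chunks.add(ByteBuffer.wrap(copy));
            } else {
                // Wraps the caller's array without copying: the chunk aliases the caller's buffer.
                chunks.add(ByteBuffer.wrap(b, off, len));
            }
        }
    }

    /** Copies data through one reused 4-byte buffer, then decodes what the sink retained. */
    static List<String> copyAndCollect(String data, boolean defensiveCopy) throws IOException {
        InputStream in = new ByteArrayInputStream(data.getBytes(StandardCharsets.US_ASCII));
        RetainingSink out = new RetainingSink(defensiveCopy);
        byte[] buffer = new byte[4]; // reused on every iteration, like a generic stream copy
        int n;
        while ((n = in.read(buffer)) >= 0) {
            out.write(buffer, 0, n);
        }
        List<String> seen = new ArrayList<>();
        for (ByteBuffer chunk : out.chunks) {
            seen.add(StandardCharsets.US_ASCII.decode(chunk).toString());
        }
        return seen;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(copyAndCollect("AAAABBBB", false)); // prints [BBBB, BBBB]
        System.out.println(copyAndCollect("AAAABBBB", true));  // prints [AAAA, BBBB]
    }
}
```

Without the defensive copy, both retained chunks are views of the same array, so the first chunk is overwritten by the second read before anything consumes it.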

The simplest way to avoid this issue is something like out.write(in.readAllBytes()), but that buffers the whole stream in memory, which is expensive for a long input stream.
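A possible middle ground, sketched here as an assumption rather than a verified fix, is to copy in chunks and hand the sink a freshly allocated array each time, so no array is reused after it has been written. FreshBufferCopy and its copy method are hypothetical names; InputStream#readNBytes(int) requires Java 11 or later. Whether memory stays bounded depends on how quickly the sink's consumer drains the chunks, which this sketch does not control.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class FreshBufferCopy {

    /**
     * Copies in to out in chunks of at most chunkSize bytes.
     * Every chunk is a newly allocated array that is never touched again after write().
     * Returns the number of bytes copied.
     */
    static long copy(InputStream in, OutputStream out, int chunkSize) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        long total = 0;
        while (true) {
            // readNBytes(int) returns a new array on each call; an empty array means end of stream.
            byte[] chunk = in.readNBytes(chunkSize);
            if (chunk.length == 0) {
                return total;
            }
            out.write(chunk);
            total += chunk.length;
        }
    }
}
```

A call such as FreshBufferCopy.copy(in, out, 64 * 1024) would then take the place of in.transferTo(out), at the cost of one allocation per chunk.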

Use Case

I want to open an S3 object as an InputStream in, read a header part of it, modify something, and then write the modified header plus the remaining unchanged content to a new S3 object using the putObject(BlockingOutputStreamAsyncRequestBody) interface. It would be straightforward to use in.transferTo(out) to copy the unchanged content, but currently I have to use out.write(in.readAllBytes()) instead for correctness.

In other words, one has to be very careful when using BlockingOutputStreamAsyncRequestBody. I have encountered cases in which the data written to S3 was malformed because of this issue. Unfortunately, this behavior is not well documented.

Implementing this feature will make such use cases safer and more performant.

Proposed Solution

No response

Other Information

No response

Acknowledgements

  • [ ] I may be able to implement this feature request
  • [ ] This feature might incur a breaking change

AWS Java SDK version used

2.20.19

JDK version used

OpenJDK 64-Bit Server VM Corretto-17.0.3.6.1

Operating System and version

Darwin Kernel Version 22.5.0

fanyang01 avatar Jun 09 '23 07:06 fanyang01

@fanyang01 thank you for the report. We are considering this a bug.

debora-ito avatar Jul 18 '23 21:07 debora-ito

@fanyang01 I recently ran into the corruption issue you mentioned while trying to use BlockingOutputStreamAsyncRequestBody (which I'm now calling BOSARB). I have a bit of code that reads data from a JDBC ResultSet and writes it directly to an OutputStream in various formats (CSV, JSON, etc.). Trying to do that with BOSARB was an exercise in frustration. I ultimately pivoted to using BISARB and a PipedInputStream/PipedOutputStream pair with a separate thread writing to the output stream.
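The piped-stream arrangement described above might look roughly like the following. This is a sketch of the general pattern only, not the commenter's code: PipedWriterDemo, Producer, and asInputStream are hypothetical names, the 64 KiB pipe size is an arbitrary choice, and the SDK call that would consume the returned InputStream is left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

class PipedWriterDemo {

    /** Hypothetical callback that writes its output to the given stream. */
    interface Producer {
        void writeTo(OutputStream out) throws IOException;
    }

    /** Runs the producer on a separate thread and exposes what it writes as an InputStream. */
    static InputStream asInputStream(Producer producer) throws IOException {
        PipedInputStream in = new PipedInputStream(64 * 1024);
        PipedOutputStream out = new PipedOutputStream(in);
        Thread writer = new Thread(() -> {
            // Closing the output end signals end of stream to the reader.
            try (out) {
                producer.writeTo(out);
            } catch (IOException e) {
                // In this sketch a failure looks like a premature end of stream to the reader;
                // real code should record the exception and surface it to the caller.
            }
        }, "pipe-writer");
        writer.setDaemon(true);
        writer.start();
        return in;
    }
}
```

The pipe copies bytes into its own internal buffer, so the writer can reuse its arrays freely, and the writer blocks when the pipe is full, which keeps memory use bounded by the pipe size.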

raelik avatar Jan 25 '24 06:01 raelik