aws-sdk-java-v2

S3 streaming with unknown content size

Open shorea opened this issue 7 years ago • 8 comments

Unclear what exactly this would mean; some investigation is required. But for uploads I imagine we'd do some buffering of the content: if it's greater than 5 MB we would do a multipart upload with retries of individual parts. For downloads this might be a special input stream returned to the caller that can reconnect on failure by doing a range GET to retrieve the rest of the bytes.

shorea avatar Sep 05 '17 16:09 shorea
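
A minimal sketch of that upload-side idea, not the SDK's actual implementation: buffer up to one part's worth of data, send a plain PutObject if everything fits, and fall back to a manual multipart upload once more arrives. It assumes the synchronous S3Client; the class name, bucket, key, and 5 MB threshold are illustrative.

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CompletedMultipartUpload;
import software.amazon.awssdk.services.s3.model.CompletedPart;

class UnknownLengthUploader {

    private static final int PART_SIZE = 5 * 1024 * 1024; // S3's minimum part size

    static void upload(S3Client s3, String bucket, String key, InputStream content) throws IOException {
        byte[] first = content.readNBytes(PART_SIZE);
        if (first.length < PART_SIZE) {
            // Everything fit in one buffer: a plain PutObject with a known length is enough.
            s3.putObject(b -> b.bucket(bucket).key(key), RequestBody.fromBytes(first));
            return;
        }

        // More data remains: switch to a multipart upload, sending one buffered part at a time.
        // (A real implementation would also abort the multipart upload on failure.)
        String uploadId = s3.createMultipartUpload(b -> b.bucket(bucket).key(key)).uploadId();
        List<CompletedPart> parts = new ArrayList<>();
        byte[] chunk = first;
        for (int partNumber = 1; chunk.length > 0; partNumber++) {
            int pn = partNumber;
            String etag = s3.uploadPart(b -> b.bucket(bucket).key(key)
                                              .uploadId(uploadId).partNumber(pn),
                                        RequestBody.fromBytes(chunk))
                           .eTag();
            parts.add(CompletedPart.builder().partNumber(pn).eTag(etag).build());
            chunk = content.readNBytes(PART_SIZE);
        }
        s3.completeMultipartUpload(b -> b.bucket(bucket).key(key).uploadId(uploadId)
                .multipartUpload(CompletedMultipartUpload.builder().parts(parts).build()));
    }
}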

I'd like to vote for automatic reconnect on download, so you'd have something that looks like a BufferedInputStream. In one mode, each buffer-sized chunk would be a range GET that goes straight into the byte array; that way, if the client is slow to process the stream, it will not get any connection-reset errors. With a good buffer size (maybe even an array of buffers loaded in parallel) it should be super fast while being very stable. If there is a connection read error in the short time it takes to copy the range into the byte buffer, I would want it to use the client's configured retry configuration to retry.

rdifalco avatar Sep 06 '17 17:09 rdifalco
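
And a rough sketch of the download side described above, assuming the synchronous S3Client; RangedS3InputStream is a hypothetical name, not an SDK class. Each buffer-sized chunk is fetched with its own range GET, so a slow consumer never holds a data connection open, and a failed chunk is simply re-requested under the client's normal retry policy.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

/** Hypothetical helper: streams an object chunk-by-chunk via ranged GETs. */
class RangedS3InputStream extends InputStream {

    private final S3Client s3;
    private final String bucket;
    private final String key;
    private final int chunkSize;
    private final long objectSize;

    private long position = 0;
    private InputStream currentChunk = InputStream.nullInputStream();

    RangedS3InputStream(S3Client s3, String bucket, String key, int chunkSize) {
        this.s3 = s3;
        this.bucket = bucket;
        this.key = key;
        this.chunkSize = chunkSize;
        // HEAD the object once so we know when to stop issuing ranges.
        this.objectSize = s3.headObject(b -> b.bucket(bucket).key(key)).contentLength();
    }

    @Override
    public int read() throws IOException {
        int next = currentChunk.read();
        if (next != -1) {
            return next;
        }
        if (position >= objectSize) {
            return -1; // whole object consumed
        }
        // Fetch the next range fully into memory; each ranged GET is an independent,
        // retryable request, so the SDK's retry policy covers transient failures.
        long end = Math.min(position + chunkSize, objectSize) - 1;
        ResponseBytes<GetObjectResponse> bytes =
                s3.getObjectAsBytes(b -> b.bucket(bucket).key(key)
                                          .range("bytes=" + position + "-" + end));
        position = end + 1;
        currentChunk = new ByteArrayInputStream(bytes.asByteArrayUnsafe());
        return currentChunk.read();
    }
}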

Support for retrying with non-resettable InputStreams would be nice. See https://github.com/aws/aws-sdk-java/issues/427

MikeFHay avatar Nov 10 '17 11:11 MikeFHay
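
For context, a minimal sketch of the usual workaround in the meantime, assuming the payload fits in memory: buffer the non-resettable stream yourself so the request body becomes replayable and the SDK's normal retries can apply.

import java.io.IOException;
import java.io.InputStream;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;

class RetrySafeUpload {
    static void put(S3Client s3, String bucket, String key, InputStream nonResettable) throws IOException {
        // Buffering the whole stream makes the request body replayable, so retries work;
        // this obviously only scales to payloads that fit in memory.
        byte[] buffered = nonResettable.readAllBytes();
        s3.putObject(b -> b.bucket(bucket).key(key), RequestBody.fromBytes(buffered));
    }
}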

Additionally, it looks to me like TransferManager doesn't support doing multi-part uploads of data with an unknown size. This makes it pretty useless for big data.

MikeFHay avatar Dec 12 '17 12:12 MikeFHay

Any updates on that issue?

laymain avatar Sep 28 '18 12:09 laymain

@laymain We are focusing on getting SDK 2.0 ready with the low-level clients. Most of the high-level libraries (TransferManager, DynamoDB mapper) won't be available until early 2019.

varunnvs92 avatar Sep 28 '18 15:09 varunnvs92

I understand, thank you for the answer.

laymain avatar Sep 28 '18 16:09 laymain

Tracking in #37

millems avatar Jul 08 '19 23:07 millems

Adding a use case - the code below is based on what's currently available in TransferManager 2.17.210-PREVIEW... there is a fromPublisher method that accepts Publisher<ByteBuffer> - so below is simple demo code plus my environment details.

Another use case, and more of a priority - uploading compressed content (or anything else) via a java.io.OutputStream. I.e., we would create a GZIPOutputStream, write to it using ByteBuffer writes, and when we close the output stream, expect TransferManager to wrap things up regarding parts/Content-Length etc.

Hoping for code like: ... .requestBody(AsyncRequestBody.fromOutputStream(myOutputStream))

Here is a demo of the native error we get (shown in the commented-out output below) - it's caused by there being no Content-Length in the PutObject request.

This works when using AsyncRequestBody.fromXXX with something other than a Publisher (e.g. File or String). So the goal would be to update the content length on close() and upload parts based on the S3ClientConfiguration settings.

/**
 * Java 17
 * s3-transfer-manager-2.17.210-PREVIEW.jar
 * sdk-core-2.17.210.jar
 * reactor-core-3.4.13.jar
 * Native CLIv2 on MacOS monterey:
 *    aws-cli/2.7.7 Python/3.9.13 Darwin/21.5.0 source/x86_64
 */
// Imports below assume the 2.17.210-PREVIEW package layout of the transfer manager.
import java.nio.ByteBuffer;

import reactor.core.publisher.Flux;

import software.amazon.awssdk.core.async.AsyncRequestBody;
import software.amazon.awssdk.crt.Log;
import software.amazon.awssdk.transfer.s3.CompletedUpload;
import software.amazon.awssdk.transfer.s3.S3ClientConfiguration;
import software.amazon.awssdk.transfer.s3.S3TransferManager;
import software.amazon.awssdk.transfer.s3.Upload;
import software.amazon.awssdk.transfer.s3.UploadRequest;
import software.amazon.awssdk.transfer.s3.progress.LoggingTransferListener;

class AWSTransferManagerTest {

    public static void main(String[] args) {
        S3ClientConfiguration s3TransferConfig = S3ClientConfiguration.builder()
                .targetThroughputInGbps(20.0)
                .minimumPartSizeInBytes(1024L)
                .build();
        S3TransferManager transferManager = S3TransferManager.builder()
                .s3ClientConfiguration(s3TransferConfig)
                .build();

        // A publisher of unknown total length - there is no Content-Length to give S3.
        Flux<ByteBuffer> flux = Flux.just("one", "two", "three")
                .map(val -> ByteBuffer.wrap(val.getBytes()));
        //verify flux:
        //flux.subscribe(System.out::println);

        Log.initLoggingToFile(Log.LogLevel.Trace, "s3.logs");
        //output in s3.logs:
        // [INFO] [2022-06-16T15:19:49Z] [00007000056fd000] [S3Client] - id=0x7fab24928d20 Initiating making of meta request
        // [ERROR] [2022-06-16T15:19:49Z] [00007000056fd000] [S3MetaRequest] - Could not create auto-ranged-put meta request; there is no Content-Length header present.
        // [ERROR] [2022-06-16T15:19:49Z] [00007000056fd000] [S3Client] - id=0x7fab24928d20: Could not create new meta request.
        // [WARN] [2022-06-16T15:19:50Z] [0000700005d0f000] [JavaCrtGeneral] - Not all native threads were successfully joined during gentle shutdown.  Memory may be leaked.

        Upload upload =
                transferManager.upload(UploadRequest.builder()
                        .putObjectRequest(b -> b.bucket("bucket").key("tmp/flux.bin"))
                        .requestBody(AsyncRequestBody.fromPublisher(flux))
                        .overrideConfiguration(b -> b.addListener(LoggingTransferListener.create()))
                        .build());
        CompletedUpload completedUpload = upload.completionFuture().join();
    }
}

djchapm avatar Jun 21 '22 20:06 djchapm

Hey, check this out: https://github.com/awslabs/aws-c-s3/pull/285. Using the AWS S3 v2 API (S3AsyncClient.crtBuilder() and S3TransferManager), we may soon be able to achieve this 8-year-old ticket! (8 years due to ticket chaining - all of them asking for a way to stream data to S3.)

@yasminetalby / @zoewangg Can we maybe move up the priority on this considering the action on the CRT Native Client?

Thanks - also, I have some updated code; the snippet above was created with the S3 v2 beta code, and locally I have updated it for v2.20.51.

Regards!

djchapm avatar May 12 '23 21:05 djchapm

Hey @djchapm, yes! We are aware of this change and will bump CRT Java binding version once it's available. There should be no additional change needed on the Java SDK side.

zoewangg avatar May 12 '23 21:05 zoewangg

Looks like https://github.com/awslabs/aws-c-s3/pull/285 is merged - @zoewangg, can you confirm which version of the SDK will contain this change?

djchapm avatar May 23 '23 14:05 djchapm

Hi @djchapm, the change is not available in the Java binding yet, we'll let you know once it's ready.

zoewangg avatar May 25 '23 20:05 zoewangg

@zoewangg The latest release of [email protected] supports the unknown content-length now.

TingDaoK avatar Jun 05 '23 18:06 TingDaoK

Submitted PR to bump CRT version #4065

zoewangg avatar Jun 05 '23 21:06 zoewangg

Hey @zoewangg - it's working great! I haven't done any performance comparisons but will report back. It definitely reduces a ton of code, though. So far I've only tried with a few MBs for relatively short durations; I'll want to keep things open for a few gigs and several hours before we conclude anything.

What do you think about adding the ability to update the metadata just before we trigger the completion of the output stream? I've tried digging into this and obtaining a handle to the metadata map, but it is converted to an unmodifiable map in the UploadRequest, and I'm not sure where in the process the metadata is actually uploaded. If it is at the end, though - do you think this would be a possibility?

Our only alternative right now is to do a copy to update the metadata, which in our experience can take from 1 to 20 minutes.

djchapm avatar Jun 08 '23 16:06 djchapm

I have an implementation that works, but I can't figure out what is taking so long once I mark the stream complete. I have a TransferListener hooked up and there is maybe only one more call to listener.bytesTransferred after I have closed. But then it seems to go into a black hole for about a minute before calling the transferComplete method.

I don't see any way to forcefully tell the Upload/s3Client/TransferManager that we're done and to get it over with. I can't really debug because it's all happening async in native-land.

Any thoughts? I'm only dealing with a 140 MB upload. For our solution/metrics this amounts to about 7,000 "messages" per second, whereas the old solution on AWS SDK 1 - where we manage the multiparts ourselves, keep track of content size, and then update the metadata after the fact using a copy - achieves 224,000 messages per second in the same exact test. So this bottleneck at the end seems like a pretty big issue. :(

djchapm avatar Jun 15 '23 19:06 djchapm

Having some random failures, etc., and some slowness in completing uploads. I uploaded a Java test project and created an issue: https://github.com/awslabs/aws-c-s3/issues/317. @zoewangg, it would be cool if you could check the zip I uploaded - I'm not sure if the issue is in the TransferManager implementation or in aws-c-s3.

djchapm avatar Jun 16 '23 22:06 djchapm

Hi @djchapm, yeah, we will work with the CRT team to investigate this issue.

To answer your previous question (sorry for the delay):

What do you think about adding the ability to update the metadata just before we trigger the completion of the output stream?

Just for my understanding, is there any reason you can't set the metadata in the original request? I think metadata is being sent in CreateMultipartUploadRequest, so it happens before streaming starts.

zoewangg avatar Jun 16 '23 22:06 zoewangg
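
In other words, any metadata has to be present on the PutObjectRequest handed to the transfer manager, because it travels with the initial CreateMultipartUpload call. A small illustration (bucket, key, and metadata values are placeholders):

import java.util.Map;

import software.amazon.awssdk.services.s3.model.PutObjectRequest;

class UploadMetadataExample {
    // Metadata set on the original request is sent with CreateMultipartUpload,
    // i.e. before any part data is streamed, so its values must be known up front.
    static PutObjectRequest requestWithMetadata() {
        return PutObjectRequest.builder()
                .bucket("bucket")                      // illustrative bucket/key
                .key("tmp/flux.bin")
                .metadata(Map.of("record-count", "0")) // cannot be changed later without a copy
                .build();
    }
}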

Yeah - we're attaching metadata/indicators to our files based on what's been processed in the file, for disaster-recovery scenarios in streaming. Those indicators are typically cumulative counts or "last entry" markers for the object, which can't be calculated ahead of time.

djchapm avatar Jun 20 '23 14:06 djchapm

~~Opened a feature request to support using this feature via a java.io.OutputStream implementation by opening access to BlockingOutputStreamAsyncRequestBody: https://github.com/aws/aws-sdk-java-v2/issues/4119~~

My bad... it is open.

djchapm avatar Jun 20 '23 17:06 djchapm
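
For readers arriving later, a minimal sketch of that OutputStream-style usage, assuming a CRT-based S3AsyncClient and an SDK/CRT version that accepts an unknown Content-Length (per the comments above); the bucket and key are placeholders.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;
import java.util.zip.GZIPOutputStream;

import software.amazon.awssdk.core.async.AsyncRequestBody;
import software.amazon.awssdk.core.async.BlockingOutputStreamAsyncRequestBody;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.services.s3.model.PutObjectResponse;

class OutputStreamUploadSketch {
    public static void main(String[] args) throws IOException {
        S3AsyncClient s3 = S3AsyncClient.crtCreate(); // CRT-based client

        // null content length = "unknown"; the CRT client handles the missing Content-Length.
        BlockingOutputStreamAsyncRequestBody body = AsyncRequestBody.forBlockingOutputStream(null);

        CompletableFuture<PutObjectResponse> response =
                s3.putObject(b -> b.bucket("bucket").key("tmp/stream.gz"), body);

        // outputStream() must be used after the request has started; close() signals completion.
        try (GZIPOutputStream gzip = new GZIPOutputStream(body.outputStream())) {
            gzip.write("streamed without a known size".getBytes(StandardCharsets.UTF_8));
        }

        response.join();
        s3.close();
    }
}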

Hey @zoewangg - I can confirm that this is working. I've done several rounds of testing with streaming to an S3OutputStream of up to 17 GB in size. A potential drawback for some people is that the content-size metadata is not attached to the object in S3 - that might be a good thing to add as a feature request.

For anyone looking for example code, you can see a temporary solution I posted on the related ticket: https://github.com/awslabs/aws-c-s3/issues/317

Thanks!

djchapm avatar Jul 06 '23 19:07 djchapm

@djchapm glad that this is working for you.

Marking this to auto-close soon. Please let us know if there are any further questions.

debora-ito avatar Aug 28 '23 19:08 debora-ito

Is this issue supposed to be already fixed in the current AWS Java SDK (2.20.140)? In my tests today I still get the error S3Exception: You must provide the Content-Length HTTP header. (Service: S3, Status Code: 411... when not providing the content length in the PutObjectRequest.

guss77 avatar Sep 02 '23 22:09 guss77

I'm still seeing this error on software.amazon.awssdk:s3:2.20.63

AndreasChristianson avatar Feb 23 '24 16:02 AndreasChristianson