aws-sdk-go-v2 icon indicating copy to clipboard operation
aws-sdk-go-v2 copied to clipboard

Upload manager: When using `io.Reader` without seeking support, the uploader over allocates memory leading to high usage

Open erezrokah opened this issue 1 year ago • 4 comments

Acknowledgements

  • [X] I have searched (https://github.com/aws/aws-sdk/issues?q=is%3Aissue) for past instances of this issue
  • [X] I have verified all of my SDK modules are up-to-date (you can perform a bulk update with go get -u github.com/aws/aws-sdk-go-v2/...)

Describe the bug

This is basically the same as https://github.com/aws/aws-sdk-go-v2/issues/1302, but I think there's a possible optimization to be done to improve memory usage.

When uploading small files that don't require multi-part uploads and using a reader that doesn't support seeking, the uploader code always allocates at least a 5MB buffer (named partSize in the code) to read data from the buffer. For example if you upload a 10KB file, 5MB will get allocated. If you upload 200 10KB files in parallel, ~1GB of memory will be allocated, instead of ~2MB that's needed.

The 5MB limitation comes from the minimal upload size for multipart files, but I don't see a reason to couple it with the buffer that's used for reading from the stream.

Expected Behavior

Only allocate the memory needed for uploading the file if it's under 5MB

Current Behavior

Memory consumption is unnecessarily high

Reproduction Steps

https://github.com/aws/aws-sdk-go-v2/issues/1302 has those in detail

Possible Solution

Solution 1. Use a LimitReader to read up to partSize for the first read instead of using the pool. If it's below 5MB don't use the pool at all, if it's above revert to old behavior, see proposal here

Solution 2. Drop the pool altogether and let GC manage memory. Always use LimitReader to read up to partSize and let it deal with allocations. The thing is with the pool is that it's not cleared until the upload finishes, so let's say you have concurrency 5 and part size 100MB, that memory will not be cleared until the end of the upload (assuming you have a file bigger than 500MB). Let's say one of the last parts of the upload is slow for some reason, the memory for the already uploaded parts will still be held. The GC should be smart enough (I hope) to re-use buffers and clear them if another part of the process needs them more.

Additional Information/Context

Our current workaround is to do:

// Non seeker reader
allData, err := io.ReadAll(r)
if err != nil {
	return err
}
// Pass this to uploader
readerSeeker := bytes.NewReader(allData)

But that has the downside of allocating a large byte array for large files

AWS Go SDK V2 Module Versions Used

github.com/aws/aws-sdk-go-v2/feature/s3/manager v1.16.24

Compiler and Version used

go version go1.22.2 darwin/arm64

Operating System and version

MacOS 14.5 (23F79)

erezrokah avatar Jun 24 '24 15:06 erezrokah

Filing as a feature request, will likely be OBE as we move towards a redo of the transfer manager #2649.

lucix-aws avatar Jun 24 '24 16:06 lucix-aws

Almost a year has passed, any news here @lucix-aws ?

candiduslynx avatar Jul 07 '25 17:07 candiduslynx

We're actively working on the v2 transfer manager, you can see some of that work in recent PRs. We plan to do performance testing before we GA that module and make sure this is addressed.

lucix-aws avatar Jul 08 '25 16:07 lucix-aws

@lucix-aws while you're addressing the issue in new & shiny thing ™️ , could the team spend some time reviewing this (link from the original issue) and fixing it for the current users? While it was OK to wait several months after the issue was acknowleged, this is getting ridiculous.

candiduslynx avatar Jul 08 '25 19:07 candiduslynx