storage: refactor UploadWriter and implements part size inflation

Open kennytm opened this issue 4 years ago • 3 comments

What problem does this PR solve?

Dumpling was previously found to fail when uploading files larger than 50 GB to S3. We use multipart upload with a fixed part size of 5 MB, but AWS S3 allows at most 10,000 parts per upload, so the limit is 5 MB × 10,000 = 50 GB. Any data beyond that fails with "Part number must be an integer between 1 and 10000, inclusive".

What is changed and how it works?

Here we implement "part size inflation", which exponentially increases the size of each part as more data is written. Every part is 0.0654% larger than the previous one (configurable). For small files, the part size stays very close to the optimal size of 5 MB. Later parts gradually grow, and the exponential increase ensures that by the 10,000th part the size has inflated to about 688 × 5 MB, so we can handle a total file size of up to 5 TB, the maximum object size allowed by S3.
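The arithmetic behind the inflation rate can be checked with a minimal sketch. This is not the code in this PR: the names `partSize`, `totalCapacity`, `basePartSize` and `inflationRate` are hypothetical, and the base size is assumed to be 5 MiB rather than 5 MB.

```go
package main

import (
	"fmt"
	"math"
)

const (
	// Assumption: the base part size is 5 MiB, the S3 minimum for non-final parts.
	basePartSize = 5 * 1024 * 1024
	// S3 accepts part numbers 1..10000.
	maxParts = 10000
	// Each part is 0.0654% larger than the previous one.
	inflationRate = 1.000654
)

// partSize returns the target size in bytes of the n-th part (1-based),
// i.e. basePartSize * inflationRate^(n-1), truncated to whole bytes.
func partSize(n int) int64 {
	return int64(float64(basePartSize) * math.Pow(inflationRate, float64(n-1)))
}

// totalCapacity returns the number of bytes that fit in the first `parts` parts.
func totalCapacity(parts int) int64 {
	var total int64
	for n := 1; n <= parts; n++ {
		total += partSize(n)
	}
	return total
}

func main() {
	fmt.Printf("part 1: %d bytes\n", partSize(1))
	fmt.Printf("part %d: %.1f MiB\n", maxParts, float64(partSize(maxParts))/(1<<20))
	fmt.Printf("capacity of %d parts: %.2f TiB\n", maxParts, float64(totalCapacity(maxParts))/(1<<40))
}
```

Under these assumptions the first part is exactly the base size, and the sum over all 10,000 parts comes out slightly above 5 TiB, which is why a rate near 0.0654% is enough to reach the S3 object size limit.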

In this PR we also refactored the UploadWriter so that the part size can be accurately controlled:

- The functionality of noCompressionBuffer is merged entirely into simpleCompressBuffer by using a no-op compress writer.
- uploadChunk is now triggered by the size of the compressed buffer rather than the size of the input data, so every part uploaded to S3 is exactly 5 MiB (this also reduces the number of parts).
- The options to NewUploadWriter are collected into a struct, since the argument list would otherwise grow too long.

Check List

Tests

Unit test

Code changes

Has exported function/method change
- NewUploadWriter's signature is entirely changed.

Side effects

Possible performance regression
- Part size inflation means that, towards the end (around n=4500), we will be trying to upload hundreds of megabytes to S3 as a single part. This is prone to network failure (but there's probably nothing we could do besides retrying...)

Related changes

Release Note

(Dumpling) now supports writing files larger than 50 GB to AWS S3.

Nov 16 '20 18:11 kennytm

/run-all-tests

Nov 24 '20 08:11 glorv

@kennytm: PR needs rebase.

Mar 24 '21 16:03 ti-chi-bot

@kennytm please resolve the conflicts

Apr 21 '21 06:04 lichunzhu
