
lib-storage Upload does not retry if one part fails.

Open alvaromat opened this issue 3 years ago • 19 comments

Describe the bug

I'm using Upload from lib-storage to take advantage of multipart uploads for large files. I would expect it to retry if uploading any individual part fails. Imagine a 5 TB file split into 5 MB parts: that results in many, many parts, so the chance of at least one of them failing is high.

There is the leavePartsOnError option. Setting it to true interrupts the whole upload if any part fails.

Your environment

SDK version number

@aws-sdk/[email protected]

Is the issue in the browser/Node.js/ReactNative?

Browser

Details of the browser/Node.js/ReactNative version

  Browsers:
    Chrome: 90.0.4430.85
    Firefox: 78.0.2
    Safari: 14.0.3

Steps to reproduce

Upload a file using Upload, exactly like in the example:

    try {
      const upload = new Upload({
        client: s3,
        params: uploadParams,
        leavePartsOnError: false
      })

      upload.on('httpUploadProgress', progress => {
        console.log(progress)
      })

      const uploadResult = await upload.done()
      console.log('done uploading', uploadResult)
    } catch (err) {
      showErrors('There was an error uploading your photo: ', err)
    }

Observed behavior

If no part fails, the upload completes correctly. If one or more parts fail, those parts are skipped and the final file is assembled without them, resulting in a corrupt file. This can be reproduced by dropping the connection (turning WiFi off) or by blocking requests in the browser DevTools.

Expected behavior

If a single part fails to upload, which is likely to happen with big files in mobile environments, just retry sending that part. This could be done, for example, here: https://github.com/aws/aws-sdk-js-v3/blob/28008c2a3e30cd234f447484d76777d5db56dad1/lib/lib-storage/src/Upload.ts#L124-L130

Do not send the final CompleteMultipartUpload request if any part is missing: https://github.com/aws/aws-sdk-js-v3/blob/28008c2a3e30cd234f447484d76777d5db56dad1/lib/lib-storage/src/Upload.ts#L152-L162

alvaromat avatar Apr 28 '21 15:04 alvaromat

@alvaromat you bring up a really good point. I agree that this is a much better experience. However, this isn't how Upload is designed today, so I'll change this to a feature request. As you point out, this isn't a difficult change, and it could also be a good first PR for new contributors!

In the meantime I'd love to open it up for public discussion. I'm looking to get a sense of what the retry should look like. No right or wrong answers, just looking to get a sense of where people are at in their thinking.

  1. Let's say a part fails - how many times would you expect that part to be retried? Should that be configurable?
  2. Let's say multiple parts fail - do we cap the total number of retries?
  3. How do we balance failing fast vs. retrying adequately? What's best for the customer? What if I'm in an area with bad cell reception? What if something is wrong with the file/config/whatever and retrying a million times is just going to delay the customer?

alexforsyth avatar May 10 '21 16:05 alexforsyth

Assigning @ajredniwja to support the discussion

alexforsyth avatar May 10 '21 16:05 alexforsyth

  1. Let's say a part fails - how many times would you expect that part to be retried? Should that be configurable?
  2. Let's say multiple parts fail - do we cap the total number of retries?
  3. How do we balance failing fast vs. retrying adequately? What's best for the customer? What if I'm in an area with bad cell reception? What if something is wrong with the file/config/whatever and retrying a million times is just going to delay the customer?

The simplest version would just retry multiple times with a backoff. I'd make the maximum number of retries per part configurable, since multiple parallel parts can fail at the same time. The backoff could grow in a Fibonacci scheme to avoid failing too fast: the first failure waits 3 seconds, the next 5, the next 8... With a default of 5 retries per part, it would wait for roughly a minute before failing.

It's also possible to take advantage of the offline/online events. This could be useful for the long offline periods that can happen in low-connectivity environments, like entering the subway. For example, when the offline event fires, we don't start new requests until the online event is triggered. There is also the navigator.onLine property, which is supported even by older browsers.

I'd apply this retry logic only to the part uploads, not to the initial request, where configuration issues are more likely to show up.
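
A minimal sketch of that idea for the browser, assuming a hypothetical wrapper around each part request (uploadPartWithRetry and waitUntilOnline are illustrative names, not lib-storage API):

    // Hypothetical helper, not part of lib-storage: retries a single part with a
    // Fibonacci backoff (3s, 5s, 8s, ...) and waits out browser offline periods.
    function waitUntilOnline(): Promise<void> {
      if (typeof navigator === 'undefined' || navigator.onLine) return Promise.resolve()
      return new Promise<void>(resolve =>
        window.addEventListener('online', () => resolve(), { once: true })
      )
    }

    async function uploadPartWithRetry(
      uploadPart: () => Promise<void>,
      maxRetriesPerPart = 5
    ): Promise<void> {
      let previousDelay = 2000
      let delay = 3000
      for (let attempt = 0; ; attempt++) {
        try {
          await waitUntilOnline() // don't start a request while the browser is offline
          return await uploadPart() // success: stop retrying
        } catch (err) {
          if (attempt >= maxRetriesPerPart) throw err // give up and surface the error
          await new Promise(resolve => setTimeout(resolve, delay))
          const nextDelay = previousDelay + delay // Fibonacci growth: 3s, 5s, 8s, 13s, ...
          previousDelay = delay
          delay = nextDelay
        }
      }
    }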

alvaromat avatar May 17 '21 08:05 alvaromat

Retrying failed parts was apparently a feature of AWS SDK v2. lib-storage doesn't have it, so this is a regression.

Also, here's what AWS recommends:

Enable a retry mechanism in the application making requests

Because of the distributed nature of Amazon S3, requests that return 500 or 503 errors can be retried. It's a best practice to build retry logic into applications that make requests to Amazon S3. We recommend that you set the retry count for your applications to 10.

All AWS SDKs have a built-in retry mechanism with an algorithm that uses exponential backoff. This algorithm implements increasingly longer wait times between retries for consecutive error responses. Most exponential backoff algorithms use jitter (randomized delay) to prevent successive collisions. For more information, see Error Retries and Exponential Backoff in AWS.

There are 500 errors from time to time when uploading multipart files, and currently none of the AWS-maintained JS libraries handles them properly. A quick search shows how common and frustrating these errors are, with no solution offered: https://www.google.com/search?q=s3+multipart+upload+error+500

@alvaromat I would argue against using navigator.onLine, as that would limit the ability to use this code in environments other than the browser (Next.js, React Native (web), Node, etc.).

All that's needed is to retry failed parts up to maxRetryAttempts times with a delay of a second or so. Since we're retrying individual parts to recover from an occasional 500 error, or to survive a short drop in connectivity, exponential backoff isn't really required.

mkrn avatar Sep 14 '21 18:09 mkrn

If I understand it correctly, the Upload module is designed so that (with the default behavior leavePartsOnError=false) if one part fails to upload, the whole upload will finish but without that part, creating a corrupt file and the promise will resolve with success. To me this seems like a very scary default behavior, and I would think that leavePartsOnError=true would be a safer default, or am I missing something?

Update: If I use leavePartsOnError: true, it actually retries before failing with @aws-sdk/client-s3 3.36.0 when using the S3 client from @aws-sdk/client-s3, due to the built-in retry strategy in the client.
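
For reference, a minimal sketch of that combination (the bucket, key and body are placeholders); the retries come from the client's standard retry behaviour (3 attempts by default), not from lib-storage itself:

    import { S3Client } from '@aws-sdk/client-s3'
    import { Upload } from '@aws-sdk/lib-storage'

    // maxAttempts: 3 is already the client default; it is spelled out here only
    // to make the source of the retries explicit
    const s3 = new S3Client({ region: 'us-east-1', maxAttempts: 3 })

    const upload = new Upload({
      client: s3,
      params: {
        Bucket: 'my-bucket', // placeholder
        Key: 'large-video.mp4', // placeholder
        Body: new Blob(['...file contents...']) // e.g. a File picked by the user
      },
      // reject done() if a part still fails after the client's retries,
      // instead of completing the upload with missing parts
      leavePartsOnError: true
    })

    await upload.done()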

mifi avatar Oct 10 '21 06:10 mifi

If I understand it correctly, the Upload module is designed so that (with the default behavior leavePartsOnError=false) if one part fails to upload, the whole upload will finish but without that part, creating a corrupt file and the promise will resolve with success. To me this seems like a very scary default behavior, and I would think that leavePartsOnError=true would be a safer default, or am I missing something?

@mifi This has also been my experience of using the @aws-sdk/lib-storage upload method with leavePartsOnError=false. It can result in corrupt binaries with missing parts.

Update: If I use leavePartsOnError: true, it actually retries before failing with @aws-sdk/client-s3 3.36.0 when using the S3-client from @aws-sdk/client-s3, due to the built in retry strategy in the client.

I agree with @mifi that the default should be leavePartsOnError=true, since, as @mifi says, letting Upload throw the error means it is handled by the S3 client, which performs the retry. Even without the client retry, leavePartsOnError=true is a better default because at least the exception gets thrown, so you know that parts are failing and can retry the whole file upload, rather than getting corrupt files from what appear to be successful uploads.

If this is going to be the behaviour going forward, a rename of the leavePartsOnError property and/or a rewording of the example on https://github.com/aws/aws-sdk-js-v3/tree/main/lib/lib-storage (leavePartsOnError: false, // optional manually handle dropped parts) would be welcome, as I find the current documented example confusing/misleading.

ThompsonGitHub avatar Jan 27 '22 00:01 ThompsonGitHub

Has there been any update on this - is there a way to actually manually handle the dropped parts?

danielvouch avatar Sep 13 '22 10:09 danielvouch

Hello, is there any progress on this feature?

lukerlv avatar Feb 07 '23 02:02 lukerlv

created a PR #4414

mifi avatar Feb 07 '23 09:02 mifi

If I understand it correctly, the Upload module is designed so that (with the default behavior leavePartsOnError=false) if one part fails to upload, the whole upload will finish but without that part, creating a corrupt file and the promise will resolve with success. To me this seems like a very scary default behavior, and I would think that leavePartsOnError=true would be a safer default, or am I missing something?

Update: If I use leavePartsOnError: true, it actually retries before failing with @aws-sdk/client-s3 3.36.0 when using the S3-client from @aws-sdk/client-s3, due to the built in retry strategy in the client.

@mifi hey man, can you show me the code? I don't know how to write it, thank you very much!

lukerlv avatar Feb 07 '23 09:02 lukerlv

Not sure which code? Just set leavePartsOnError: true

mifi avatar Feb 07 '23 10:02 mifi

Not sure which code? Just set leavePartsOnError: true

ok, thank you @mifi

lukerlv avatar Feb 16 '23 01:02 lukerlv

@RanVaknin Is there any way to get this assigned a higher priority?

While there is a workaround available, it's not obvious that it needs to be used until something fails in a potentially catastrophic way, and teams spend potentially hours investigating.

The documentation pointed to above suggests to a user reading it that the default behaviour is the safe one with retries.

macdja38 avatar Mar 27 '23 21:03 macdja38

If I understand it correctly, the Upload module is designed so that (with the default behavior leavePartsOnError=false) if one part fails to upload, the whole upload will finish but without that part, creating a corrupt file and the promise will resolve with success. To me this seems like a very scary default behavior, and I would think that leavePartsOnError=true would be a safer default, or am I missing something?

Update: If I use leavePartsOnError: true, it actually retries before failing with @aws-sdk/client-s3 3.36.0 when using the S3-client from @aws-sdk/client-s3, due to the built in retry strategy in the client.

after set "leavePartsOnError: true" in @aws-sdk/lib-storage. how to set retry ? I want a part retry upload 3 times when upload fail. hope you anwser @mifi

0zcl avatar Sep 26 '23 08:09 0zcl

I think it will auto retry, see https://github.com/aws/aws-sdk-js-v3/issues/2311#issuecomment-939413928

mifi avatar Sep 26 '23 09:09 mifi

I think it will auto retry, see #2311 (comment)

When I upload a video using @aws-sdk/lib-storage with "leavePartsOnError: true" and disconnect the network during the upload, I only see one error ("partNumber=7 upload fail") in the Network tab, as the image below shows. If client-s3 retried, I would expect to see more than one failed request there. Hope you can answer, @mifi

(screenshot: browser Network tab showing a single failed part upload request)

0zcl avatar Sep 26 '23 10:09 0zcl

@0zcl hey man, have you solved the problem?

lukerlv avatar Nov 13 '23 10:11 lukerlv

@0zcl Hey man, have you solved the problem?

No, I can't configure the number of retries, although it doesn't affect usage.

0zcl avatar Nov 13 '23 10:11 0zcl

I also set Offline in the DevTools throttling drop-down to test the multipart upload retries.

With leavePartsOnError: true, uploading the file fails and no part is retried. If set to false (the default), the error shows up later, when analyzing the file, because it is corrupted (one of the parts is absent).

@RanVaknin Is it possible to either retry uploading the failed part or the whole upload? I am using @aws-sdk/client-s3: ^3.418.0, @aws-sdk/lib-storage: 3.501.0

glebbars avatar Jan 29 '24 15:01 glebbars

The Upload class uses an S3 client as one of its inputs. This client has a configured retry strategy by default that includes up to 3 attempts.

Because Upload splits the input into buffered chunks, even if the original input is a stream, these UploadPart requests will be retried according to the retryStrategy of the client.

You do not need to modify the retry configuration because the default will retry, but here is some information:

  • https://github.com/aws/aws-sdk-js-v3/blob/main/supplemental-docs/CLIENTS.md#retry-strategy-retrystrategy-retrymode-maxattempts
  • https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/Package/-smithy-util-retry/

Once https://github.com/aws/aws-sdk-js-v3/pull/6112 is merged, the only reason to set leavePartsOnError = true will be if you don't want a failed upload to mark the leftover parts as aborted on S3 to save on storage.
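
For anyone who does want more than the default three attempts, the linked docs cover both the plain maxAttempts setting and a custom retryStrategy. A sketch along those lines (the numbers are illustrative):

    import { S3Client } from '@aws-sdk/client-s3'
    import { ConfiguredRetryStrategy } from '@smithy/util-retry'

    // This client is then passed to new Upload({ client, ... }); every UploadPart
    // request made by Upload inherits this retry behaviour.
    const s3 = new S3Client({
      retryStrategy: new ConfiguredRetryStrategy(
        10, // maximum attempts per request (illustrative)
        (attempt: number) => 1000 + attempt * 1000 // delay in ms before each retry
      )
    })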

kuhe avatar May 21 '24 15:05 kuhe

Since this is addressed in the PR, I'm going to close this issue. Closed issues are not as closely monitored, so if you have any additional concerns about this, please create a new issue.

Thanks, Ran~

RanVaknin avatar May 23 '24 18:05 RanVaknin

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs and link to relevant comments in this thread.

github-actions[bot] avatar Jun 09 '24 00:06 github-actions[bot]