
Retry on failed multipart upload does not reuse already uploaded parts

binairo opened this issue 7 months ago • 4 comments

Initial checklist

  • [x] I understand this is a bug report and questions should be posted in the Community Forum
  • [x] I searched issues and couldn’t find anything (or linked relevant results below)

Link to runnable example

No response

Steps to reproduce

  • Use AWS-S3 plugin
  • Upload a file that is big enough to trigger multipart uploads
  • Mimic a network issue (I take down my local S3 Minio storage)
  • Upload will go into 'errored' state
  • Fix network issue
  • Retry upload
  • Upload will start from scratch, discarding any uploaded parts (new multipart upload is created)

Expected behavior

One of the main selling points of multipart uploads is that you don't have to re-upload the whole file if something goes wrong. That currently doesn't work.

What I'd expect:

  • Upload multipart file
  • Network issue
  • Upload will go into 'errored' state
  • Retry upload
  • listParts is called to see whether there is anything to continue from
  • If yes, continue with the remaining parts
  • If not (or listParts fails), it's safe to start a new upload
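The retry flow above could be sketched roughly as follows. This is a hypothetical illustration, not Uppy's actual code: `listParts` here stands in for the S3 ListParts call and is injected so the example is self-contained, and `partsToUpload` is an invented helper name.

```typescript
// Hypothetical sketch of the resume-on-retry flow described above.
type Part = { PartNumber: number; ETag: string };

async function partsToUpload(
  totalParts: number,
  listParts: () => Promise<Part[]>,
): Promise<number[]> {
  let uploaded: Part[] = [];
  try {
    uploaded = await listParts();
  } catch {
    // listParts failed: safe to start a new upload from scratch
    return Array.from({ length: totalParts }, (_, i) => i + 1);
  }
  const done = new Set(uploaded.map((p) => p.PartNumber));
  // Only the parts S3 doesn't already have need to be (re)uploaded
  return Array.from({ length: totalParts }, (_, i) => i + 1).filter(
    (n) => !done.has(n),
  );
}
```

On retry this would skip every part S3 already stores, and only fall back to a fresh upload when listParts itself fails.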

Actual behavior

Currently, a multipart upload restarts completely if one of the parts fails.

When it fails, the internal S3MultipartState is reset and abortMultipartUpload is called.

This is quite annoying, especially on a bad connection where parts fail intermittently: you don't want to re-upload all the parts that already succeeded.

I noticed that if I take down the internet connection completely, Uppy seems to recognize this (although it doesn't always recover properly), but for our users the problem is unreliable connections, not complete disconnects.

binairo avatar May 22 '25 08:05 binairo

I have found another situation where this is not working as expected, described in #5927

In Firefox only, on page unload (e.g. a refresh), all in-progress files are aborted and therefore marked as FAILED just before the page is actually unloaded (unlike Chrome, where no code runs).

I have patched Golden Retriever to handle complete events when there are FAILED files differently, but the aborted files still get completely reset and do not continue with already uploaded parts.

Edit: I've worked around this particular Firefox annoyance by pausing the upload altogether, but the underlying issue of FAILED uploads being completely reset still happens in other situations.

binairo avatar Aug 25 '25 13:08 binairo

Hi! I'm not able to reproduce this in Chrome. When I cut the network during an upload (Chrome devtools), it just fails a few network requests for the next parts, then keeps retrying until I reconnect (the retryDelays option). Once reconnected, it continues from where it left off. Even if I reload the browser and drag-drop the file again, it starts at the position where it left off. It would be interesting to see what your Uppy options look like.

mifi avatar Sep 17 '25 22:09 mifi

Hi, thanks for looking into this!

I've delved into this a bit deeper, and the issue is more complex than I thought; it's actually not one issue.

Networking problems

I've noticed that turning off the network in both Firefox and Chrome devtools doesn't always cancel active requests but mostly applies to new requests, so this is not a very representative test of an actual network issue.

If you cut the network in Firefox devtools, the experience is different from Chrome. Sometimes, partially depending on which call is interrupted (the PUT of a part or the request to sign the next part), the upload hangs (and cannot be retried or continued), and sometimes it fails completely and has to start over. But not always; sometimes things resume just fine 🤷

I am not sure why Firefox treats the interrupted calls differently from Chrome, or why it sometimes works and sometimes doesn't. Calls to signPart that fail seem more likely to trigger a complete failure.

Server issues

While Chrome and Firefox behave differently when it comes to networking issues, they both fail in the same way when the server fails.

I'm able to reproduce the original issue in Chrome too by stopping the local Minio instance that I'm uploading to. Uppy keeps retrying for a while before marking the file as 'failed'; retrying then restarts from scratch.

The same happens when I stop the backend that handles the signPart requests.

If something fails outside the browser and the connection, I guess it makes a bit more sense to say 'this failed for unknown reasons, retry from scratch'. But I can imagine situations (servers under high load, network issues outside the local connection, ...) where a retry a bit later would just work, so maybe Uppy should decide whether to restart from scratch based on the result of a listParts call: if it returns parts, it's safe to resume; otherwise, restart from scratch.

No back-off when retrying

While debugging I noticed that (in both Chrome and Firefox) there is no back-off implemented in the retry logic. For the PUT of a part there seems to be a bit of delay between retries, but it's the same delay every time. For signPart, there is no back-off or delay at all: if a call fails, it is immediately retried, and the (unavailable) server is hammered with dozens of calls before Uppy gives up and marks the upload as failed.
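For reference, the kind of capped exponential back-off being asked for might look like this. This is a generic sketch with arbitrary base delay and cap values, not Uppy's implementation; `backoffDelay` and `withRetry` are invented names for illustration.

```typescript
// Sketch of capped exponential back-off, as one way to avoid hammering
// a struggling server with immediate retries.
function backoffDelay(attempt: number, baseMs = 500, capMs = 30_000): number {
  // 500ms, 1s, 2s, 4s, ... capped at 30s
  return Math.min(capMs, baseMs * 2 ** attempt);
}

async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up once the attempt budget is spent
      if (attempt + 1 >= maxAttempts) throw err;
      // Wait longer after each consecutive failure
      await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt)));
    }
  }
}
```

Production implementations usually also add random jitter to the delay so that many clients retrying at once don't all hit the server in lockstep.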

Hope this helps!

binairo avatar Sep 18 '25 09:09 binairo

Thanks for researching. I did some testing with Firefox and also noticed something fishy. I can't reproduce it every time, but I did see this:

  • Use S3 multipart
  • Start upload a file
  • devtools -> network -> offline
  • it suddenly floods hundreds of these in the js console: Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at http://localhost:3020/s3/multipart/VbBkA.PjEUxyvYMiOeRcW0yLjNVic2ZGCkOOGCxA1UvC9ikOztxLExBVMXAwiAuV3HQAOXHzWtyhjLU3JSqdeutU6gmFJTpwDxtamiH6O0UwjLIceMQAh95ALlFrDnH_/12?key=my-prefix%2redacted. (Reason: CORS request did not succeed). Status code: (null).
  • devtools -> network -> online
  • The upload gets stuck and never recovers.

I think the whole S3 plugin needs some love when it comes to error handling and retrying. There are many race conditions in this kind of code. Likely related: #5961

mifi avatar Sep 18 '25 20:09 mifi