
@tus/s3-store: failed part upload crashes entire server

Open mitjap opened this issue 9 months ago • 3 comments

When uploadPart or uploadIncompletePart throws (is rejected), the entire server crashes. This happens often with Scaleway Object Storage, as their service fails with an error 500 and the AWS library logs only the non-informative console message "An error was encountered in a non-retryable streaming request." After listening for the unhandledRejection event I got some more information.
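The listener was along these lines (a standard Node.js pattern, assumed here rather than copied from this project); the output it produced follows.

process.on('unhandledRejection', (reason, promise) => {
  console.log('Unhandled Rejection at:', promise)
})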

Unhandled Rejection at: Promise {
  <rejected> S3ServiceException [InternalError]: We encountered an internal error. Please try again.
      at throwDefaultError (/app/node_modules/@smithy/smithy-client/dist-cjs/index.js:867:20)
      at /app/node_modules/@smithy/smithy-client/dist-cjs/index.js:876:5
      at de_CommandError (/app/node_modules/@aws-sdk/client-s3/dist-cjs/index.js:4965:14)
      at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
      at async /app/node_modules/@smithy/middleware-serde/dist-cjs/index.js:35:20
      at async /app/node_modules/@aws-sdk/middleware-sdk-s3/dist-cjs/index.js:483:18
      at async /app/node_modules/@smithy/middleware-retry/dist-cjs/index.js:321:38
      at async /app/node_modules/@aws-sdk/middleware-flexible-checksums/dist-cjs/index.js:315:18
      at async /app/node_modules/@aws-sdk/middleware-sdk-s3/dist-cjs/index.js:109:22
      at async /app/node_modules/@aws-sdk/middleware-sdk-s3/dist-cjs/index.js:136:14 {
    '$fault': 'client',
    '$metadata': {
      httpStatusCode: 500,
      requestId: 'txg5c09a7e2fbe14ba89375-0067aa0421',
      extendedRequestId: 'txg5c09a7e2fbe14ba89375-0067aa0421',
      cfId: undefined
    },
    Code: 'InternalError',
    RequestId: 'txg5c09a7e2fbe14ba89375-0067aa0421',
    HostId: 'txg5c09a7e2fbe14ba89375-0067aa0421',
    Resource: '<s3-object-key>.bin'
  },

I believe the problem is that when a deferred promise is created, it does not handle rejections; it is only inserted into a list.
https://github.com/tus/tus-node-server/blob/81eb03a82658b2e5e97abdb05c26418192320724/packages/s3-store/src/index.ts#L421
At the end the promises are aggregated with Promise.all and returned to the caller, where a rejection handler is eventually added.
https://github.com/tus/tus-node-server/blob/81eb03a82658b2e5e97abdb05c26418192320724/packages/s3-store/src/index.ts#L440
https://github.com/tus/tus-node-server/blob/81eb03a82658b2e5e97abdb05c26418192320724/packages/server/src/server.ts#L188

During this window, if any of the promises in the list is rejected, the rejection is not handled.
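To make the mechanism concrete, here is a minimal, self-contained sketch (not code from the store) of the same race: a promise rejects while it sits in the list with no handler attached, so Node reports an unhandled rejection and, with default settings, exits the process before the late handler ever runs.

const promises = []

// simulate a part upload that fails shortly after being started
promises.push(new Promise((resolve, reject) => {
  setTimeout(() => reject(new Error('part upload failed')), 10)
}))

// the rejection handler is only attached much later, after aggregation;
// by then the rejection has already been reported as unhandled
setTimeout(async () => {
  try {
    await Promise.all(promises)
  } catch (err) {
    console.log('caught:', err.message)
  }
}, 1000)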

Steps to reproduce

Use an S3 server that sometimes throws errors (e.g. Scaleway Object Storage) or throw an error manually. I modified uploadPart(metadata, readStream, partNumber) to randomly reject by adding the snippet below at the top of the function. This results in a server crash.

// reject roughly 20% of part uploads after a short delay
const errorProbability = Math.random()
if (errorProbability < 0.2) {
  return new Promise((resolve, reject) => {
    setTimeout(() => {
      reject(new Error('test'));
    }, 1000);
  })
}

Expected behavior

The server should not crash. A failed part upload should end the current upload with an HTTP 500 Internal Server Error status code.

Observation

The promises (and thus rejections) returned by acquiredPermit?.release() and permit?.release() are also not handled and would likewise crash the server.
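One way to keep those from becoming unhandled rejections is to attach a handler right where release() is called; a hedged sketch, assuming release() returns a promise and using the names from this issue rather than the current source:

await acquiredPermit?.release().catch((err) => {
  // log and swallow: a failed permit release must not crash the process
  console.error('failed to release permit', err)
})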

mitjap avatar Feb 10 '25 20:02 mitjap

I think this issue could be "managed" with a simple modification: when a deferred promise is rejected, just destroy the streamSplitter so it stops accepting more data.

promises.push(deferred.catch((error) => {
  splitterStream.destroy(error);
}))

The error is then propagated properly as part of the pipeline: error 500 is sent to the client and the server does not crash.

try {
  await streamProm.pipeline(readStream, splitterStream)
} catch {
  // ...
}

The next issue is that await Promise.all(promises) settles too early: it rejects as soon as the first promise is rejected, while other parts are still being uploaded, leaving them dangling. We need to wait for all promises to settle before proceeding.

This could be solved with the following change: first wait for all promises to settle (successfully or not), and only then reject with an error if necessary.

await Promise.allSettled(promises); // wait for all to settle
await Promise.all(promises); // reject if any promise was rejected
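The same idea can be expressed as a small standalone helper (illustrative only, not a proposed API): settle everything first, then surface the first failure so no upload is left dangling.

async function settleAllOrThrow(promises) {
  const results = await Promise.allSettled(promises)
  // all promises have settled by now; rethrow the first failure, if any
  const rejected = results.find((result) => result.status === 'rejected')
  if (rejected) {
    throw rejected.reason
  }
  return results.map((result) => result.value)
}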

The next problem is with the CancellationContext, which is aborted as soon as the streamSplitter is destroyed; this ends the incoming request and releases the lock, allowing the user/client to make a HEAD request and then a PATCH request against a resource that is still being processed. Any lock should be kept until all processing is done (until all promises settle). An AbortController can be used to cancel pending promises faster, but not all of them can be aborted (e.g. uploadPart, uploadIncompletePart) and those need to be waited for.
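A rough sketch of the ordering argued for here (illustrative only; context, lock and promises are stand-ins, not the actual server internals): cancel what can still be cancelled on failure, but keep the lock until every pending part upload has settled.

try {
  await streamProm.pipeline(readStream, splitterStream)
} catch (err) {
  context.abort() // cancel whatever can still be cancelled
  throw err
} finally {
  // uploadPart / uploadIncompletePart cannot be aborted and must be awaited
  await Promise.allSettled(promises)
  await lock.unlock() // only now may a HEAD or PATCH see the resource again
}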

mitjap avatar Feb 10 '25 21:02 mitjap

I think it would be nice to fix and release this on 1.x before doing the major release. I don't like bug fixes on major releases.

Since you wrote the plan down already, do you also want to work on it?

Murderlon avatar Feb 12 '25 11:02 Murderlon

I can solve the issues with the S3 store regarding how it handles errors from the S3 provider.

The issues with the locking mechanism and the CancellationContext are much larger and would need a separate task and discussion. Fixing them would most probably require a major version bump for the data stores, as all data stores would need to be aware of the CancellationContext to abort whatever they are doing.

mitjap avatar Feb 12 '25 13:02 mitjap