
[WIP] feat: add support for chunked uploads

Opened by hugomrdias · 11 comments

This is still a work in progress to add support for chunked uploads (ipfs.add) and fix multiple issues related to adding big files.

Tests are filtered here https://github.com/ipfs/js-ipfs-api/blob/90c40363fbcd55d29307e51f4feabb8be867ded8/test/add-experimental.spec.js#L38-L46 to make review easy; just run the ipfs daemon with https://github.com/ipfs/js-ipfs/pull/1540.

Features/fixes in this PR, together with https://github.com/ipfs/js-ipfs/pull/1540:

  • big data add non-chunked (this will either break with browser memory or hit the maxBytes config in the daemon, see next)
  • really big data add chunked (theoretically the limit is daemon disk space or maybe request timeouts)
  • streaming progress reporting
  • error handling and reporting
  • add multiple files with wrapWithDirectory
  • improved browser support: handles Files directly from the file input, e.g.

    const files = document.getElementById('file').files;

    this.ipfsApi
      .add([...files], {
        wrapWithDirectory: true,
        experimental: true,
        progress: prog => console.log(`received back: ${prog}`),
        chunkSize: 10 * 1024 * 1024 // 10 MiB chunks
      })
      .then(console.log)
      .catch(console.error);
  • JSDoc for the top-level API and more

Notes:

  • trailers https://stackoverflow.com/questions/13371367/do-any-browsers-support-trailers-sent-in-chunked-encoding-responses

Needs:

  • https://github.com/ipfs/js-ipfs/pull/1540

Todo:

  • [x] validate this example still works after the change: https://github.com/ipfs/js-ipfs-api/tree/master/examples/upload-file-via-browser
  • [x] what to do with progress? add another handler?
  • [x] multiple files: the daemon returns only the first hash
  • [ ] ~~concurrent upload chunks~~ new PR for this
  • [x] check uuid impl (maybe change to uuid v5 or nano-id)
  • [x] avoid preflight as much as possible
  • [x] callbackify top level
  • [x] try handling non-chunked uploads
  • [x] fix multipart boundary handling for non-chunked uploads

Related:

hugomrdias · Sep 03 '18 16:09

@lidel your understanding is correct :). Updated the PR with some of your feedback.

Regarding the UUID: I had looked into it; for now I want to keep the poor man's version, which should be safe enough since it goes over Math.random a couple of times. (I have a note to go back to this.)
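
For reference, a minimal sketch of what such a poor man's id could look like (illustrative only, not the exact code in this PR):

    // Illustrative only: concatenate several Math.random() draws to get an
    // id that is unlikely to collide (not cryptographically secure).
    function poorMansId () {
      let id = ''
      for (let i = 0; i < 4; i++) {
        id += Math.random().toString(36).slice(2)
      }
      return id
    }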

The final integration will use the normal add API with only one change: a new option called chunkSize. If this option is set to a number, we go through the chunked codepath.
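
In code, the option detection could look roughly like this (sendChunked and sendDirect are hypothetical helper names, not the actual internals of this PR):

    // Sketch: a numeric chunkSize selects the chunked codepath,
    // anything else falls back to the existing single-request add.
    function add (files, options = {}) {
      if (typeof options.chunkSize === 'number') {
        return sendChunked(files, options) // hypothetical: multiple POSTs
      }
      return sendDirect(files, options) // hypothetical: existing single POST
    }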

About progress: I'm still trying to add directly without files; if I succeed, this should work the same as it does right now. If not, one solution I thought of was adding a new handler, uploadProgress.

The current progress handler would still work as-is, but only in the last request, and it would mean add-to-IPFS progress only, while uploadProgress would mean upload-only progress. With this we wouldn't actually break anything relying on the progress handler; the user would only see 0% for a long time (uploading), and on the last request it would update correctly as data goes into IPFS (adding). To improve on this, the developer would have the new uploadProgress. Does this make sense?

hugomrdias · Sep 04 '18 11:09

@hugomrdias thanks!

My view is that we should do our best to make it work without changing the current progress API. Details of the chunked upload should be abstracted away in a best-effort fashion and hidden behind the existing progress reporter.

What if we detect the presence of the chunkSize parameter and switch the logic used for progress reporting behind the scenes?

For upload split into N chunks:

  • uploading chunks 1 to (N-1) would show "upload only progress" (initially we could just return % based on the number of uploaded chunks, more resolution can be added later)
  • uploading the last chunk N could show real "add progress" but only when it is bigger than "upload progress"

The end result would be best-effort progress reporting that works with the existing API, is not stuck at 0% until the last chunk, and behaves in the expected manner (% always grows).
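
A sketch of that best-effort mapping (the names below are illustrative, not from the PR):

    // Illustrative sketch: monotonic best-effort progress across N chunks.
    // Chunks 1..N-1 report coarse upload progress; add progress from the
    // last chunk is surfaced only once it exceeds the upload progress.
    function makeProgressReporter (totalChunks, onProgress) {
      let best = 0
      const report = pct => {
        if (pct > best) {
          best = pct
          onProgress(best)
        }
      }
      return {
        chunkUploaded: index => report(Math.floor((index / totalChunks) * 100)),
        addProgress: pct => report(pct) // only shown once it passes upload %
      }
    }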

lidel · Sep 04 '18 12:09

[screen capture: ipfs-chunked-add demo]

hugomrdias · Sep 04 '18 17:09

@lidel the first two topics should be addressed in the last commit.

About the resumable stuff, it's mostly:

  • having good errors for failed chunks; http-api should retry those
  • an extra GET endpoint to return the uploaded chunks; with this response, http-api should be able to figure out the missing chunks and upload only those
  • one thing missing is how to identify an upload session to resume; the current UUID is not enough, so I need to do more research on this

So, let's leave the resume feature to a follow-up PR.
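
For that follow-up, the resume flow could roughly look like this (the endpoint path and response shape below are assumptions, not part of this PR):

    // Assumed sketch: ask the daemon which chunk indexes it already has for
    // a session, then re-upload only the missing ones.
    async function resumeUpload (apiUrl, sessionId, chunks) {
      const res = await fetch(`${apiUrl}/add-chunked/${sessionId}`) // hypothetical endpoint
      const { received } = await res.json() // e.g. { received: [0, 1, 3] }
      const have = new Set(received)
      for (let i = 0; i < chunks.length; i++) {
        if (!have.has(i)) {
          await fetch(`${apiUrl}/add-chunked/${sessionId}/${i}`, { // hypothetical endpoint
            method: 'POST',
            body: chunks[i]
          })
        }
      }
    }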

hugomrdias · Sep 10 '18 16:09

The JSDoc should create some nice docs with documentation.js.

Run this command outside of the repo's folder to get the latest documentation.js (aegir still uses an old one):

    npx documentation serve ./js-ipfs-api/src/add2/add2.js -w -f html

[screenshot: generated API docs]

It should also give code completion to anyone using editors with JSDoc support [screenshot: api-completion]. This can bubble up to the top-level public API with minimal changes to this file.
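
For context, this is the kind of annotation that drives both the generated docs and the editor completion (a generic example, not the actual add2.js source):

    /**
     * Add data to IPFS, optionally in fixed-size chunks.
     *
     * @param {Array<File|Buffer>} files - Content to add.
     * @param {Object} [options] - Add options.
     * @param {number} [options.chunkSize] - Chunk size in bytes; enables chunked upload.
     * @param {function(number)} [options.progress] - Progress callback.
     * @returns {Promise<Array<{path: string, hash: string, size: number}>>}
     */
    function add (files, options) { /* ... */ }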

hugomrdias · Sep 11 '18 16:09

@Stebalien could we get your thoughts on adding this to go-ipfs?

This PR is adding a feature to the HTTP add endpoint that will allow big files to be uploaded to IPFS by making multiple requests.

@lidel kindly put together a good summary of the proposed process:

  • Upload payload is split into small parts (chunkSize = 256000)
  • Parts are sent as a sequence of HTTP POST requests, each of which has
    • a unique identifier for entire upload session (uuid? – see below)
    • a sequential counter within upload session (a chunk index)
  • The API backend needs to support additional HTTP headers to perform re-assembly of the entire payload from chunks and pass it to the regular ipfs.files.add call in a transparent manner
    • PR for js-ipfs: https://github.com/ipfs/js-ipfs/pull/1540
    • PR for go-ipfs: (TODO)

Reasons for doing this:

  1. It's not possible to stream an HTTP upload request (in Firefox) without buffering the entire payload into memory first
  2. It has the potential to allow resuming failed upload requests
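
Putting that flow into code, a minimal client-side sketch (the header names below are illustrative placeholders; the actual ones are defined in the linked PRs):

    // Illustrative sketch of the chunked upload flow described above.
    async function addChunked (apiUrl, payload, chunkSize = 256000) {
      const sessionId = Math.random().toString(36).slice(2) // upload session id
      const total = Math.ceil(payload.length / chunkSize)
      let res
      for (let i = 0; i < total; i++) {
        res = await fetch(`${apiUrl}/api/v0/add`, {
          method: 'POST',
          headers: {
            'X-Upload-Session': sessionId, // placeholder header name
            'X-Chunk-Index': String(i), // placeholder header name
            'X-Chunk-Total': String(total) // placeholder header name
          },
          body: payload.slice(i * chunkSize, (i + 1) * chunkSize)
        })
      }
      return res.json() // daemon reassembles and responds on the last chunk
    }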

alanshaw · Sep 27 '18 11:09

@hugomrdias @lidel I think that regardless of what happens with this PR, we need to switch to using the streaming fetch API. Firefox is notably the only browser that hasn't shipped the Streams API yet, but it sounds like this might happen soon. I think we can conditionally opt out of it for Firefox for the time being.

Switching to using the streaming fetch API will solve the buffering issue without any changes to the HTTP API, and depending on priorities for go-ipfs we might be able to ship this before chunked uploads.

It's also worth noting that streaming fetch would be way more efficient than multiple HTTP requests for chunked uploading.
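
For illustration, a streaming upload via fetch would look roughly like this; note, as the next comment points out, that browsers at the time only supported streams on response bodies, not request bodies (getNextChunk is a hypothetical chunk source):

    // Sketch only: a ReadableStream as a request body. This did not work in
    // any browser at the time of this thread.
    const stream = new ReadableStream({
      pull (controller) {
        const chunk = getNextChunk() // hypothetical chunk source
        chunk ? controller.enqueue(chunk) : controller.close()
      }
    })

    fetch('/api/v0/add', { method: 'POST', body: stream })
      .then(res => res.json())
      .then(console.log)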

alanshaw · Sep 27 '18 11:09

That's only for response bodies, not request bodies; this is the only way currently available to us. I didn't find any indication that request bodies will get streams soon in any browser.

hugomrdias · Sep 27 '18 12:09

> That's only for response bodies, not request bodies; this is the only way currently available to us. I didn't find any indication that request bodies will get streams soon in any browser.

You're absolutely right - my bad. Thanks for clarifying!

alanshaw · Sep 28 '18 09:09

Sounds like it's worth mentioning here, in case concepts from this PR are revisited in the future:

  • a proposal for an open protocol for resumable file uploads: https://tus.io / https://tus.io/protocols/resumable-upload.html

lidel · Jul 10 '19 12:07

> Sounds like it's worth mentioning here, in case concepts from this PR are revisited in the future:

Yep, I based the impl on tus.

hugomrdias · Jul 11 '19 15:07