kubo

ipfs files concat [ <local paths> | <cids> ]

Open lidel opened this issue 3 years ago • 6 comments

Documenting discussion with @ikreymer, @rangerMauve and @ribasushi

We are missing a high-level API for concatenating existing UnixFS files into bigger ones. Having one would improve deduplication when bigger archives in formats like WARC (https://webrecorder.net) consist largely of smaller files that are already on IPFS, because their CIDs/DAGs could be reused.

Use cases

  • building big files from preexisting DAGs (e.g. WARC from https://webrecorder.net assembled from standalone files)
  • (TBD) if we include dirs, then also
    • more friendly replacement for deprecated ipfs object patch append-data
    • merging multiple directories without mutating them in MFS

Proposed design

Add a concat command to ipfs files that accepts two or more UnixFS-compatible DAGs and returns the CID of a DAG that is the logical concatenation of all of them.

$ ipfs files concat [ /local/mfs/paths | /ipfs/cids ] 
bafy....
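As a rough illustration of the proposal, the sketch below models a UnixFS file as a node that either holds bytes or links to child nodes while recording the size of each child. Everything in it is a toy: `FileNode`, `concat`, and the SHA-256 "cid" are hypothetical names for illustration only, not kubo's API and not real dag-pb encoding. The point it tries to show is that the new root only references the existing nodes, so nothing is re-chunked or copied.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class FileNode:
    """Toy stand-in for a UnixFS file node (assumption: not real dag-pb)."""
    data: bytes = b""        # inline bytes, used by leaf nodes
    links: tuple = ()        # child FileNodes, used by intermediate nodes
    blocksizes: tuple = ()   # file size covered by each link, in order

    @property
    def filesize(self) -> int:
        return len(self.data) + sum(self.blocksizes)

    @property
    def cid(self) -> str:
        # Hypothetical identifier: a hash over own bytes and child identifiers.
        h = hashlib.sha256(self.data)
        for child in self.links:
            h.update(child.cid.encode())
        return h.hexdigest()

    def read(self) -> bytes:
        # Reading walks the links in order, like `cat` on the resulting file.
        return self.data + b"".join(child.read() for child in self.links)


def concat(*files: FileNode) -> FileNode:
    """Build a new root linking to the inputs in order; inputs stay untouched."""
    if len(files) < 2:
        raise ValueError("concat needs two or more files")
    return FileNode(links=tuple(files),
                    blocksizes=tuple(f.filesize for f in files))
```

For example, `concat(FileNode(data=b"hello "), FileNode(data=b"world")).read()` yields the joined bytes, while each child keeps its original identifier and can still be fetched on its own.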

FAQ / Open questions

We need to agree on how to handle edge cases. Below are my initial ideas; feedback on ergonomics and potential implementation caveats is appreciated.

  • What happens when passed DAGs are all files?

    • Concatenate them in order and produce a new UnixFS file that reuses the original DAGs (maximizing deduplication)
  • Should this support directories? It opens additional questions:

    • What happens when passed DAGs are all directories?
      • Create a new directory which has all children from original directories (in-order?)
    • What happens when the first DAG is a directory and all remaining ones are files?
      • Create a new directory which has remaining files added
    • What happens when the first DAG is a file but at least one of the remaining ones is a directory?
      • (A) Return "Error: concatenating directories is possible only when the first DAG is a UnixFS directory"
      • (B) Concatenate everything into a single UnixFS directory (children from directories + standalone files)
    • What happens if the same CID is in two directories under the same name?
      • Should it be duplicated or deduplicated?
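To make the directory questions above concrete, here is a minimal sketch of one possible answer for the "all directories" case. It models a directory as a plain mapping from entry name to CID string. The function name `merge_dirs` is hypothetical, and two of its behaviors are assumptions rather than anything decided in this issue: an identical name/CID pair is kept once (the "deduplicated" option), and the same name pointing at different CIDs is treated as an error.

```python
def merge_dirs(*dirs: dict) -> dict:
    """Merge directory listings (name -> CID) in argument order.

    Assumptions: identical name/CID pairs are kept once; the same name
    with a different CID is rejected instead of silently overwritten.
    """
    merged = {}
    for d in dirs:
        for name, cid in d.items():
            if name in merged and merged[name] != cid:
                raise ValueError(f"name conflict for {name!r}")
            # setdefault keeps the first occurrence, preserving input order
            merged.setdefault(name, cid)
    return merged
```

Other policies (last one wins, or automatic renaming) would fit the same shape. The choice here only illustrates what an explicit rule for name conflicts could look like.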

lidel avatar Aug 09 '22 17:08 lidel

My take is: hard-error on directories, support only files and pipes. Just like /bin/cat

ribasushi avatar Aug 09 '22 17:08 ribasushi

I put together a test repo using js-unixfs to show how concat could work under the hood with building up nodes from several sub nodes.

https://github.com/RangerMauve/js-ipfs-stitch-test/

Agreed that directories should be an error. I don't think we can cat a UnixFS tree with directories in it, so concatenating a directory in there seems like a separate use case.

RangerMauve avatar Aug 09 '22 18:08 RangerMauve

Another high-level API that would be super useful, and that becomes easy to support given the core ipfs files concat functionality, is a way to start with a single file and a list of split points/offsets that you'd want to split on.

It could be a subcommand: ipfs files concat add <local path> <split points>, where <split points> contains a JSON array, or one offset per line. The command would read <local path>, add the byte ranges between those offsets, and then concat the whole thing. E.g. given a 35M file and offsets [0, 10M, 25M], the command would add 0-10M, 10M-25M, and 25M-35M of the file. Maybe it could support other add options, like being able to choose a trickle DAG?

Maybe there are two subcommands: ipfs files concat add <local path> <split points>, and ipfs files concat merge [ <local paths> | <cids> ] for when the pieces already exist as individual files or have already been added as CIDs.

This just adds a common first step that would often be needed before using ipfs files concat
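The offset arithmetic described above can be sketched as follows. `split_ranges` is a hypothetical helper, not an existing command or library function. It assumes the offsets are start positions, that the first one is 0 so the whole file is covered, and that "M" means 1024 * 1024 bytes (the comment does not specify the unit).

```python
from typing import List, Tuple

M = 1024 * 1024  # assumption: "M" in the example means MiB


def split_ranges(size: int, offsets: List[int]) -> List[Tuple[int, int]]:
    """Turn start offsets into half-open (start, end) byte ranges."""
    if not offsets or offsets[0] != 0:
        raise ValueError("offsets must start at 0 to cover the whole file")
    if any(a >= b for a, b in zip(offsets, offsets[1:])):
        raise ValueError("offsets must be strictly increasing")
    if offsets[-1] >= size:
        raise ValueError("last offset must fall inside the file")
    # Each range ends where the next one starts; the last ends at EOF.
    ends = offsets[1:] + [size]
    return list(zip(offsets, ends))
```

With the 35M example, `split_ranges(35 * M, [0, 10 * M, 25 * M])` gives the three ranges 0-10M, 10M-25M, and 25M-35M, each of which would be added separately and then concatenated.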

ikreymer avatar Aug 09 '22 18:08 ikreymer

@ikreymer too complex. You'd simply:

ipfs files concat yourfile:0:20 yourfile:21:40 yourfile:41:
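A parser for this `path:start:end` notation might look like the sketch below. The syntax is only a suggestion in this thread, so the details are assumptions: `parse_slice` and `Slice` are hypothetical names, an empty end means "to end of file", and ends are treated as inclusive because the example goes from `:20` to `21:`. The parser converts them to exclusive ends internally.

```python
from typing import NamedTuple, Optional


class Slice(NamedTuple):
    path: str
    start: int
    end: Optional[int]  # exclusive end; None means "until end of file"


def parse_slice(spec: str) -> Slice:
    """Parse 'path:start:end' or 'path:start:' (assumed inclusive end)."""
    parts = spec.rsplit(":", 2)  # split from the right so paths may contain ':'
    if len(parts) != 3:
        raise ValueError(f"expected path:start:end, got {spec!r}")
    path, start_s, end_s = parts
    start = int(start_s)
    end = int(end_s) + 1 if end_s else None  # inclusive -> exclusive
    if start < 0 or (end is not None and end <= start):
        raise ValueError(f"invalid byte range in {spec!r}")
    return Slice(path, start, end)
```

Under these assumptions the three arguments in the command above would cover bytes 0-20, 21-40, and 41 to the end of `yourfile`, with no gaps or overlaps.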

ribasushi avatar Aug 09 '22 18:08 ribasushi

> @ikreymer too complex. You'd simply:
>
> ipfs files concat yourfile:0:20 yourfile:21:40 yourfile:41:

Yeah, I guess I could live with that. I was just thinking that a separate split file makes for an easier user API, especially if this is to be supported in libraries as well as the CLI, and when dealing with hundreds of split points.

ikreymer avatar Aug 09 '22 19:08 ikreymer

I've implemented a small library in JS that includes concat as well as some related utilities that are useful for the web archiving use case: https://github.com/webrecorder/ipfs-composite-files

ikreymer avatar Aug 23 '22 22:08 ikreymer

Wrote something in go: https://github.com/anjor/unixfs-cat/blob/main/unixfs_cat.go

Happy to work more on it if it's useful/along the lines of the thinking here.

anjor avatar Mar 08 '23 15:03 anjor