Article: options and tradeoffs around data import parameters

Open lidel opened this issue 3 years ago • 9 comments

Most people are OK with whatever chunker and hash function are the current defaults in commands that import data to IPFS. In the case of go-ipfs, these are ipfs add, ipfs dag put, and ipfs block put.

However, one can not only pass a custom --chunker and --hash function to ipfs add, but also choose to produce a TrickleDAG instead of a MerkleDAG by passing --trickle, enable or disable --raw-leaves, or even write one's own software that chunks, hashes, and assembles UnixFS DAGs in novel ways.

One can go beyond that and import JSON data as dag-json or dag-cbor, creating data structures beyond regular files and directories.

We need an article that explains:

  • what is the current default when importing files and why
    • chunker (why we use size-based, when to use rabin or buzhash)
    • hash (why we use sha2-256)
    • raw leaves (possible and default when cidv1 is used, but legacy implementations used cidv0 without this)
    • cid version
      • we should document cid v1 as the default, but note that legacy implementations may use v0
    • dag type ( --trickle better suited for append-only data such as logs?)
  • what are the knobs one can change during import, and what is their impact/tradeoffs
  • things to hint at, but no need to go too deep
    • note dag-pb alternatives exist, mention dag-json and dag-cbor, and hint when using non-UnixFS DAGs makes sense

Prior art:

  • --help explainer around different chunkers https://github.com/ipfs/go-ipfs/pull/8952
  • DAG metadata impacting final CID https://github.com/ipfs/ipfs-docs/issues/1152

lidel avatar Jun 10 '22 19:06 lidel

Hey @lidel, feel free to assign me. Got time tomorrow for that. :)

RubenKelevra avatar Jun 10 '22 20:06 RubenKelevra

@lidel wrote:

what is the current default when importing files and why

  • chunker (why we use size-based...)

I may need some input here. I actually can't think of a reasonable explanation why size-based is better than a rolling chunker.

Maybe someone like @Stebalien can chime in here and tell me why the decision was made to use a size-based chunker by default. :)
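For readers weighing the two approaches, here is a minimal sketch of the difference. It is not Kubo's rabin or buzhash implementation: `content_defined_chunks` is a hypothetical toy that hashes a small trailing window with SHA-256 instead of using a real rolling hash, and all sizes are made-up small values. It only illustrates why content-defined boundaries survive an insertion at the front of a file while size-based boundaries do not.

```python
import hashlib
import random
from typing import List


def fixed_chunks(data: bytes, size: int = 256) -> List[bytes]:
    """Size-based chunking: cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes, window: int = 16, mask: int = 0x3F,
                           min_size: int = 32, max_size: int = 1024) -> List[bytes]:
    """Toy content-defined chunking (assumption: NOT buzhash or rabin).

    A cut is made where a fingerprint of the last `window` bytes has its
    low bits equal to zero, so boundaries depend on content, not offsets.
    `min_size` must be at least `window` so the window never crosses `start`.
    """
    chunks, start = [], 0
    for i in range(len(data)):
        length = i - start + 1
        if length < min_size:
            continue
        fingerprint = hashlib.sha256(data[i - window + 1:i + 1]).digest()
        at_boundary = (fingerprint[-1] & mask) == 0
        if at_boundary or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def shared(a: List[bytes], b: List[bytes]) -> int:
    """Number of distinct chunks (by SHA-256 digest) present in both lists."""
    digests_a = {hashlib.sha256(c).digest() for c in a}
    digests_b = {hashlib.sha256(c).digest() for c in b}
    return len(digests_a & digests_b)


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(8192))
edited = b"X" + original  # one byte inserted at the front

print("fixed-size shared chunks:",
      shared(fixed_chunks(original), fixed_chunks(edited)))
print("content-defined shared chunks:",
      shared(content_defined_chunks(original), content_defined_chunks(edited)))
```

The size-based chunker is simpler, faster, and produces predictable block sizes; the content-defined one costs extra hashing work but can deduplicate across edited versions of a file.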

RubenKelevra avatar Jun 14 '22 07:06 RubenKelevra

  • dag type ( --trickle better suited for append-only data such as logs?)

Correct me if I'm wrong, but it's just a little less overhead for data that is read from front to back anyway. So any file type that needs random access will be slowed down.

Logs are not large enough to make any significant difference here, as you can easily fit a list of all chunks of a log in one block.
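A rough back-of-the-envelope sketch of that claim follows. The chunk size (256 KiB) and the maximum number of links per node (174) are assumptions about the usual Kubo defaults for the balanced layout, and `balanced_dag_depth` is a hypothetical helper, not part of any IPFS library.

```python
import math
from typing import Tuple


def balanced_dag_depth(file_size: int,
                       chunk_size: int = 256 * 1024,  # assumed default chunk size
                       max_links: int = 174           # assumed links per node
                       ) -> Tuple[int, int]:
    """Return (leaf_count, levels_of_link_nodes_above_the_leaves)."""
    leaves = max(1, math.ceil(file_size / chunk_size))
    depth, nodes = 0, leaves
    while nodes > 1:
        # Each parent node can reference up to `max_links` children.
        nodes = math.ceil(nodes / max_links)
        depth += 1
    return leaves, depth


for label, size in [("10 MiB log", 10 * 1024 ** 2),
                    ("43.5 MiB file", 174 * 256 * 1024),
                    ("1 GiB file", 1024 ** 3)]:
    leaves, depth = balanced_dag_depth(size)
    print(f"{label}: {leaves} chunks, {depth} level(s) of link nodes")
```

Under these assumptions, anything up to roughly 43.5 MiB is a single root node pointing directly at its chunks, so the DAG layout makes no structural difference for such files.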

One might think of zip-like archives, ISO files, or videos as candidates, but that is not the case either: zip files are random access, ISO files can be mounted without reading the whole image, and video streaming with seeking is pretty much the norm.

I also cannot think of a really good use case here, so I would flag it as a "stable, but experimental" option.

RubenKelevra avatar Jun 14 '22 07:06 RubenKelevra

  • hash (why we use sha2-256)

I feel like I may not be the right person after all to write this article :D I wrote a ticket to change this default actually – and I still think blake2b is the better default. :)

So I guess "standards?" Or "legacy stuff we don't dare to change?"
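To make concrete what the hash choice changes, here is a small sketch that wraps the same bytes in a multihash using sha2-256 and blake2b-256. The multicodec codes (0x12 and 0xb220) are taken from the multiformats table as I understand it; the helper names `varint` and `multihash` are hypothetical, and this is not how Kubo builds a full CID (no CID version, codec, or base encoding here).

```python
import hashlib


def varint(n: int) -> bytes:
    """Unsigned LEB128 varint, as used by multiformats."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


SHA2_256 = 0x12       # assumed multicodec code for sha2-256
BLAKE2B_256 = 0xB220  # assumed multicodec code for blake2b-256


def multihash(code: int, digest: bytes) -> bytes:
    """<hash function code><digest length><digest>."""
    return varint(code) + varint(len(digest)) + digest


data = b"hello ipfs"
mh_sha = multihash(SHA2_256, hashlib.sha256(data).digest())
mh_blake = multihash(BLAKE2B_256, hashlib.blake2b(data, digest_size=32).digest())

print("sha2-256   :", mh_sha.hex())
print("blake2b-256:", mh_blake.hex())
```

Both digests are 32 bytes, but the prefix and the digest differ, so the same content gets a different CID depending on the hash function chosen at import time.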

RubenKelevra avatar Jun 14 '22 07:06 RubenKelevra

So overall, just the "why?" and rationale is the blocker for me to write it.

In my opinion, these should be the standards, and I don't see a good reason to use anything else. :)

  • Rolling chunker aka buzhash
  • cidv1
  • raw-leaves
  • blake2b-256

And I use them everywhere.
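As a sketch, the recipe above could be mapped onto ipfs add flags like this. `ipfs_add_args` is a hypothetical helper that only builds the argument list; it does not run anything, and the exact flag spellings should be checked against `ipfs add --help` for the installed version.

```python
from typing import List


def ipfs_add_args(path: str,
                  chunker: str = "buzhash",
                  hash_fn: str = "blake2b-256",
                  cid_version: int = 1,
                  raw_leaves: bool = True,
                  trickle: bool = False) -> List[str]:
    """Build an `ipfs add` command line for the given import parameters."""
    args = [
        "ipfs", "add",
        f"--chunker={chunker}",
        f"--hash={hash_fn}",
        f"--cid-version={cid_version}",
    ]
    if raw_leaves:
        args.append("--raw-leaves")  # only added when enabled
    if trickle:
        args.append("--trickle")     # trickle layout instead of balanced
    args.append(path)
    return args


print(" ".join(ipfs_add_args("backup.tar")))
```

The resulting list can be passed to `subprocess.run` on a machine where the ipfs binary is installed.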

So @lidel if you could just give some rationale for the whys (doesn't even need to be full sentences) I'm happy to write it. Just stop me if it gets too detailed ;)

RubenKelevra avatar Jun 14 '22 08:06 RubenKelevra

@RubenKelevra no need to write the whole thing, it is perfectly fine if you only write sections that you care about (even if it is only chunker) and open a PR draft with that, we will fill the gaps :)

You are right, many choices like default chunker are legacy decisions – just write that and note that different implementations of IPFS are free to choose different defaults (e.g. blake2b).

Totally, it will be useful to even give some "Recipes" like the one you listed with blake and buzhash, and elaborate on why one would prefer that over the "safe"/legacy defaults. :)

lidel avatar Jun 14 '22 11:06 lidel

@RubenKelevra no need to write the whole thing, it is perfectly fine if you only write sections that you care about (even if it is only chunker) and open a PR draft with that, we will fill the gaps :)

Alright. :)

You are right, many choices like default chunker are legacy decisions – just write that and note that different implementations of IPFS are free to choose different defaults (e.g. blake2b).

Totally, it will be useful to even give some "Recipes" like the one you listed with blake and buzhash, and elaborate on why one would prefer that over the "safe"/legacy defaults. :)

Maybe we should just add a "--use-legacy-defaults" flag to the daemon (and as a global flag for all commands) to free us from the concern that people rely on the current defaults.

This would also free us up to change, for example, the long-discussed default ports, which we also don't dare to change for similar reasons. :)

This way we can document the "legacy defaults" and why they were chosen once, and then elaborate on why the new defaults are better.

I feel that would make more sense when reading - and also more sense when using ipfs.

RubenKelevra avatar Jun 14 '22 12:06 RubenKelevra

@lidel triaging old issues, would you say this is still relevant?

ElPaisano avatar Aug 22 '23 00:08 ElPaisano

@ElPaisano yes, I believe that this is untapped potential in the IPFS ecosystem, and having some introductory docs might empower people to innovate in this area.

There is a need for two articles (or one with two sections):

  • introductory style that explains defaults and knobs in software like Kubo and Helia
  • DIY style on writing your own data onboarding tools which do custom chunking (good example in specs here and JS code here)

The goal would be to convey that chunking details are a userland feature: anyone can use default chunking or roll their own.
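A toy sketch of that idea: the chunker is just a function, and the root identifier depends on which one is used. This is not UnixFS and does not produce real CIDs; `by_size`, `by_lines`, and `toy_root` are hypothetical names, and the "root" is simply a SHA-256 over the concatenated leaf digests.

```python
import hashlib
from typing import Callable, List

Chunker = Callable[[bytes], List[bytes]]


def by_size(size: int) -> Chunker:
    """Default-style chunker: fixed-size pieces."""
    def chunk(data: bytes) -> List[bytes]:
        return [data[i:i + size] for i in range(0, len(data), size)] or [b""]
    return chunk


def by_lines(data: bytes) -> List[bytes]:
    """Custom chunker: one chunk per line, e.g. for append-only logs."""
    return data.splitlines(keepends=True) or [b""]


def toy_root(data: bytes, chunker: Chunker) -> str:
    """Hash each chunk, then hash the list of leaf digests (toy root only)."""
    leaf_digests = [hashlib.sha256(c).digest() for c in chunker(data)]
    return hashlib.sha256(b"".join(leaf_digests)).hexdigest()


log = b"a\nbb\nccc\n"
print("size-based root:", toy_root(log, by_size(4)))
print("line-based root:", toy_root(log, by_lines))
```

Both chunkers reassemble to the same bytes, yet the roots differ, which is why the import parameters end up affecting the final identifier.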

lidel avatar Aug 22 '23 17:08 lidel