kubo Gateway: CAR export with selector

Context

https://github.com/ipfs/go-ipfs/pull/8758 adds support for CAR export via Gateway. It exports the entire DAG as a CAR stream, which does not cover all use cases.

For example, thin clients may want to export a UnixFS directory's root block plus its immediate children, or progressively fetch a big DAG from multiple gateway endpoints.

Why we need selector support

Verifiable HTTP Gateway Responses (https://github.com/ipfs/in-web-browsers/issues/128)
- for mobile web browsers (content integrity without battery drain caused by full p2p)
  - mobile browser should be able to traverse huge unixfs directory tree without having to fetch everything (only root block + root blocks of immediate children are needed for generating useful dir listing)
- for IoT devices and other thin clients
  - fetching bigger DAGs progressively, load-balancing/falling back if some gateways are too slow/unreliable – makes HTTP more useful and pushes back the moment when an expensive p2p retrieval has to be spawned

Scope

[ ] query param
[ ] HTTP header
[ ] TBD configurable size budget for CAR stream + UnixFS downloads
[ ] TBD allow selectors everywhere? (UnixFS? dag-cbor/json?)

Proposed design (A) :-1:

The go-car library supports passing selectors; the idea is to add a parameter that does just that.

Either way, we have to URL-escape the selector somehow, so the choice is between encodeURIComponent and multibase encoding:

Text (JSON) representation:

/ipfs/{cid}?format=car&selector.json=encodeURIComponent({json serialization of selector})

Binary (CBOR) representation:

/ipfs/{cid}?format=car&selector.cbor=multibase({cbor serialization of selector})
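
A rough sketch of how a client might build both URL forms from design (A). The selector.json and selector.cbor parameter names come from the proposal above; the example selector (a depth-limited "explore all" in dag-json form), the choice of base64url as the multibase, and all helper names are assumptions for illustration. The CBOR bytes are passed in already encoded, since the Python standard library has no CBOR encoder.

```python
import base64
import json
from urllib.parse import quote

# Assumed example selector (dag-json form): follow all links, at most 1 level deep.
SELECTOR = {"R": {"l": {"depth": 1}, ":>": {"a": {">": {"@": {}}}}}}


def encode_uri_component(text: str) -> str:
    """Approximate JavaScript's encodeURIComponent with urllib."""
    return quote(text, safe="!*'()")


def multibase_base64url(data: bytes) -> str:
    """Multibase prefix 'u' means base64url without padding."""
    return "u" + base64.urlsafe_b64encode(data).decode("ascii").rstrip("=")


def car_url_json(cid: str, selector: dict) -> str:
    """Text (JSON) variant: percent-encode the serialized selector."""
    payload = json.dumps(selector, separators=(",", ":"))
    return f"/ipfs/{cid}?format=car&selector.json={encode_uri_component(payload)}"


def car_url_cbor(cid: str, selector_cbor: bytes) -> str:
    """Binary (CBOR) variant: multibase-encode pre-serialized CBOR bytes."""
    return f"/ipfs/{cid}?format=car&selector.cbor={multibase_base64url(selector_cbor)}"
```

Even for this tiny selector the JSON variant roughly triples in length once escaped, which is part of why neither form reads well in a URL bar.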

Proposed design (B) :+1: :green_heart:

/ipfs/{cid}?format=car&selector={cid2}

Here {cid2} is a CID representing selector data. It could be dag-cbor or dag-json. Small ones could be inlined (with identity hash), bigger ones could be fetched once and reused efficiently.
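
To make the "inlined with identity hash" case concrete, here is a minimal sketch that packs a small dag-json selector into a CIDv1 by hand. The layout (version, codec, then an identity multihash carrying the raw bytes) follows the multiformats conventions as I understand them; the example selector and the helper names are assumptions, and a real client would use a CID library instead.

```python
import base64
import json

DAG_JSON = 0x0129  # multicodec code for dag-json
IDENTITY = 0x00    # multihash code for the identity "hash"


def varint(n: int) -> bytes:
    """Unsigned LEB128 varint, as used by multiformats."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def inline_cid(data: bytes, codec: int = DAG_JSON) -> str:
    """CIDv1 whose multihash digest is the data itself, in base32 multibase."""
    multihash = varint(IDENTITY) + varint(len(data)) + data
    cid_bytes = varint(1) + varint(codec) + multihash
    return "b" + base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")


def selector_url(cid: str, selector: dict) -> str:
    """Build a design (B) URL with the selector inlined as {cid2}."""
    # Keys are sorted so the same selector always yields the same bytes.
    data = json.dumps(selector, separators=(",", ":"), sort_keys=True).encode("utf-8")
    return f"/ipfs/{cid}?format=car&selector={inline_cid(data)}"
```

The inlined CID grows with the selector, so this only stays practical for small selectors; larger ones would have to be published as a regular block and referenced by hash.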

Proposed design (C) :pray:

Better ideas would be really welcome here :eyes: Please comment below.

My initial thought was to have "single way of passing selectors", but if you find each approach brings value to different use cases, we could support both.

:point_right: NOTE: whatever we come up with here, we most likely want to support the same convention in ipfs dag CLI (and RPC API at /api/v0/dag/*) – see https://github.com/ipfs/go-ipfs/issues/8239

Mar 07 '22 18:03 lidel

I'm not personally a huge fan of selector.<codec>. I wonder if instead of multibase({cbor serialization of selector}) it could be a CID with identity hash, which would specify both the codec and the multibase.

Mar 07 '22 21:03 willscott

I like the idea of it being a CID! Small ones could be inlined, bigger ones could be fetched once and reused efficiently. Added it as (B)

Mar 08 '22 03:03 lidel

Note on cache control: the DAG walk implemented by IPLD is deterministic, so we could indicate that the response can be cached (TBD whether it is revalidated in the background).
Note on resuming partial downloads (think: IoT device on poor wifi): HTTP Range requests require knowing the total size of the CAR upfront, and we are unable to do that without fetching the entire thing first.
- This is why we should have CAR+selector based resume logic in place
- Q: "entire dag" selector is expensive. Should we refuse handling requests with no selector, and require people to provide one, always + have some predefined ones in docs, like "root+one level deep" before "full dag"?

Mar 14 '22 16:03 lidel

confirm traversal walks (and thus selectors) have a deterministic canonical order (and if that's not easy enough to point at in a specific heading in our specs and docs, that's a bug in the specs and docs).
- ... mind that CAR order is not deterministic per the CAR spec; CARs are just a bag of blocks. But it should be clear enough for some system to itself declare "this CAR must use the standard order" (and in practice right now I think all of our implementations already emit CARs that do so). Just a subtle distinction about who owns that decision, and which things validate or are strict about that.
fwiw, we did get some resumable selector features lately! https://github.com/ipld/go-ipld-prime/pull/358
fwiw, I think HTTP Range Requests would still be neat to try to support, if possible. I think a "dumb" HTTP cache around an IPFS Gateway being able to support Range requests on a CAR sounds like a nice-to-have. (But this isn't to detract from the comments we should have resumable selectors too, etc.)

Mar 14 '22 17:03 warpfork

fwiw, we did get some resumable selector features lately! https://github.com/ipld/go-ipld-prime/pull/358

My understanding is that this requires basically stored context on the node you are retrieving from, so is more like extra state for resuming a broken connection than resumable selectors.

fwiw, I think HTTP Range Requests would still be neat to try to support, if possible. I think a "dumb" HTTP cache around an IPFS Gateway being able to support Range requests on a CAR sounds like a nice-to-have. (But this isn't to detract from the comments we should have resumable selectors too, etc.)

IMO range requests for CAR files seem like an iffy thing to support on gateways. In the general case they're costly to create, so asking for bytes 1000MB-1001MB of a CAR file seems like a small request but in reality is very costly on the server. Since clients and servers may be run and developed by different parties, it wouldn't be great to encourage client developers to build tooling around range requests.
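
A toy model of that asymmetry, assuming the gateway generates the CAR on the fly and has no byte index into it. The function name is hypothetical, and the CAR header and per-block framing are ignored to keep the arithmetic simple.

```python
from typing import Iterable, Tuple


def blocks_needed_for_range(block_sizes: Iterable[int], start: int, end: int) -> Tuple[int, int]:
    """Return (blocks_walked, blocks_sent) for serving bytes [start, end).

    Without an index, the server has to walk the DAG from the first block
    until the running stream offset reaches `end`, even though only the
    blocks overlapping the requested range are actually sent.
    """
    offset = 0
    walked = 0
    sent = 0
    for size in block_sizes:
        if offset >= end:
            break
        walked += 1
        if offset + size > start:
            sent += 1
        offset += size
    return walked, sent


MB = 1024 * 1024
# A DAG of 2000 blocks of 1 MB each; the client asks for bytes 1000MB-1001MB.
walked, sent = blocks_needed_for_range([MB] * 2000, 1000 * MB, 1001 * MB)
```

In this model the server walks 1001 blocks to send one, which is the hidden cost behind a request that looks small to the client.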

Sometimes they're a good idea, for example IIUC https://github.com/filecoin-project/boost/ plans to allow for ingesting data as CAR files with range requests. However, IIUC they have a few benefits:

- the user they're downloading the data from must have computed the full CAR file ahead of time anyway (to get a CommP for a Filecoin deal)
- the user in any event needs to keep serving the data indefinitely until the transactions are completed because they are the ones requesting the download
- there is a built in expiration time for how long to keep the CAR file around which is "until the user is done uploading it to the relevant providers"

However, I suspect in our case having range requests all the time is a bad idea and having it only some of the time is more likely to cause confusion than not. I'm by no means an expert in the various HTTP tools that exist out there though, so maybe this "sometimes range request" pattern is common enough to be worth supporting.

Q: "entire dag" selector is expensive. Should we refuse handling requests with no selector, and require people to provide one, always + have some predefined ones in docs, like "root+one level deep" before "full dag"?

I don't know that I'd do this long before we put other limits on gateway usage like not downloading 100GB files over public gateways. If we want to allocate some configurable size budget for CAR + UnixFS downloads though that sounds pretty sane to me.
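
One possible shape for such a size budget, sketched as a wrapper around whatever produces the response chunks. The names and the choice to abort mid-stream (rather than reject upfront, which would require knowing the size first) are assumptions, not anything decided in this thread.

```python
from typing import Iterable, Iterator


class BudgetExceeded(Exception):
    """Raised when a response would grow past the configured byte budget."""


def limit_stream(chunks: Iterable[bytes], budget: int) -> Iterator[bytes]:
    """Pass chunks through until sending the next one would exceed `budget` bytes."""
    sent = 0
    for chunk in chunks:
        if sent + len(chunk) > budget:
            raise BudgetExceeded(f"response exceeds budget of {budget} bytes")
        sent += len(chunk)
        yield chunk
```

Because the total CAR size is not known before the walk, a budget like this can only cut the stream off partway, so clients would need to treat a truncated CAR as a signal to retry with a narrower selector.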

Yes, we should definitely have some recipes of common selectors or patterns of use. It's going to be a whole new way of people accessing data and therefore of confusing people. It's possible a few will be so common that it'll be worth considering aliasing them to something easier to read in a URL bar.

/ipfs/{cid}?format=car&selector={cid2}

This mostly makes sense to me, although there are a few footguns I think we should watch out for here. These aren't blockers and people will hopefully do mostly sane things, but IMO when writing new specs here it's better not to leave too much undefined as then you start having to assume the worst case scenario everywhere.

- Sane CID limits, I don't know what the magic number is here, but there's some number. Maybe the number isn't relevant here since URL limits might hit us first, but either way there is going to be some maximum CID size we're allowed. If it's relevant we should document it.
- I do think it's nice that unlike just sending the selector as a parameter there's a way to actually do the request even with larger selectors. However, a) magic numbers again, there's probably a maximum size of selector we're willing to deal with and if we don't decide then something else (e.g. the block size limit) will kick in here since IIUC the selector has to be a single block unless we start being able to pass selectors into the selector parameter 😄.
- Some consumers of the gateway API will be unable to advertise content which means that actually moving your "slightly too big" selector to a place where it can be consumed by gateway requests might be a big pain.

Perhaps off topic and related to https://github.com/ipfs/in-web-browsers/issues/182, and if so lmk and we can resume there.

@lidel this issue mentions CAR export with a selector like /ipfs/{cid}?format=car&selector.cbor=multibase({cbor serialization of selector})

- What happens if it's /ipfs/{cid}/some/path?format=car&selector.cbor=multibase({cbor serialization of selector})? Do we do the path resolution before the selector, or just error?
- Is there a reason selector usage has to be restricted to CAR export? Any reason we wouldn't want to do this for regular UnixFS rendering at least for files (i.e. if the output of the selector presents as bytes)? In theory this would then allow you to do something like /ipfs/{cid}?selector.cbor=multibase({cbor selector for an ADL interpreting BitTorrent infohash links as bytes}) and get a result on the gateway. Directories seem potentially more complicated though.

Mar 14 '22 23:03 aschmahmann

asking for bytes 1000MB-1001MB of a CAR file seems like a small request but in reality is very costly on the server

Agree, there is a dangerous resource usage asymmetry here, and no clear benefit when compared to progressive download with shallow selectors. I updated #8758 – it now returns a CAR stream with Accept-Ranges: none to avoid any confusion and incentivize people to use selectors instead.

If we want to allocate some configurable size budget for CAR + UnixFS downloads though that sounds pretty sane to me.

Yep, added to the TBD scope, we may extract it to separate issue.

Yes, we should definitely have some recipes of common selectors or patterns of use. [..] It's possible a few will be so common that it'll be worth considering aliasing them to something easier to read in a URL bar.

/ipfs/{cid}?format=car&selector={s} [..] Do we do the path resolution before the selector [..]

yes
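
A small sketch of that ordering, assuming the selector parameter from design (B): the content path is split off and resolved first, and the selector is then applied to whatever node the path resolves to. The function name and the returned dictionary are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def plan_request(url: str) -> dict:
    """Split a gateway URL into the steps a handler would run, in order."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2 or segments[0] != "ipfs":
        raise ValueError("expected a path of the form /ipfs/{cid}[/sub/path]")
    query = parse_qs(parts.query)
    return {
        "root": segments[1],           # step 1: load the root CID
        "path": segments[2:],          # step 2: resolve the remaining path
        "selector": query.get("selector", [None])[0],  # step 3: apply to the resolved node
        "format": query.get("format", [None])[0],
    }


plan = plan_request("/ipfs/bafyroot/some/path?format=car&selector=bafyselector")
```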

Is there a reason selector usage has to be restricted to CAR export?

no reason to restrict. as we discussed earlier this week, selector could be something we apply to default responses, in which case it would return stream of bytes + we already have means of customizing content-disposition filename for that: /ipfs/{cid}?selector={cid2}&download=true&filename=selector-output.bin

TBD if we want to allow that in this mvp, or add later.

Mar 16 '22 23:03 lidel

This turns out to be more involved, as we are lacking support for dag-json and dag-cbor in various places (e.g. https://github.com/ipfs/go-cid/pull/137, https://github.com/ipfs/go-ipfs/pull/8568). We can't ask users to provide selector CID in any of these formats if we do not support them correctly in our stack.

Blocked until we have dag-cbor and dag-json support story cleaned up in ipfs cid command and go-cid library.

Mar 23 '22 22:03 lidel

I'm working on a project that will want to use this work around verifiable gateway responses. From the discussion above, am I to understand that resuming downloads of CARs will require parsing the CAR as it's downloading, keeping track of the CIDs we want but have yet to receive, then, if the download is interrupted, constructing a new request containing the missing CIDs in a selector?

Especially in the low-powered servers use case, download resumption is going to be important, and if the CAR is to be served with Accept-Ranges: none, I'm curious about how we can address this efficiently.

Apr 03 '22 11:04 3456091

there's some work ongoing for more ergonomic selectors to support parts of this. There's recently been selector support added for representing the blocks that constitute a range of a unixfs file.

@hannahhoward - do you have thoughts on where in go-ipfs we need to respect the unixfs reifier / LargeBytes feature detection to get the same behavior as in graphsync?

Apr 03 '22 12:04 willscott

In my mind, CAR resumption will not be sending the same request again. The idea is for the client to be smart enough to import as many blocks as possible, and then send follow-up requests for DAG branches which are missing.
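
A minimal client-side sketch of that idea: after an interrupted download, walk from the root through the blocks that were imported and collect the links that point at blocks still missing. The data shape (a dict from each received CID to its ordered child links) and the function name are assumptions; decoding blocks to extract links is left out.

```python
from typing import Dict, List


def missing_branches(root: str, links_by_cid: Dict[str, List[str]]) -> List[str]:
    """Return CIDs reachable from `root` that have not been received yet.

    `links_by_cid` holds only blocks already imported. The walk is
    depth-first in link order, so the result is stable across runs.
    """
    missing: List[str] = []
    seen = set()
    stack = [root]
    while stack:
        cid = stack.pop()
        if cid in seen:
            continue
        seen.add(cid)
        if cid not in links_by_cid:
            missing.append(cid)  # branch to request in a follow-up
            continue
        # Reverse so the first link is popped (visited) first.
        stack.extend(reversed(links_by_cid[cid]))
    return missing


received = {"root": ["a", "b"], "a": ["a1", "a2"], "a1": []}
todo = missing_branches("root", received)
```

Each CID in the result can then become the root of a follow-up request, possibly sent to a different gateway.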

Jul 19 '22 21:07 lidel

Dropping some notes after IPFS Thing 2022:

- feels like we may want to do more UX work before we pull the trigger on this one
- subjective temperature check: ?selector=<selector-as-dag-json-cid> raises eyebrows, not the best UX-wise
  - ?selector= opens Pandora's box of allowing arbitrary selectors, so we would only safelist a few initially:
    - root+n-levels deep, n-levels without root, a leaf child along with all parents required for resolving it
    - @mikeal suggested hardcoding common selectors in form of predefined URI params.
      - I think we would need at least ?dag-depth=n to unblock use cases that need shallow CARs (n=1 would fetch only the root+child blocks)
- new open questions about adding /ipld/ and ipld:// appeared, and ways we could signal things like ADLs, schemas, and selectors in more intuitive, user-friendly way (cc @rangermauve)
  - one idea was to flesh out IPLD signaling around this new namespace, and then reuse it on /ipfs/ using ?ipld= parameter.
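
To illustrate what a ?dag-depth=n parameter could mean, here is a sketch of a depth-limited walk over an in-memory DAG. The function name and the breadth-first order are assumptions chosen for readability; an actual gateway would presumably emit blocks in whatever order the underlying selector traversal produces.

```python
from collections import deque
from typing import Dict, List


def blocks_within_depth(root: str, links: Dict[str, List[str]], depth: int) -> List[str]:
    """Return the root plus every block at most `depth` links away from it."""
    order = [root]
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        cid, level = queue.popleft()
        if level == depth:
            continue  # do not follow links past the requested depth
        for child in links.get(cid, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append((child, level + 1))
    return order


dag = {"root": ["a", "b"], "a": ["a1"], "b": []}
shallow = blocks_within_depth("root", dag, 1)
```

With n=1 this returns the root and its direct children only, which matches the directory-listing use case from the top of the issue.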

I am afraid this is blocked until we figure out some unified UX strategy for IPLD signaling (selectors, ADLs).

Jul 19 '22 21:07 lidel
