CAR CID Lengths

Open DavidBuchanan314 opened this issue 2 years ago • 1 comments

When dealing with CAR files, a minor annoyance of mine is the need to fully parse the CID of a block before you know where the actual block data starts (since CIDs are variable-length). Parsing CIDs is relatively complex, and it's something I'd like to avoid at the CAR-handling layer of my code.

In an ideal world, I'd like to scan over a CAR and represent it in memory as a k/v map, where keys and values are just opaque byte strings, with full parsing/validation of CIDs done only on an on-demand basis.

The most obvious way I can see to solve this is to have the CIDs stored with a varint prefix, storing the CID length, but maybe more elegant options exist.
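To make the idea concrete, here is a minimal sketch of what such a length-prefixed section could look like. The layout (`varint(section_len) | varint(cid_len) | cid | data`) and the function names are hypothetical and are not part of any existing CAR version; the point is only that a reader can split key from value without looking inside the CID.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    if n < 0:
        raise ValueError("varints are unsigned")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one varint starting at pos; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7
        if shift > 63:
            raise ValueError("varint too long")


def write_prefixed_section(cid: bytes, data: bytes) -> bytes:
    """Hypothetical layout: the CID is preceded by its own length."""
    body = encode_varint(len(cid)) + cid + data
    return encode_varint(len(body)) + body


def read_prefixed_section(buf: bytes, pos: int) -> tuple[bytes, bytes, int]:
    """Split a hypothetical section into (cid, data, next_pos); CID stays opaque."""
    section_len, body = decode_varint(buf, pos)
    end = body + section_len
    cid_len, cid_start = decode_varint(buf, body)
    cid_end = cid_start + cid_len
    if end > len(buf) or cid_end > end:
        raise ValueError("truncated section")
    return buf[cid_start:cid_end], buf[cid_end:end], end
```

With this layout the CAR-handling layer only needs a varint decoder; the CID bytes can be handed to a real CID parser later, on demand.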

This isn't a particularly pressing issue, and it's easily solved by parsing CIDs early, but I'd just like to put on the record that it's something I'd love to see in a future revision of the CAR format.

Feb 24 '24 19:02 DavidBuchanan314

@DavidBuchanan314 thanks for registering something for the CIDvX wishlist. We have a bunch of different use-cases to solve for. A couple of comments though:

You can bypass "full" CID decoding if you just accept that you have to read 5 varints at the start of each section. You're reading bytes anyway, and it's not going to be meaningfully more costly to have those bytes in memory (your disk/network/whatever is likely reading large enough chunks that asking for enough to read 5 varints isn't going to make any difference). Once you have them, you have (1) the whole section length, (2) the offset after reading the varints and (3) the digest length, which tells you where the CID ends. Thankfully, in almost all cases each of these varints is a single byte, but of course you have to make allowances for them not to be (e.g. the codec could be dag-json, so 2 bytes, and the multihash might be blake2b, which is also multi-byte). So you can skip CID validation as long as you're confident in your varint decoder. The 5 varints are:
1. Section length
2. CID version - ignore once parsed
3. Codec code - ignore once parsed
4. Multihash code - ignore once parsed
5. Multihash digest length - take note of this
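As a rough illustration of the five-varint approach, the sketch below splits a CARv1 byte buffer into opaque CID and block byte strings without validating the CIDs. The function names (`read_varint`, `split_section`, `index_car`) are made up for this example. It assumes the whole CAR is already in memory, and it special-cases CIDv0, which has no version or codec varints and always starts with the bytes `0x12 0x20`.

```python
def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one unsigned LEB128 varint; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7
        if shift > 63:
            raise ValueError("varint too long")


def split_section(buf: bytes, pos: int) -> tuple[bytes, bytes, int]:
    """Return (cid_bytes, block_bytes, next_section_pos) for the section at pos."""
    section_len, body = read_varint(buf, pos)        # 1. section length
    end = body + section_len
    if end > len(buf):
        raise ValueError("truncated section")
    if buf[body:body + 2] == b"\x12\x20":
        # CIDv0: a bare sha2-256 multihash (2 prefix bytes + 32 digest bytes).
        cid_end = body + 34
    else:
        _version, p = read_varint(buf, body)         # 2. CID version (ignored)
        _codec, p = read_varint(buf, p)              # 3. codec code (ignored)
        _mh_code, p = read_varint(buf, p)            # 4. multihash code (ignored)
        digest_len, p = read_varint(buf, p)          # 5. digest length
        cid_end = p + digest_len
    if cid_end > end:
        raise ValueError("CID runs past end of section")
    return buf[body:cid_end], buf[cid_end:end], end


def index_car(buf: bytes) -> dict[bytes, bytes]:
    """Map opaque CID bytes to opaque block bytes; the header is skipped undecoded."""
    header_len, pos = read_varint(buf, 0)
    pos += header_len
    blocks = {}
    while pos < len(buf):
        cid, data, pos = split_section(buf, pos)
        blocks[cid] = data
    return blocks
```

Nothing here checks that the version, codec or multihash codes are known values, or that the digest matches the block; that can still be done later, on demand, by a real CID parser.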
You may also want to have a look at CARv2, depending on your use-case; having an index at the end may be useful for these kinds of look-ups. Generally we don't consider CARv2 a good transport format (it's wasteful and there are trust concerns), but it's great for making a CAR into a really fast immutable blockstore if you can just load the index into memory and know exactly where the block offsets are.

Feb 26 '24 01:02 rvagg