
IPFS hash feature uses an unspecified algorithm which is not widely compatible in the ecosystem

Jorropo opened this issue on Jul 08 '23

It looks like you are implementing the Kubo defaults. Those defaults are nearing 10 years old and lack the newest features we support; I want to change them, so I am poking around the ecosystem for places that rely on those defaults.

https://github.com/ethereum/solidity/blob/develop/libsolutil/IpfsHash.cpp

Unixfs is an open format which allows multiple writer implementations to implement their own linking logic, such as append logs, content-aware chunking (cutting around logical boundaries in the content, such as I-frames in video files, entries in archive formats, ...), more packed representations, and so on, while all of those remain automatically compatible with all reader implementations. By design, this leads to inconsistent hashes in the ecosystem; there are multiple implementations that produce different CIDs for the same input.

Hopefully this serves as a demonstration that unixfs is good at tailoring to use cases, not at repeatable hashing of data.

I see 3 potential fixes:

  1. Add an option to the compiler to output a .car file. Instead of relying on ipfs add magically producing the same CID, you avoid running two chunkers: the solc chunker outputs the blocks in an archive, and the user imports them with ipfs dag import (which reads the blocks as-is instead of re-chunking). This is how chunkers are meant to be used (this, or some other transport than .car).
  2. Write a proposal for a new spec for repeatable unixfs chunkers inside ipfs/specs and implement it. You could then use a single-link inline CID with metadata to embed the chunking parameters into the CID itself. The CID would encode something like unixfs-balanced-chunksize-256KiB-dag-pb-leaves-... and could be fed into another implementation to reproduce the same result.
  3. Replace all the multiblock and dag-pb logic with a raw-blake3 CID. The reason we use the unixfs merkle-DAG format is that, unlike plain sha256, it supports easy incremental verification, seeking (downloading random parts of the file without having to download the full file), and has a very high exponential fanout (allowing parallel multipeer downloads). All of those features are built into well-specified hash functions, BLAKE3 being one of them. This removes support for the most esoteric features like custom chunking, but in exchange adding the same file multiple times always gives the same CID. BLAKE3 is also used by default by the new github.com/n0-computer/iroh implementation.

TL;DR:

You implement unixfs, which is not a specified, repeatable hash function (the same input can hash to different CIDs depending on how the internal merkle data structure is built, which is use-case dependent). Given that your use case is simple, usually small text files, I believe you should switch to plain BLAKE3 instead, which is a well-defined, fixed merkle tree (as opposed to the loose merkle DAG that unixfs is).

Note 0

Out of all the IPFS implementations I know, only iroh handles BLAKE3 incremental verification yet. Kubo & friends support BLAKE3, but only as a dumb hash, so they still use unixfs + BLAKE3 to handle files above the 1~4MiB block limit. We are interested in adding this capability in the future.

Note 1

Even though unixfs has a one-to-many file bytes → CID relationship, assuming cryptographically secure hash functions there is always a unique CID → bytes relationship.

Note 2

BLAKE3 might not be the best solution; what I am sure of is that relying on random, unspecified behaviour of some old piece of software is definitely wrong. :)

Jorropo · Jul 08 '23 15:07

Thanks for raising the issue. We at Sourcify make full use of this feature of Solidity to verify contracts, including the contract metadata IPFS hash, and we make the metadata and source files available on IPFS (see the playground).

I can't fully grasp the technical details but we ran into a similar reproducibility issue for some time when we switched our IPFS client to add with --nocopy. This changed the CID as it changed the chunking algorithm to use raw leaves instead of dag-pb, IIRC.

At the time, we also considered whether it would make sense to use dag-json to encode the contract metadata, which is a JSON object. From what I understand, this is a better way to encode a JSON object and would remove the potential indeterminism caused by formatting, key ordering, etc.

We are also serving all the files in our repo (/ipns/repo.sourcify.dev), which is basically a filesystem of millions of small files (metadata.json + Solidity contracts). This is at times painful to manage when moving, sharing, or having others pin the repo, and we were wondering if we could have a more optimal structure with IPLD or something similar to a database. The repo being only a filesystem also limits us in many ways compared to a DB, where we would be able to run queries and easily get stats/analytics for the repository. While discussing how the Solidity compiler does the CID encoding, it might make sense to keep this use case in mind too.

Looking forward to your input and discussion.

kuzdogan · Aug 11 '23 04:08

oh wow just saw this thread. if sourcify wants to design a more DB-like interface, research moving to CAR files as Jorropo is suggesting (the w3up CLI could be useful for testing out this approach), or anything else, feel free to ping me for help!

bumblefudge · Dec 14 '23 17:12

Since this issue is one of the metadata improvements requested by Sourcify, I looked into it now to see what we can do about it. The description is fairly dense and assumes some knowledge of IPLD and IPFS concepts, so I needed a bit of catching up. This post describes my current understanding of how things work - hopefully it'll save others the same effort. I'll post a response in a separate comment.

IPLD

This seems like a good place to start: A Terse, Quick IPLD Primer for the Engineer.

Data model

IPLD distinguishes between abstract data (which follows a data model that closely resembles JSON) and binary blocks (serializations of that data). Data in this system is basically a tree, where leaves are values of basic types (numbers, booleans, strings, etc.) and inner nodes represent more complex types (maps, lists). Data can be encoded into blocks and decoded back using a codec.

Codecs

Widely used codecs include:

  • DAG-CBOR, which produces binary CBOR representation and can handle the entire data model.
  • DAG-JSON, which produces a text JSON representation. It can handle only a subset of the data model - specifically it cannot deal with "/" keys and non-unicode strings.
  • DAG-PB, which produces binary protobuf messages. It's even more limited than DAG-JSON in terms of data model support.
  • raw is limited to data that represents a stream of bytes and produces the same stream of bytes as output.

Links

There can be directed links between blocks, which is how the data is connected into a DAG. A link is an abstract concept that represents some form of content-based addressing, but does not prescribe how it is realized. The concrete realization of that concept used by IPLD is the CID (Content IDentifier). Note that while a link is present in the abstract data, it can only point at an encoded representation, which necessarily requires a choice of codec and a specific representation for the link.

.car format

.car is a file format for storing IPLD blocks. It can store a complete DAG or a loose collection of encoded blocks.

Multiformats

A concept closely related to the CID is Multiformats, specifically multihash/multicodec/multibase. They simply represent an annotated value, e.g. a hash with the identifier of the hashing function attached to it, or a baseN encoding that also specifies N.

CID

There are currently 2 CID versions:

  • CIDv0 - just a base58-encoded multihash. The codec is not encoded in the CID itself; DAG-PB is implied.
  • CIDv1 - a more general format that includes a CID version, multibase, multicodec and a multihash.
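
To make the two formats concrete, here is a minimal sketch (Python, assuming the third-party base58 package; the input bytes are a placeholder rather than a real DAG-PB block) of how the same SHA-256 multihash is presented as a CIDv0 and as a CIDv1:

```python
# Minimal sketch: building a multihash and presenting it as CIDv0 and CIDv1.
# Assumes the third-party `base58` package (pip install base58); the input
# bytes are a placeholder, not a real encoded block.
import base64
import hashlib

import base58

block = b"<encoded block bytes would go here>"
digest = hashlib.sha256(block).digest()

# Multihash: <hash function code><digest length><digest>; 0x12 = sha2-256.
multihash = bytes([0x12, 0x20]) + digest

# CIDv0: just the base58btc-encoded multihash.
cid_v0 = base58.b58encode(multihash).decode()

# CIDv1: <version><multicodec><multihash>, wrapped in a multibase encoding.
# 0x01 = CIDv1, 0x70 = dag-pb, leading "b" = base32 (lowercase, unpadded).
cid_v1_bytes = bytes([0x01, 0x70]) + multihash
cid_v1 = "b" + base64.b32encode(cid_v1_bytes).decode().lower().rstrip("=")

print(cid_v0)  # Qm...
print(cid_v1)  # bafy...
```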

IPFS

IPFS uses IPLD concepts to implement a distributed filesystem. It actually provides two filesystems right now:

UnixFS uses IPLD to describe file content and metadata. Files and/or metadata are encoded into blocks. DAGs can represent whole directory trees.

Chunking

Using a single block to represent a huge file comes with many problems: you need the whole file to verify the block against the hash and you cannot easily seek to a random position or parallelize the download. To solve this problem, the UnixFS data format allows for splitting files into smaller chunks. As a result even a single file may map to multiple DAG blocks. The address of the file is really the address of the root of that file DAG.

The structure of the file DAG is not uniquely determined by the content of the file, because there is no single universal chunk size and there are different ways to link the blocks containing those chunks. Different clients will likely create different DAGs when given the same file, resulting in a different IPFS link (CID).
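
A toy illustration of why this matters (plain Python, not real UnixFS: real implementations build a DAG-PB tree rather than hashing concatenated digests): the "root hash" of chunked content changes with the chunking parameters even though the bytes are identical.

```python
# Toy example only: shows that a chunk-then-combine hash depends on the
# chunk size, which is the root cause of UnixFS CID irreproducibility.
import hashlib

def toy_root_hash(data: bytes, chunk_size: int) -> str:
    # Hash each fixed-size chunk, then hash the concatenation of the digests.
    chunk_digests = [
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]
    return hashlib.sha256(b"".join(chunk_digests)).hexdigest()

data = b"A" * 1_000_000
print(toy_root_hash(data, 256 * 1024))    # 256 KiB chunks
print(toy_root_hash(data, 1024 * 1024))   # 1 MiB chunks -> different "root"
```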

BLAKE3 in UnixFS

A special property of the BLAKE3 hashing algorithm is that it is built around a Merkle tree structure. It splits the input into fixed-size 1 KiB chunks, builds a tree over them and uses the root of the tree as the hash. Using it (or any other hashing algorithm that works like this) in the CID allows IPFS clients to exchange and verify pieces of a block, making explicit UnixFS chunking unnecessary.

BLAKE3 support in clients seems still relatively new, which is the limiting factor for its more widespread use.
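
For comparison, a small sketch assuming the third-party blake3 package (which mirrors the hashlib interface): no matter how the caller feeds the data in, the tree layout is fixed by the algorithm itself, so the digest is always the same.

```python
# Sketch assuming the third-party `blake3` package (pip install blake3).
import blake3

data = b"some file content " * 10_000

# Feed the data in arbitrary pieces; the internal chunking and tree layout
# are defined by BLAKE3 itself, not by the caller.
h = blake3.blake3()
for i in range(0, len(data), 4096):
    h.update(data[i:i + 4096])

assert h.digest() == blake3.blake3(data).digest()
print(h.hexdigest())
```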

Block encoding in UnixFS

UnixFS assumes that data is always encoded using either DAG-PB or raw. The spec actually describes the structure of already encoded (protobuf) blocks rather than the abstract data. When raw is used, it is implicitly assumed that the block contains the content of a single, unchunked file without any metadata. In the past DAG-PB was used for everything. Now the default is DAG-PB for inner nodes containing file metadata and raw for leaves storing the content.

UnixFSv2 seems to be in the works, with DAG-CBOR encoding used by default, but information about it is sparse. I've seen only a bunch of archived repos and some mentions in old forum posts. EDIT: According to @Jorropo the project is likely dead.

The data format is shown in the IPFS docs, though the description is incomplete. It does not fully explain how the content is actually linked together (i.e. PBLink and PBNode), making it seem that what it shows is the whole block. It also does not explain things like Tsize. There is a PR with a more complete UnixFS spec and it has that detail in src/unixfs-data-format.md. IPFS data types in ipfs-search docs also has some useful information.

IPFS links created by solc

solc uses CIDv0 links, which leave the codec unspecified (DAG-PB is implied).

The abstract data, in IPLD terms, is in our case the complete UnixFS data structure rather than our metadata JSON. The metadata JSON is only a part of it and is treated as an opaque binary blob. The codec used is DAG-PB, which is one of the two choices allowed by UnixFS (and might have been the only one back when we implemented this feature).
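
For illustration, here is a rough Python sketch of that construction for a small, single-chunk file. The protobuf fields are hand-encoded based on my reading of the UnixFS and DAG-PB schemas and the third-party base58 package is assumed; it is meant to show the shape of the computation, not to be a byte-for-byte reimplementation of IpfsHash.cpp.

```python
# Rough sketch of a single-chunk UnixFS file wrapped in a DAG-PB node,
# addressed by a CIDv0. Field layout is hand-encoded protobuf; not guaranteed
# to reproduce solc's exact CIDs.
import hashlib

import base58  # pip install base58 (assumption)

def varint(n: int) -> bytes:
    # Unsigned LEB128, as used by protobuf.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def unixfs_file_block(content: bytes) -> bytes:
    # UnixFS Data message: Type=File(2), Data=<content>, filesize=<len>.
    unixfs = (
        b"\x08\x02"                                 # field 1 (Type), varint 2
        + b"\x12" + varint(len(content)) + content  # field 2 (Data), bytes
        + b"\x18" + varint(len(content))            # field 3 (filesize), varint
    )
    # DAG-PB PBNode with no links and Data=<UnixFS message>.
    return b"\x0a" + varint(len(unixfs)) + unixfs

block = unixfs_file_block(b'{"compiler":{"version":"0.8.26"}}')
multihash = bytes([0x12, 0x20]) + hashlib.sha256(block).digest()
print(base58.b58encode(multihash).decode())  # CIDv0 (Qm...)
```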

The choice of hash function

IPLD seems to be agnostic to the choice of the hash function. Even CIDv0 used a multihash instead of mandating a specific algorithm. Multihash format includes a growing list of available hash IDs.

I do not see a closed list of hashes for either IPFS or UnixFS so if there are any limitations, they must be implementation-specific. In particular, it does not look like there's any inherent limitation preventing us from still using SHA256 with CIDv1. Theoretically it should be possible to use any block encoding, multiblock structure and chunking algorithm without switching to BLAKE3.

~My impression is that what the issue description calls a "hash function" is not the hashing algorithm but the whole transformation from file content to CID. Which is a valid way to look at things, but also makes the terminology ambiguous. It makes it sound like SHA256 is no longer supported by IPFS clients. It would be good to get some clarification on that.~

However, the advantage of BLAKE3 is that it makes explicit UnixFS chunking unnecessary in the first place. It allows the client to do implicit chunking while keeping the whole file in a single block.


EDIT: Text updated with new information from https://github.com/ethereum/solidity/issues/14389#issuecomment-3148718678.

cameel · Aug 03 '25 16:08

Now the question is what do we do about it.

CID version

As documented in Encoding of the Metadata Hash in the Bytecode, the type of hash we use is not hard-coded in the compiler. Currently we let users choose between:

  • ipfs: CIDv0, --metadata-hash ipfs
  • bzzr1: Swarm, --metadata-hash swarm
  • bzzr0: older Swarm, no longer selectable

We can easily add CIDv1 under a new ID, e.g. ipfs1 or cid1. We can also eventually make it the default, though that will have to wait for the next breaking version.
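
For reference, a sketch of how that hash can be read back out of deployed bytecode, assuming the third-party cbor2 package. Per the documented encoding, the runtime bytecode ends with a CBOR-encoded map followed by its two-byte big-endian length.

```python
# Sketch assuming the third-party `cbor2` package (pip install cbor2).
import cbor2

def extract_metadata_fields(runtime_bytecode: bytes) -> dict:
    # The last two bytes give the length of the CBOR-encoded map before them.
    cbor_length = int.from_bytes(runtime_bytecode[-2:], "big")
    cbor_payload = runtime_bytecode[-2 - cbor_length:-2]
    return cbor2.loads(cbor_payload)

# With --metadata-hash ipfs, the "ipfs" entry holds the raw 34-byte multihash;
# base58-encoding it yields the CIDv0 (Qm...).
```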

Open questions:

  • Naming
    • What ID to use for the new hash in CBOR metadata?
    • What name to use for the new hash in --metadata-hash?
      • The fact that --metadata-hash ipfs is not the recommended choice for IPFS will be unintuitive to users. Perhaps we should call it ipfs2, though that could also cause confusion once CIDv2 becomes a thing. ipfs1 would be more correct, but probably also confusing.
      • We could rename both in the next breaking version. E.g. ipfs0 and ipfs1 or cidv0 and cidv1.
  • Should we keep the ipfs hash type long term or deprecate it?

Hash function

We could switch to BLAKE3. The question is whether we have to. My impression so far is that SHA256 should still be an option, and even CIDv0 uses a multihash, so it is up to the client which one to use. The hashing thing mentioned in the description seems to be more about deterministic chunking than about a specific hashing algorithm, and BLAKE3 seems to be just a popular choice, not necessarily the only one.

The downside of BLAKE3 for us is that it would add another dependency to the codebase and we generally try to limit the number of those. We already have access to a perfectly fine hash function: KECCAK256. SHA256 is also just fine. The issue description also mentions potentially lacking support for BLAKE3 (though things could have changed in the last 2 years).

Open questions:

  • Is there any good reason why we can't choose KECCAK256 or SHA256?

Chunking

I agree that with how small the metadata usually is, the use of chunking seems like an unnecessary complication. The content is small enough that it should fit into a single chunk most of the time. Similarly with the .sol files. We should use non-chunked representation by default.

I don't think we can drop chunking completely though. From what I've read, clients have a maximum block size they support, on the order of 1 MB. That's pretty large for a source file, but not inconceivably so. With a lot of comments and whitespace, I could see real-life projects reaching that in corner cases. When the --metadata-literal option is used, this will also result in the JSON metadata being just as large.

Open questions:

  • Has any established practice for this appeared by now?
    • If not, is there any significant downside to just continuing what we do now for such files?

Exposing the IPFS blocks in a .car container

I see very little downside to this, regardless of what else we change. We should do this. Internally, the compiler has to generate those blocks anyway to generate the CID. Adding an output that returns it is not all that much effort on top.

The main reason not to do this would have been the assumption that this is unnecessary and there is an unambiguous transformation from file content to an IPFS link. @Jorropo's post clearly shows this is not the case.

The container should include both the metadata JSON and the source files it links to.

Open questions:

  • Naming
    • What to call the output. --metadata-car?

Encoding

From the perspective of solc, the encoding is an implementation detail. What matters is the content of the JSON metadata and the .sol files. As long as it can be retrieved and gets a predictable address, the format used to store it in the P2P network matters very little. We should use whatever is common. Switching to raw for unchunked files seems perfectly fine to me.

DAG-JSON

As for DAG-JSON, I think there is a bit of a misunderstanding here. Switching to DAG-JSON would just result in UnixFS blocks being JSON-encoded, which does not even seem to be an option with UnixFSv1. It would not affect our metadata JSON, which would still be a blob of serialized JSON, with its structure being opaque to IPFS.

As far as I can tell, IPFS is designed to use IPLD only for its own data structures. User data is assumed to be a stream of bytes. It maps to a byte array in the IPLD data model. There is no other way to represent it in IPFS. Since our metadata is JSON, it would be better if we could represent it in a structured way, but DAG-JSON does not seem to be the way to do this.

Canonical serialization format for JSON metadata

If the goal is to make the metadata CID independent of things like whitespace or key sorting in the original JSON input, DAG-JSON won't help. What we can do instead is document rules for serializing it into a canonical form, rather than assuming that whatever solc happens to output is that form (a rough sketch of what such rules could look like follows the list below). This will have other benefits, like making it actually portable between different JSON libraries and removing annoyances such as solc being unable to format it in a more readable way without affecting the CID.

I think we should do that and it's actually independent of the IPFS changes requested in this issue. We just need to:

  1. Document the canonicalization rules.
  2. Introduce a new metadata output that returns proper, formatted JSON rather than a string containing serialized JSON.
    • If we do it in a breaking version, I think it would also be fine to just make --metadata output work like this. The types are different (string vs object), so it will not even be ambiguous which one you got.
    • We could also consider keeping the original output as something like --metadata-canonical to let tools inspect what the canonical form looks like.
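
As a concrete illustration, here is a minimal sketch (standard-library Python) of what such canonicalization rules could look like. The specific rules shown (sorted keys, no insignificant whitespace, UTF-8 output) are a hypothetical example, not what solc does today.

```python
# Hypothetical canonicalization rules for the metadata JSON.
import json

def canonicalize(metadata: dict) -> bytes:
    return json.dumps(
        metadata,
        sort_keys=True,          # fixed key order
        separators=(",", ":"),   # no insignificant whitespace
        ensure_ascii=False,      # keep UTF-8 as-is instead of \u escapes
    ).encode("utf-8")

a = {"language": "Solidity", "settings": {"optimizer": {"enabled": True}}}
b = {"settings": {"optimizer": {"enabled": True}}, "language": "Solidity"}
assert canonicalize(a) == canonicalize(b)  # same bytes, hence same CID
```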

Open questions:

  • How to know whether the metadata stored on IPFS follows those rules and can be safely deserialized and serialized without having to store the original string?
    • Option A: set the rules based on what solc does today. This way everything produced by solc so far by definition follows them. However, this is only an option if what it does is consistent.
    • Option B: bump the metadata format version and do this only with v2.

solc-bin

Whatever we choose, we should switch to the same CID and chunking settings in solc-bin. The fact that the method we're using is in some way outdated actually came up in #12421. Old libs were deprecated, and what should have been a simple change resulted in a refactor that seemed overly complex for what it was and was hindered by our limited understanding of what actually changed on the IPFS side.

cameel · Aug 03 '25 16:08

UnixFSv2 seems to be in the works, with DAG-CBOR encoding used by default, but information about it is sparse. I've seen only a bunch of archived repos and some mentions in old forum posts.

It has been a year since I stopped working on IPFS, but from memory it was already a dead project back then.


We could switch to BLAKE3. The question is whether we have to. My impression so far is that SHA256 should still be an option, and even CIDv0 uses a multihash, so it is up to the client which one to use. The hashing thing mentioned in the description seems to be more about deterministic chunking than about a specific hashing algorithm, and BLAKE3 seems to be just a popular choice, not necessarily the only one.

The reason to use BLAKE3 is that BLAKE3 itself is an incrementally verifiable merkle tree. Unlike unixfs, however, it is a hash function, which means BLAKE3 specifies the chunking algorithm to use (1 KiB chunks, keyed by their position in the file) as well as the merkle tree parameters (binary fan-out, a dedicated function to combine two nodes into one, ...). IPLD-wise you then use cidv1-raw-blake3 and in theory lose no features, but gain that the same bytes in always produce the same hash out, even if different people wrote the implementations.

Ecosystem wise unless things changed since a year ago:

  • iroh only supports blake3, with incremental verification, so it can load arbitrarily big blake3-addressed blobs
  • Kubo & all applications based on the same stack support blake3, but not incrementally, which means there is a limit of roughly 2MiB on how big any blake3 block can be.

The reason I'm arguing against chunking here is that there is no unixfs chunking spec; there are infinitely many valid ways to chunk files. A unixfs parser is able to decode all of them properly, but you can't expect two arbitrary unixfs encoders to output the same CID given the same bytes.


I don't really understand the part about CBOR. CBOR is a way to encode abstract IPLD data into bytes. But it is still the case that chunking and unixfs parameters depend on the encoder implementation.


You say you don't mind outputting a .car; imo that's a satisfactory solution. Other IPFS clients can import .car files without having to re-chunk, so the hashes stay as-is. It only adds new things to solc and does not change how anything already there works. This might be the least intrusive solution here.

Jorropo · Aug 03 '25 21:08

@Jorropo Oh, I see. Thanks for the extra detail. So BLAKE3 is important because it's an implicit way to chunk file content without making blocks smaller. And that's why UnixFS chunking is not being standardized in any way - the idea is to eventually abandon UnixFS chunking in favor of raw blocks with BLAKE3 (EDIT: only in use cases that require CID reproducibility), which can get arbitrarily large. It was unclear to me that there are no other reasons to limit the size of a block so I assumed that UnixFS chunking would still be needed, even with raw blocks.

In that case it seems like BLAKE3 is the way forward if we care about people being eventually able to simply hand the files to their IPFS client and expect it to get the same CID without doing anything extra. Though given how messy this still is, it will probably still require the user to at least inspect the CID to figure out the right settings for their client. Which is not ideal, but can be solved with documentation.

I don't really understand the part about CBOR.

Not sure which part you mean, but I guess the one about CBOR metadata? We use the CBOR encoding in the compiler when we append metadata hashes to the bytecode. It was not about CBOR in IPLD.

You say you don't mind outputting a car, imo that a satisfactory solution, other IPFS clients can import car files without having to rechunk so the hashes stay as-is.

Well, now that I understand what you meant about BLAKE3, it seems that we could probably do without the .car output. Even if not all clients can be coerced to use cidv1-raw-blake3 right now, it sounds like it eventually will be the case and for us blocks over 2 MB are just a rare corner case, so we could probably live with it for some time. But still, adding .car output is not a big problem, so there is not much downside to doing it anyway. It will ensure that any problem of this kind in the future can be worked around by just giving the client the blocks to serve.

cameel · Aug 07 '25 12:08

We use the CBOR encoding in the compiler when we append metadata hashes to the bytecode. It was not about CBOR in IPLD.

That explains it thanks.

Even if not all clients can be coerced to use cidv1-raw-blake3 right now, it sounds like it eventually will be the case and for us blocks over 2 MB are just a rare corner case

Using Kubo ipfs add --raw-leaves --hash=blake3 --chunker=size-1048576 works up to 1MB. Using Iroh it just works ✨, up to any size.


I think it's up to you to decide what should be done in solc (.car output or cidv1-raw-blake3). There is no ecosystem consensus that unixfs should be replaced; a fixed reproducible chunker is a downside anytime reproducibility is not needed.

Jorropo · Aug 07 '25 13:08

a fixed reproducible chunker is a downside anytime reproducibility is not needed.

Hmm... I thought that being able to independently serve the same content under the same link was a pretty major selling point of IPFS. I'd have expected anything limiting that to be a downside. If it's not reproducible, you may just as well put it on a server.

For us it's definitely a key feature - we are not hosting the content ourselves and there are multiple independent parties who may want to do so (user, verification services). They must all match the link in the bytecode to make it discoverable. So far we assumed that to be straightforward.

cameel · Aug 07 '25 16:08

Hmm... I thought that being able to independently serve the same content under the same link was a pretty major selling point of IPFS. I'd have expected anything limiting that to be a downside. If it's not reproducible, you may just as well put it on a server.

For us it's definitely a key feature - we are not hosting the content ourselves and there are multiple independent parties who may want to do so (user, verification services). They must all match the link in the bytecode to make it discoverable. So far we assumed that to be straightforward.

Not everyone agrees with me, but you don't need to rehash and rechunk content to achieve that. You can import already-chunked data (like a .car file). You can insert links to existing CIDs directly.

FWIW Kubo, the first IPFS implementation, never broke its hashes; it has consistent hashing and options, and ipfs add has been giving the same CIDs for more than a decade. However, the defaults are really old. Any new feature that would break hashes is opt-in.

The problems I see:

  • other implementations decided that they didn't have to use default parameters that are nearly a decade obsolete.
  • there is no specification on how someone else would implement Kubo's format
  • At the time I've opened this issue I was considering updating the defaults.
    • I am no longer working on this.

Jorropo · Aug 07 '25 17:08

Thanks @Jorropo for opening this issue and all the input here.

@cameel, your summary is pretty accurate, though there's a bit more nuance to the discussion around BLAKE3, so I thought it might be helpful to provide a bit more context and set the record straight with regards to the current state of UnixFS, CIDs, and the alternatives being discussed.

UnixFS and CID determinism

As both @cameel and @Jorropo mentioned, the main challenge with UnixFS is that CIDs are not deterministic for the same input. This means that the same file tree can yield different CIDs depending on the parameters used by the implementation to generate it, which in some cases aren't even configurable by the user.

This challenge has been discussed at length recently within the community and has led to a proposal to add CID profiles for UnixFS.

The CID Profiles IPIP introduces a UnixFS configuration profile for files and directories. It starts with a single named profile, provisionally called unixfs-2025, which adopts a baseline set of parameters for UnixFS CIDs, such as chunk size, DAG width, HAMT params, and layout, based on the best practices and lessons learned over the years. The goal is to provide a consistent and deterministic way to generate CIDs for UnixFS data across different implementations and to encourage implementations to share the same defaults.

This would help with deterministic CID generation for the same data, regardless of the implementation. Moreover, it avoids the need to store and move around merkleized data in CAR files (see next section), as the CIDs would be deterministic and consistent across implementations.

Most of the work has already been done to map the current defaults and to ensure IPFS implementations have controls for the relevant UnixFS configuration parameters necessary to generate the same CID (e.g. https://github.com/ipfs/kubo/pull/10774); now it's just a matter of agreeing on a set of defaults for the profile.

UnixFSv2 seems to be in the works, with DAG-CBOR encoding used by default, but information about it is sparse. I've seen only a bunch of archived repos and some mentions in old forum posts. EDIT: According to @Jorropo the project is likely dead.

UnixFSv2 is not actively worked on.

CAR files and Merkleized data

The approach proposed by @Jorropo in the original issue is to use CAR files, which contain the merkleized representation of data with the CID of the root node. CAR files are widely adopted and provide a way to bundle IPFS data with its complete DAG structure.

BLAKE3 and UnixFS

BLAKE3 along with BAO is pretty neat in that it solves the incremental verification problem in a similar fashion to UnixFS, with the added benefit of a canonical DAG structure, which avoids the UnixFS CID reproducibility problem. Moreover, benchmarks suggest that BLAKE3 is much more efficient than SHA-256.

However, as of today, there's still not as much tooling and public infrastructure to support BLAKE3 and BAO at scale in the IPFS ecosystem. The one exception to this is iroh-blobs, but Iroh isn't interoperable with the rest of the IPFS ecosystem.

You can use BLAKE3 with UnixFS, but it's only used as a hash function without using the internal BLAKE3 merkle tree structure. This means you can still verify data integrity incrementally, but that happens on the UnixFS layer rather than at the BLAKE3 layer.

Therefore, if you plan on sticking with UnixFS, I'd probably continue using the current hashing scheme (SHA-256) for now. This will ensure compatibility with existing tools and infrastructure in the IPFS ecosystem.

Moreover, there's a fair bit of anecdotal evidence suggesting that BLAKE3 is currently slow in browsers — I'm not sure if that's relevant for your use case, but worth considering. It may change in the future as BLAKE3 matures and gains wider adoption, but for now, sticking with SHA-256 is likely the safer bet.

Raw encoding with BLAKE3

If you know that most of the files you will address with CIDs are relatively small (<2MiB), you could consider using raw encoding (forgoing any UnixFS encoding) and BLAKE3 hashes.

This gives you deterministic CIDs, because you are just hashing the raw data with BLAKE3, and in the future you may even be able to do incremental verification using BAO (see the sketch after the list of limitations below). However, this approach has some limitations:

  • It doesn't support directories or complex data structures, so it's only suitable for files.
  • If the files are larger than 2MiB, you will have trouble moving these blocks across most of the popular IPFS implementations, as they typically reject blocks larger than 2MiB.
  • You can't persist the filenames since they aren't part of the file structure, so you will need to manage the mapping between filenames and CIDs separately.
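
As referenced above, here is a sketch of constructing such a cidv1-raw-blake3 CID (Python, assuming the third-party blake3 package; 0x55 = raw and 0x1e = blake3 are taken from the multicodec table):

```python
# Sketch of a cidv1-raw-blake3 CID for a small file.
# Assumes the third-party `blake3` package (pip install blake3).
import base64

import blake3

def cidv1_raw_blake3(data: bytes) -> str:
    digest = blake3.blake3(data).digest()        # 32-byte default output
    multihash = bytes([0x1E, 0x20]) + digest     # <blake3><length><digest>
    cid_bytes = bytes([0x01, 0x55]) + multihash  # <cidv1><raw codec><multihash>
    # Leading "b" multibase prefix = base32, lowercase, unpadded.
    return "b" + base64.b32encode(cid_bytes).decode().lower().rstrip("=")

print(cidv1_raw_blake3(b'{"compiler":{"version":"0.8.26"}}'))
```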

Canonical serialization format for JSON metadata

The container should include both the metadata JSON and the source files it links to.

If I understand correctly, you have the Solidity files, which you merkleize with UnixFS to get CIDs, and then some additional metadata JSON in which you link to the CIDs of the Solidity files.

What we can do instead is document rules for serializing it into a canonical form, rather than assuming that whatever solc happens to output is that form. This will have other benefits, like making it actually portable between different JSON libraries and removing annoyances such as solc being unable to format it in a more readable way without affecting the CID.

Instead of going through the trouble of defining a new set of canonicalisation rules for the metadata JSON, you can leverage the dag-cbor codec in the IPLD/IPFS ecosystem. This is the same approach taken by ATProtocol. Note that all dag-cbor data is valid CBOR and can be read with any CBOR library. In other words, dag-cbor is a subset of CBOR that is deterministic and canonical.

dag-cbor will ensure that your metadata is serialized deterministically, producing the same CID irrespective of whitespace, while also making it much smaller thanks to binary encoding.
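
A small sketch of that property, assuming the third-party dag_cbor package (the encode function shown is my assumption of its API): the same logical map serializes to the same bytes regardless of the order it was built in, so the derived hash/CID is stable.

```python
# Sketch assuming the third-party `dag_cbor` package (pip install dag-cbor).
import hashlib

import dag_cbor  # assumption: provides encode()/decode() for DAG-CBOR

a = dag_cbor.encode({"language": "Solidity", "version": 1})
b = dag_cbor.encode({"version": 1, "language": "Solidity"})

# Deterministic key ordering means identical bytes, hence an identical hash.
assert a == b
print(hashlib.sha256(a).hexdigest())
```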

How to know whether the metadata stored on IPFS follows those rules and can be safely deserialized and serialized without having to store the original string?

For metadata, dag-cbor is your friend.

Additional notes

Hmm... I thought that being able to independently serve the same content under the same link was a pretty major selling point of IPFS. I'd have expected anything limiting that to be a downside. If it's not reproducible, you may just as well put it on a server.

This is in part a question of how data is replicated and shared. If the data is initially merkleized, addressed by CID, and pinned to a node, and others use the CID to replicate and pin additional copies, you avoid the CID determinism problem.

The problems arise when multiple parties onboard the same data independently, using different implementations or versions of the same implementation, which may lead to different CIDs for the same content. This is where the CID Profiles IPIP comes in, as it aims to standardize the parameters used to generate CIDs for UnixFS data, ensuring that the same content yields the same CID across implementations by default (while still allowing for custom profiles and parameters).

Final thoughts

In general, my recommendation is to stick with UnixFS. If you follow the unixfs-2025 profile (at the very least 1MiB chunk size and raw encoding for leaf nodes), for files under 1MiB, the CID is just a SHA-256 hash of the file, and for files larger than 1MiB, the CID is derived from the UnixFS DAG structure with a 1MiB chunk size.

This will allow you to take advantage of the existing tools and infrastructure while ensuring that your CIDs are reproducible and verifiable.

2color · Aug 18 '25 14:08