multihash Size limit of identity hash

As I've been working on the Rust implementation of Multihash, it came up that the identity hash currently doesn't specify any limits. From an optimization perspective (this is why it came up in Rust), but also from a security perspective I think it would make sense to specify an upper bound for its size.

I personally would take a quite low limit which is similar to what current hash functions have as length. So perhaps something around 64 bytes?

Jul 09 '20 08:07 vmx

64 bytes is most definitely a non-starter, as it already won't fit the prefix/varint framing of a 512-bit hash.
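To make the arithmetic concrete, here is a minimal sketch of the multihash framing (`<varint code><varint length><digest>`). The helper names `encode_uvarint` and `encode_multihash` are made up for this illustration; `0x13` is the multicodec table entry for sha2-512.

```python
import hashlib


def encode_uvarint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_multihash(code: int, digest: bytes) -> bytes:
    """Frame a digest as <varint code><varint length><digest>."""
    return encode_uvarint(code) + encode_uvarint(len(digest)) + digest


SHA2_512 = 0x13  # multicodec entry for sha2-512

mh = encode_multihash(SHA2_512, hashlib.sha512(b"hello").digest())
print(len(mh))  # 66: 1 byte code + 1 byte length + 64 byte digest
```

So a 64-byte cap applied to the whole multihash would already exclude a full sha2-512 digest; a cap applied only to the digest would just barely admit it.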

More generally: any data that you know won't ever be repeated is a good candidate for inlining. Also an upper limit already exists: the limit of a network block itself ( 1MiB soft, 2MiB-1 hard ).

It would make me sad if we put an arbitrary small limit bound by the ability to print out a CID. Also note that other visible system parts, e.g. dirnames have no limits at present.

Last but not least: there are currently production systems that do inline ~2k or was it ~4k of data on the dht today ( /cc @ianopolous )

Jul 09 '20 08:07 ribasushi

Also an upper limit already exists: the limit of a network block itself ( 1MiB soft, 2MiB-1 hard ).

That's only a current IPFS limitation. It's not a general limitation if you look at Multihash in isolation.

To me identity hashing makes sense to save space if the hashing output is bigger than the input. If you use it for something else, then to me it has the smell of some hack around some limitations that could/should be solved in a better way.

Jul 09 '20 08:07 vmx

There are already limits in the go code. We've been using it to inline stuff in Peergos <= 4096+16 bytes (after discussing with core ipfs devs, but I can't find the issue right now @Stebalien, @whyrusleeping ? ). There are earlier discussions on this: https://github.com/ipfs/go-ipfs/issues/4918 https://github.com/multiformats/cid/issues/21

cbor-gen also seems to have just added a limit of 100 for this: https://github.com/whyrusleeping/cbor-gen/pull/24

Jul 09 '20 08:07 ianopolous

@vmx 2MiB-1 is a libp2p limit - it permeates everything ( but you are correct: multihash in isolation is limited by uvarint alone )

It is generally correct that inlining is there to realize savings, but they come in more forms than just space:

absolute space saving ( the one you had in mind ): anything shorter than cid-prefix+hashbits is a win if you get to inline it
storage/index space saving: multiple-logical-blocks packaged as a single physical block by recursive inlining aids storage-page utilization/packing
transport space/time savings: a recursively-inlined dataset can be seen as a "mini-graphsync optimization": i.e. you ask me for node X => you automatically and invariably get its contained nodes Y Z Q and R, enjoy!
operation savings: if you have a construct with multiple nested known-random entries ( i.e. a set of public keys ), having each entry addressable by CID but not having to do an encode/decode pass is likely valuable in contexts I can't think of right this moment. Perhaps @ianopolous can go further into their 4096 + 16 use-case...?
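The first kind of saving can be sketched numerically. This is only one reading of "shorter than cid-prefix+hashbits": it compares the length of a binary CIDv1 carrying the data inline with one carrying a digest. The function names are hypothetical; the codes `0x00` (identity), `0x12` (sha2-256) and `0x71` (dag-cbor) come from the multicodec table.

```python
def uvarint_len(n: int) -> int:
    """Number of bytes needed to encode n as an unsigned varint."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def cid_v1_len(codec: int, hash_code: int, digest_len: int) -> int:
    """Binary CIDv1 length: version + codec + multihash(code, length, digest)."""
    return (
        uvarint_len(1)
        + uvarint_len(codec)
        + uvarint_len(hash_code)
        + uvarint_len(digest_len)
        + digest_len
    )


IDENTITY = 0x00
SHA2_256 = 0x12
DAG_CBOR = 0x71


def inline_cid_not_longer(data_len: int, codec: int = DAG_CBOR,
                          hash_code: int = SHA2_256, digest_len: int = 32) -> bool:
    """True if an identity CID holding the data is no longer than the hashed CID."""
    return cid_v1_len(codec, IDENTITY, data_len) <= cid_v1_len(codec, hash_code, digest_len)


print(cid_v1_len(DAG_CBOR, SHA2_256, 32))  # 36
print(inline_cid_not_longer(32))           # True
print(inline_cid_not_longer(33))           # False
```

The other three kinds of saving in the list are about packing, round trips and decode passes, so they do not reduce to a size comparison like this one.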

TLDR: We definitely should define a formal limit. I definitely think this limit needs to be in the high kilobyte range, way beyond the proposed 64 bytes.

Jul 09 '20 09:07 ribasushi

way beyond the proposed 64 bytes.

I care more about having an actual limit, than what the limit actually is.

Jul 09 '20 10:07 vmx

I found the discussion I was referring to: https://github.com/ipfs/go-cid/pull/88

Jul 09 '20 11:07 ianopolous

We generally assume that CIDs are small and cheap to handle. CIDs are designed to be identifiers and inline CIDs are designed to allow inlining objects that are smaller than the identifier. Keeping this assumption valid is extremely important for performance and usability.

Allowing for large (more than 100-128 byte) CIDs makes it very difficult to reason about sizes.
CIDs show up in paths. Ridiculously long garbage in paths is not user friendly.
CIDs now show up in subdomains. The current plan here is to re-hash large CIDs in subdomains to make them fit (resolve the object, then create a new CID with a smaller hash digest) but that will impact content routing.

Last time we discussed this, we set the default limit to 32 bytes to be safe and marked the relevant options in go-ipfs as experimental.

Really, it sounds like the need here is for a way to pass around blocks with type information, right? We may need to distinguish between the two cases.

Jul 09 '20 16:07 Stebalien

The current plan here is to re-hash large CIDs in subdomains to make them fit (resolve the object, then create a new CID with a smaller hash digest) but that will impact content routing.

Afaik this part got entirely nixed, and the current implementation will not support over-long CIDs ( /cc @lidel )

Jul 09 '20 16:07 ribasushi

Where/how would we enforce the limit? We’ve had very similar threads about block size limits and ended up punting any hard limits to the network and storage layer, which in this case isn’t really an option since they rarely decode the block data.

Also, just as a matter of fact, many of our existing libraries don’t handle identity multihashes in CID’s correctly and will hand the Link/CID type off to the storage layer to get as if it were any other CID. Similarly, our Block interfaces do not all support data encoded to/from the identity multihash. The strategy for representing and considering “inline blocks” is still in the experimental phase when it’s handled at all.

This is a good time for this discussion given that inline block support is still under development, but I want to make sure that we’re setting the right expectations about where this might land.

Jul 09 '20 18:07 mikeal

Where/how would we enforce the limit?

When decoding/validating CIDs?
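As a rough sketch of what enforcement at decode time could look like: the declared length is checked before the payload is touched. The function names and the `MAX_IDENTITY_DIGEST` value are assumptions for illustration only; the thread does not settle on a number.

```python
from typing import Tuple

IDENTITY = 0x00
MAX_IDENTITY_DIGEST = 64  # hypothetical limit, not a value agreed in this thread


def decode_uvarint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode an unsigned varint at buf[pos:], returning (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7
        if shift > 56:  # a tenth byte would be needed
            raise ValueError("varint longer than 9 bytes")


def decode_multihash(buf: bytes, max_identity: int = MAX_IDENTITY_DIGEST) -> Tuple[int, bytes]:
    """Decode <code><length><digest>, rejecting oversized identity digests."""
    code, pos = decode_uvarint(buf)
    length, pos = decode_uvarint(buf, pos)
    if code == IDENTITY and length > max_identity:
        raise ValueError(f"identity digest of {length} bytes exceeds limit of {max_identity}")
    digest = buf[pos:pos + length]
    if len(digest) != length:
        raise ValueError("digest shorter than declared length")
    return code, digest


print(decode_multihash(b"\x00\x03abc"))  # (0, b'abc')
```

Making `max_identity` a parameter keeps the question of who picks the limit (spec, library, or application) open in this sketch.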

Also, just as a matter of fact, many of our existing libraries don’t handle identity multihashes in CID’s correctly and will hand the Link/CID type off to the storage layer to get as if it were any other CID. Similarly, our Block interfaces do not all support data encoded to/from the identity multihash. The strategy for representing and considering “inline blocks” is still in the experimental phase when it’s handled at all.

Actual support in block stores is just an optimization. Inline CIDs are still usable as normal CIDs.

Jul 10 '20 00:07 Stebalien

We identified two cases:

Data is smaller than the hash would be

I'd call this one the original use case. There seems to be agreement that for this use case, there should be a really low limit, in the bytes, rather than KiB range.

Inlining blocks/blocks with type information

This is the Peergos use case. It's more of an optimization.

Now the question is. Could we come up with a solution for the Peergos use case, which Peergos could upgrade to, while limiting the identity hash to a small size as some libraries already do (and also I'd be in favour of).

Jul 10 '20 09:07 vmx

It seems clear to me that there is a breaking change coming here. So, for what it's worth, I've come up with a backwards compatible solution and we're stopping using identity multihashes to inline small blocks in Peergos now. We can't migrate existing users (we don't have their keys), but I've realised that those that are on our servers are actually ok now because we are using the S3 datastore and managing pins ourselves (indeed if they're logging in through our server they're not even using ipfs at all). Any new users, new data written, or old data modified will not use identity multihashes. I think we can write a gradual migration that runs when they log in. It's painful, but the risk is too high for us.

The largest identity multihash we use now is 36 bytes, for public keys (which includes the type of key and key material all encoded as cbor). And we now enforce this as a hard limit for new data. In the future we'd like to use identity multihashes for CSIDH public keys (64 bytes + multikey header), but that is a future discussion.

Jul 10 '20 11:07 ianopolous

multihash is a general purpose standard with general purpose libraries. Given how young the project is, we should assume that the use cases we currently understand are not a complete set and we need to make sure we stay open and accessible to being used for things in the future that we haven’t even thought of yet.

With that in mind, I don’t think that we should:

Create and enforce limits based on our own opinions about what is “good practice.”
Adopt and enforce limits on behalf of specific use cases.

If there’s a universal reason to have a limit on the size of a multihash that we’re confident is always going to be true, then we should adopt it, but that’s not at all what I’m seeing.

If, as we already know, IPFS wants to use CID’s for subdomains and therefore needs to enforce a size limit on CID’s which in effect limits the size of a multihash, that’s fine, but that’s IPFS’s decision to make and their limit to enforce. That doesn’t belong in the core of multihash because it’s not universally representative of all the use cases someone might build on multihash.

Jul 10 '20 16:07 mikeal

With that in mind, I don’t think that we should:

Create and enforce limits based on our own opinions about what is “good practice.”

I think we should do exactly that. This will prevent systems from randomly blowing up when some implementations impose arbitrary limitations. The nice thing about having a limit in multihash is, that you can highly optimize it. You would always know the upper bound of the supported hashes, hence e.g. do things with stack allocations only. This is not possible if you want to support the Identity Hash with maximum compatibility, which would mean 8EiB.
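The 8EiB figure follows from the varint that encodes the digest length. A quick check, assuming the practical cap of 9 bytes per varint from the multiformats unsigned-varint spec (the constant name is made up here):

```python
MAX_VARINT_BYTES = 9  # assumption: practical cap from the unsigned-varint spec

# Each varint byte carries 7 payload bits, so 9 bytes give 63 bits.
max_digest_len = (1 << (7 * MAX_VARINT_BYTES)) - 1

print(max_digest_len)          # 9223372036854775807
print(max_digest_len / 2**60)  # about 8, i.e. roughly 8 EiB
```

With a small fixed bound instead, an implementation could size a buffer once at compile time, which is the optimization being described.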

Jul 10 '20 16:07 vmx

The nice thing about having a limit in multihash is, that you can highly optimize it. You would always know the upper bound of the supported hashes, hence e.g. do things with stack allocations only. This is not possible if you want to support the Identity Hash with maximum compatibility, which would mean 8EiB.

Then don’t “support the Identity Hash with maximum compatibility” ;)

Users and implementations are free to make domain specific decisions about these limits, the right decision for one user will not be the same for another. It’s not the job of the underlying primitive to make these decisions on your behalf because we don’t know what each user’s requirements are.

Look at the block limit, to my knowledge only one transport has a real block limit and yet pretty much every user imposes block limits at half the current transport limit because we called it out as a good practice. It’s not a hard requirement in the spec and it’s not enforced by our codec libraries, but it’s a functioning limit everywhere that it matters.

I’m not saying we shouldn’t define good practices, and even document what we think is a reasonable target limit for multihash, but we shouldn’t impose that limit in these libraries at that layer or call it out in the specification as a hard requirement.

If @ianopolous wants to have big multihashes in his CID’s, he shouldn’t have an issue at the multihash layer, even if he will have issues at the IPFS layer. In the same way that I can create 5MB blocks for a one-off use case knowing that if it ever needs to be used in Bitswap it’s going to break.

Jul 10 '20 17:07 mikeal

Now the question is. Could we come up with a solution for the Peergos use case, which Peergos could upgrade to, while limiting the identity hash to a small size as some libraries already do (and also I'd be in favour of).

I struggle to see how these inline use cases would exist in dag-cbor and newer codecs. From what I can tell, this looks like a workaround for some limits in dag-pb or perhaps in unixfsv1 (I haven’t gone deep enough to know for sure).

You can “inline” node data into the block using any codec that supports the full IPLD data model without hacking it into the CID. I can’t see the utility here other than “we forgot to make this part of the data structure a union” and it seems like the right thing to do there would be to fix the data structure to support that because it’s a lot more complex to deal with data that has been inlined into the multihash. I understand that in the case of dag-pb we may not be able to change the data structures, but that doesn’t mean we should port this practice over to users that have access to the complete IPLD Data Model.

It’s not that inlining data isn’t a common and necessary feature, it is, that’s why we fully considered it in dag-cbor and in the IPLD Data Model and have a compelling feature set. If this pattern is common enough we could even consider adding syntax to IPLD Schemas to make kinded unions on links easier, similar to how we have syntactic affordances for making links in general easier.

But, across a lot of our code, data inlined into the multihash throws a wrench in our layer model and is difficult to support across different Block, Link, and storage interfaces. Most code thinks of a CID as a key and its data living somewhere that it can retrieve by that key. If you put the data in the key, the representational pairing of [ key, value ] is lost and there’s not a very clean way to maintain the interfaces without pushing this to users (which is what happens currently, if you put data in the multihash you’re going to be pulling it out and working with it very manually).

Jul 10 '20 18:07 mikeal

If you put the data in the key, the representational pairing of [ key, value ] is lost

why do you say that? Nothing in e.g. ipld-prime land would change:

you encounter a link with CID \x01\x71\x00\x2A{{ 42 bytes of cbor }}
you go to the ~blockstore~ link loader ask for that
the ~blockstore~ link loader turns around instantly and gives you back {{ 42 bytes of cbor }}
profit
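Those four steps might look roughly like this. It is a sketch only: `load_link` and the dict-based blockstore are stand-ins invented for the example, not the go-ipld-prime API.

```python
from typing import Dict, Tuple

IDENTITY = 0x00


def decode_uvarint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode an unsigned varint at buf[pos:], returning (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7
        if shift > 56:
            raise ValueError("varint longer than 9 bytes")


def load_link(cid: bytes, blockstore: Dict[bytes, bytes]) -> bytes:
    """Return the block bytes for a binary CIDv1, short-circuiting identity CIDs."""
    version, pos = decode_uvarint(cid)
    if version != 1:
        raise ValueError("this sketch only handles binary CIDv1")
    _codec, pos = decode_uvarint(cid, pos)
    hash_code, pos = decode_uvarint(cid, pos)
    length, pos = decode_uvarint(cid, pos)
    digest = cid[pos:pos + length]
    if len(digest) != length:
        raise ValueError("digest shorter than declared length")
    if hash_code == IDENTITY:
        return digest        # the "digest" is the block itself
    return blockstore[cid]   # otherwise go to storage as usual


payload = bytes(range(42))  # stands in for 42 bytes of cbor
cid = b"\x01\x71\x00\x2a" + payload
print(load_link(cid, {}) == payload)  # True
```

The blockstore is never consulted for the identity CID, which is why an empty dict is enough in the usage line.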

Jul 10 '20 19:07 ribasushi

I struggle to see how these inline use cases would exist in dag-cbor and newer codecs. From what I can tell, this looks like a workaround for some limits in dag-pb or perhaps in unixfsv1 (I haven’t gone deep enough to know for sure).

Everything we do is dag-cbor - this has nothing to do with protobuf or unixfs.

It is much much more elegant to do it our way (although elegance wasn't our motivation, speed was) because then the same object class always maps to the same ipld structure. The way I've worked around it is to now have two distinct cbor ipld encodings for the exact same type (class) of object, and handle both types explicitly in the deserialization. So the same type of object now has two totally different ipld structures.

it’s a lot more complex to deal with data that has been inlined into the multihash.

It was trivial to support this for us (4 lines of code globally). There's nothing "manual" to it.

Jul 10 '20 19:07 ianopolous

link loader

That’s specific to Go, where the link loader is an abstraction between the storage layer and the decoded node layer. We don’t have that in every language, often the node layer just talks directly to the storage layer, which means every storage API needs to handle this or every line that asks for data by CID needs to handle this.

Also, @warpfork will need to weigh in, but I recall him mentioning that there are plenty of things in go-ipld-prime that won’t work well when inlining data this way.

It is much much more elegant

The solution we spent considerable time working through to this problem is unions (mostly kinded unions using Link as a kind). It’s a core feature of IPLD Schemas and translates nicely into every programming language and all the abstractions we’ve built.

It’s problematic to have multiple approaches to inlining data and a kinded union provides a much cleaner approach that keeps the type differences clear to everyone. It sounds like you actually want to blur the line a bit on the type differences so I can see how that approach would be more attractive, but as we build out generic libraries it’s rather difficult to have a single type mean very different things.

That said, we’re not going to break or disallow anything that is valid CID/multihash, we just may not have a very nice interface for you to use when you inline data this way, which you probably don’t care about since you have your own libraries ;) And as I’ve already stated, I’m rather opposed to setting a hard limit on multihash size in the specs or core implementations. Some libraries and consumers may set limits you’ll have to contend with and I suspect languages or libraries that want to optimize memory allocations will set a configurable limit, none of which are an issue if you were to take the kinded union approach instead.

Jul 10 '20 19:07 mikeal

It’s problematic to have multiple approaches to inlining data and a kinded union ...

I think this is where the disconnect is. Identity CIDs operate on a layer below where a "kinded union" would exist, they are strictly in "codec-land".

To put it differently: from link-traversal perspective there is no practical difference between:

a 4096-byte identity CID
a 32768-bit output from keccak's shake functions

We currently support both. The proposal is to limit only one of them, on account of one of them being special. How do these 2 examples differ?
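For a sense of scale, the two multihashes come out the same length on the wire. This sketch uses hypothetical helper names; `0x19` is the multicodec entry for shake-256 and `hashlib.shake_256` takes the output length in bytes.

```python
import hashlib


def encode_uvarint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_multihash(code: int, digest: bytes) -> bytes:
    """Frame a digest as <varint code><varint length><digest>."""
    return encode_uvarint(code) + encode_uvarint(len(digest)) + digest


IDENTITY = 0x00
SHAKE_256 = 0x19

inline = encode_multihash(IDENTITY, bytes(4096))  # 4096 bytes of inlined data
hashed = encode_multihash(SHAKE_256, hashlib.shake_256(b"some block").digest(4096))

print(len(inline), len(hashed))  # 4099 4099
```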

Jul 10 '20 19:07 ribasushi

I think this is where the disconnect is. Identity CIDs operate on a layer below where a "kinded union" would exist, they are strictly in "codec-land".

Once you recognize that links in a node graph are transparently traversed, that the link is resolved to a node that replaces the link representation to become the node representation for that property in the parent node, they are functionally equivalent. Both put the node data in the same place and stored in the same block.

Conceptually, this never exists in the decoded node graph:

ParentNode -> Link -> ChildNode

Instead, this is what happens:

// before resolution
ParentNode -> Link
// after resolution
ParentNode -> ChildNode

You can observe this in our pathing, where named properties that are links get resolved to their decoded node value. There’s actually no way to return a link from a fully resolved path.

let value = 1234
let link = Link( value )
{ property: link }

If you resolve the path /property of this block you’ll get 1234, there’s no pathable reference to the link itself.
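A toy version of that resolution rule, with a made-up `Link` class and a `load` callback standing in for whatever fetches and decodes a block:

```python
from typing import Any, Callable


class Link:
    """Hypothetical link wrapper; `key` stands in for a CID."""

    def __init__(self, key: str) -> None:
        self.key = key


def resolve(root: Any, path: str, load: Callable[[str], Any]) -> Any:
    """Walk `path` from `root`, replacing every link with the node it points to."""
    node = root
    for segment in filter(None, path.split("/")):
        node = node[segment]
        while isinstance(node, Link):  # links are traversed transparently
            node = load(node.key)
    return node


store = {"cid-of-1234": 1234}
block = {"property": Link("cid-of-1234")}

print(resolve(block, "/property", store.__getitem__))  # 1234
```

Nothing in this resolver can hand back the `Link` object itself, which mirrors the point that there is no pathable reference to the link.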

Jul 10 '20 20:07 mikeal

Some libraries and consumers may set limits you’ll have to contend with and I suspect languages or libraries that want to optimize memory allocations will set a configurable limit.

That's exactly the issue when we don't specify a limit. Your application would work on one implementation, but not by default on some other. Finding out that one of your identity hashes is too long sounds like a very hard-to-find bug if you are not aware that there might be a limit.

The option might be configurable, but it might not even be configurable at runtime, but at compile time only.

Having a limit makes building systems that work everywhere easier, not having a limit makes it harder with little benefit.

Storing data in Multihash to me sounds wrong, that is not what Multihash means to me. Though it's kind of a neat hack, so if we want to support large data (still with a limit, but a high one) stored within a Multihash, I suggest we introduce a new codec for that. This way codecs can still support the "the hash of my data is bigger than the data itself" use case, while not being forced to also support "i store data in my multihash" use case.

This way the error reporting will also be clear, as you will now see that your codec is not supported instead of having it fail in weird ways.

Jul 13 '20 09:07 vmx

I think a lot of the important things have already been said here, but I've been called out, so I guess I feel I ought to weigh in on the record, heh. I'll mostly just re-highlight things that have already been said that I agree with and want to boost, though:

It is true that we have typically eschewed hard limits in IPLD, while acknowledging that they exist in systems we're likely to be used together with (such as block size limits from IPFS / libp2p), and advising that people probably want to stay within those limits unless they're confident their use-case means they aren't worried about this. I think we have no regrets on this approach so far!
It is simultaneously true that we may want to note a size boundary recommendation, per the same logic as the previous point.
I'd probably vote for having length limits be a parameter of which multihash we're talking about.
- FWIW: I've actually always been a little surprised we have length parameters on all of the multihashes to begin with. I see why it's a parameter for e.g. blake2 variants; it was originally quite a shock to me the first time I saw it was also present for, say, plain ol' sha256.
I strongly agree with @Stebalien 's observations: "We generally assume that CIDs are small and cheap to handle. Keeping this assumption valid is extremely important for performance and usability. Allowing for large (more than 100-128 byte) CIDs makes it very difficult to reason about sizes." Yep; yep; and yep. I am not a fan of large values appearing in CIDs.
Inline CIDs... okay. Let's break this down into a few sub-bullet points:
- I don't think inline CIDs were a good idea, and I don't think we'll be encouraging anyone to use them in a year's time. (I don't think I'd encourage anyone to use them now, in fact; and I might place a bet on the table that we'll be actively and overtly regretting them in a year's time.) The following points are why:
- We have a mechanism for talking about a choice between "inline" data and links now! In the IPLD Data Model, you can do this freely; it's so innately up-to-you that it's barely possible to talk about a world in which it's... not. (This is what @mikeal is getting at in some of his comments. Inline CIDs might've made sense in some applications based around dag-pb, I think, maybe? But that world is... something we should regard as fading into the past very quickly: the IPLD Data Model is much more expressive than dag-pb.)
- We have a well documented and explicitly tool-supported way to choose between inline and linked data, as well: IPLD Schemas have copious support for unions, and various forms of indicating them: including some which simply differentiate based on whether the serial data is a link or some other Data Model kind. (These are the same concepts as you can use without the Schema layer; I highlight it in terms of the Schema system only to emphasize how very, very supported it is.)
- These mechanisms for choosing whether you're using links or "inlining" the data (by just... not putting it into a "link" format at all) work without causing additional serialization calls just to put the data into bytes for an "inline CID", which is... superior.
- These mechanisms for choosing whether you're using links or "inlining" the data are exposed very clearly at the library layer: you just... manipulate your data using the standard Data Model semantics, which resemble an AST. These continue to work and compose even when you're doing deep traversals, etc.
- Contrariwise: we do not have particularly good and clear ways to do high level operations on large graphs of data, whilst constantly deciding which links are going to be encoded as inline or not. It's possible... but I wouldn't say it's smooth, or a corner that we get a better developer experience by regularly asking people to shine a light into that crevasse.
- Given all the other caveats about inline CIDs discussed above (e.g., big sizes cause various major usability questionmarks)... really, there's just... I struggle to think of much good at all that comes from inline CIDs. We have clear alternatives; I can't imagine we won't grow to prefer them.
- n.b. I don't want this screed to be mistaken for a claim we should drop support for inline CIDs either; that's a stronger case than I'm willing to make. But I think it's also important to make remarks on how much energy we want to put behind supporting them nicely. IMO: a very limited amount; and I definitely don't think we should let them have predominating influence on any other design choices we wrangle with.
"There’s actually no way to return a link from a fully resolved path." -> okay, this is true, but I do actually regard that as probably a design bug, fwiw :) I'd like to design our way out of this. The blind spot has hurt us surprisingly little so far, but it does bother me.
- The rest of the arguments made in that comment still stand though! Even though I want to design our way out of this blind spot, my current suspicion is that we'll make "advanced pathing" variants that allow introspecting links, but they'll not be the default path we lead users down in most cases.

@ianopolous -- I'd be happy to talk to you more synchronously about this if you'd like, but in essence,

The way I've worked around it is to now have two distinct cbor ipld encodings for the exact same type (class) of object, and handle both types explicitly in the deserialization. So the same type of object now has two totally different ipld structures.

... that sounds less like a "workaround" and more like "exactly the right thing" to me :)

Maybe there's some different way to organize the type definitions in your language of choice that would make it more natural? Dunno; I'd be willing to look at it with you though, if you'd like. I'll also say: goodness knows golang hasn't been making it exactly easy on me to represent unions either! But it's been a logically sound path to pursue, even when my host programming language hasn't made it frictionless. So far, every time push has come to shove, I've been very happy with the outcomes stemming from our pursuit of unions.

Jul 13 '20 15:07 warpfork

(Perhaps that comment would be easier to read if we introduced some term other than "inline" for when we put data in the same block rather than a separate block -- I used it to describe that general practice in the same comment as discussing "inline CIDs", and that's probably confusing. Forgive me, reader. A better phrasing has not, at this moment, yet occurred to me.)

Jul 13 '20 15:07 warpfork

To follow up a little more concretely on that allusion to "we have a way to talk about links vs ~inline~ embedded data now" --

This is a schema snippet that we could use to describe this common scenario:

type ThingSomewhere struct {
    foo String
    bar Bytes
}

type ThingHereOrNot union {
    | ThingSomewhere map
    | &ThingSomewhere link
} representation kinded

type EnclosingFwoop struct {
    couldBeEmbededOrBeLink ThingHereOrNot
    otherData String
}

A block containing one object matching the EnclosingFwoop type could have either zero or one links in it, and is still described by this schema in either case, and its transitive graph (if it does contain a link -- or just the block itself, if it doesn't) contains one ThingSomewhere.

At no point were inline CIDs ~harmed~ used in the description of this data; and yet, we have choice over whether or not the data is split into two blocks or not.

I find that this is a very straightforward way to describe this situation: and it works fine with arbitrarily complex structures of data; it works purely in ways that are easy to describe in terms of the Data Model (e.g., without needing to discuss using different variants of CIDs); and because of this simplicity, I think this is generally the sort of approach I would recommend essentially all new code to take in preference to inline CIDs, if at all possible.

I've used the schema syntax here only to clarify and describe. It is not necessary to use schemas to do this; one can simply construct data and walk over the data model according to this convention.
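Walking data by that convention, without any schema tooling, could look something like the following sketch. The `Link` class, `resolve_thing`, and the `load` callback are all invented for the example and are not part of any IPLD library.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class Link:
    """Hypothetical stand-in for the Data Model's link kind."""
    cid: str


def resolve_thing(value: Any, load: Callable[[Link], Dict[str, Any]]) -> Dict[str, Any]:
    """Kinded union: a map is the embedded thing, a link points at it."""
    if isinstance(value, dict):
        return value        # embedded in the same block
    if isinstance(value, Link):
        return load(value)  # stored in a separate block
    raise TypeError(f"expected a map or a link, got {type(value).__name__}")


thing = {"foo": "hello", "bar": b"\x01\x02"}
blocks = {Link("cid-of-thing"): thing}

embedded = {"couldBeEmbededOrBeLink": thing, "otherData": "x"}
linked = {"couldBeEmbededOrBeLink": Link("cid-of-thing"), "otherData": "x"}

for fwoop in (embedded, linked):
    print(resolve_thing(fwoop["couldBeEmbededOrBeLink"], blocks.__getitem__)["foo"])  # hello
```

The consumer gets the same `ThingSomewhere` either way; only the kind of the serialized value differs.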

I don't actually know what reasons there would be to prefer using inline CIDs over using this much simpler model of "embed if you want". If there are some very specific situations that really require using a CID for ${external opaque reason}, maybe that's something we should document in a short list of known situations? My suspicion is that list is going to be very short and have the general feeling of enumerating "the exception rather than the rule".

Jul 13 '20 18:07 warpfork

I can explain our case in detail if it helps. Note that we have a work around/"exactly the right thing to do" as mentioned above.

One of our fundamental objects has a field which is ciphertext, so a structureless byte[], which is limited to 5 MiB. This thing can be either a directory or a file, and indeed we explicitly hide this from the ipld layer (you have to decrypt to decide which). We represent this as a class with a field which is a list of cids. Whenever we want the actual ciphertext we just pass the list of cids to the datastore and get back the results (identity hash or not). However, as I mentioned above, this thing can represent a directory, and this means that when we are traversing a path we must retrieve many of these objects, decrypt the ciphertext and recurse. The critical point in our model is that we aren't assuming the datastore is local, normally it is on a remote server. This means that inlining the directory (or small file) data is critical to speed for us because it results in many fewer network round trips/ DHT retrievals.

I'm fine with not using identity multihashes here, as we're about to migrate towards, but it does result in more complicated code, and I wish there had been a limit or indication that these were meant to be < 100 bytes when it was released in go-ipfs.

On the general topic of having a limit. I am very strongly pro having a well defined limit. The reason being that without such a limit we have a system with a primitive that is basically undefined. If some system imposes limits then the end result is a fragmented ecosystem of applications and language implementations that can't talk to each other.

Jul 13 '20 18:07 ianopolous

@ianopolous so to make sure I understood you right: basically your use case is point 3 here: https://github.com/multiformats/multihash/issues/130#issuecomment-656009845 ? Your "bonus" of using inlined CIDs is the ability to not care for a predefined "schema" of the structure, so that "everything is a block" to your unwrappers/decryptors?

Or in other words: you use inline CIDs to logically separate the "transport/decrypt/low-level-decode" codepath from the "semantic high-level decode" one?

Jul 13 '20 19:07 ribasushi

@ianopolous so to make sure I understood you right: basically your use case is point 3 here: #130 (comment) ? Your "bonus" of using inlined CIDs is the ability to not care for a predefined "schema" of the structure, so that "everything is a block" to your unwrappers/decryptors?

Or in other words: you use inline CIDs to logically separate the "transport/decrypt/low-level-decode" codepath from the "semantic high-level decode" one?

@ribasushi Yep that's right.

I should also add that the reason it is 4096 + 16 is that we pad all plaintext to a multiple of 4096 before encryption to protect the metadata around size. (And the 16 is the encryption overhead)
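In numbers, that sizing rule works out as below. This is a sketch under assumptions: the function name is invented, padding is taken to round up to the next multiple of 4096, and how an empty plaintext is handled is not stated in the thread.

```python
PAD_BLOCK = 4096      # plaintext is padded to a multiple of this
CIPHER_OVERHEAD = 16  # per-message encryption overhead mentioned above


def padded_ciphertext_len(plaintext_len: int) -> int:
    """Ciphertext size after padding up to a multiple of PAD_BLOCK and encrypting."""
    padded = -(-plaintext_len // PAD_BLOCK) * PAD_BLOCK  # ceiling to a multiple
    return padded + CIPHER_OVERHEAD


print(padded_ciphertext_len(1))     # 4112
print(padded_ciphertext_len(4096))  # 4112
print(padded_ciphertext_len(4097))  # 8208
```

So the smallest non-empty fragment is 4096 + 16 = 4112 bytes, which is where the inlining threshold came from.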

Jul 13 '20 19:07 ianopolous

Your application would work on one implementation, but not by default on some other.

This is true of pretty much every protocol. TCP, UDP, HTTP, etc, there’s no size limit in the protocol specification for the total data transferred, but every service provider and implementation sets one. These limits don’t seem to negate the benefits of agreeing on common protocols and clients learn to live within reasonable limits the ecosystem of service providers have set.

As an example: I have a script that does GraphQL queries to GitHub’s service and if the request takes too long the gateway kills the connection, even though the query was well within the rate limits that GraphQL and even their HTTP service set. Service limitations are application and provider specific, and they are applied to all the protocols you touch; we can’t enforce them for everyone or even hypothesize about what all the use cases are. For a lot of people, setting a block size limit of 1mb solves any concerns they might have about large CID’s as a side effect. For others, maybe not.

I’m very interested in recommending a reasonable size limit and would expect many consumers to adopt it (similar to how we handled block size limits) but to set a hard limit in the standard is too much for me.

Jul 13 '20 21:07 mikeal

The key difference here is that you're talking about "interactive" situations where limitations can be worked around after the fact:

If your program guesses the MTU wrong, it can try again with a smaller MTU.
If your program creates large GitHub queries, you can fix it to create smaller queries. Same for long running queries. Furthermore, you can readily test because there's a single centralized service.
If you're issuing too many queries and hit a limit, you can backoff and try again later.

If your application creates large blocks/CIDs that some applications support, it'll appear to work until someone tries to use it where such large blocks/CIDs are not supported. At that point, your data is literally incompatible with this other program so you have to either:

Change the data (changing all the hashes).
Change the program (changing important perf assumptions in the program).

Worse, many programs will support arbitrarily large CIDs/blocks, some user will start creating these arbitrarily large CIDs/blocks, then you'll try to forbid them for some performance/security reason, and you'll break end users.

MTUs usually don't have this problem because most systems assume a min MTU of around ~1KiB.
GitHub doesn't have this problem, because it can start warning devs about long-running queries, etc. before banning them.

On the other hand, we can't just go ahead and say "rewrite all your data and change all your CIDs". That's the real downside to content addressed data.

This is exactly the case we're running into with peergos right now and it's exactly why we need these clear limits. I guess we don't necessarily need true "maximums", but we:

Need a clear "if it's larger than X, it probably won't work in all cases".
Need some way for users to avoid accidentally going over X. Unfortunately, the usual way to do this is to enforce at least a soft maximum.
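One possible shape for such a soft maximum is a check that warns by default and only rejects in a strict mode. Everything here is hypothetical, including the function name and the value of X (picked from the 100-128 byte range mentioned earlier in the thread).

```python
import warnings

RECOMMENDED_MAX_CID_BYTES = 128  # hypothetical X; not a value agreed in this thread


def check_cid_size(cid: bytes, soft_limit: int = RECOMMENDED_MAX_CID_BYTES,
                   strict: bool = False) -> bytes:
    """Warn (or raise, if strict) when a CID is larger than the recommended size."""
    if len(cid) > soft_limit:
        message = (
            f"CID is {len(cid)} bytes, above the recommended {soft_limit}; "
            "it probably won't work in all cases"
        )
        if strict:
            raise ValueError(message)
        warnings.warn(message, stacklevel=2)
    return cid


check_cid_size(bytes(36))   # small CID: passes silently
check_cid_size(bytes(200))  # large CID: emits a UserWarning
```

A writer that calls this before persisting data would surface the problem at creation time, before the hashes are baked into content-addressed data.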

Jul 13 '20 22:07 Stebalien
