Encryption in IPLD

Open mikeal opened this issue 4 years ago • 6 comments

We’ve been talking about encryption for a long time but haven’t done much to actually accommodate it.

In the meantime, people have built encrypted applications on IPLD using application-specific encryption schemes. There’s a lot we can learn from these approaches, but one thing they all have in common, and that we need to be concerned with, is that they can’t make use of the higher-level tools we’ve been creating for generic IPLD usage. Specifically, they only get limited utility from Schemas and Selectors.

Below I’ve tried to brain dump my current state on how to tackle this problem and the different considerations that are bouncing around.

“Fat Pointers”

The loose consensus among the IPLD team has been that encryption is best handled by some use of a “fat pointer.” However, the meaning of fat pointer varies, so it’s worth discussing the different meanings before we get into different approaches.

Fat Pointer as a Node (using a Schema)

type EncryptedLink struct {
  cid Link
  fromPublicKey Bytes
  toPublicKey Bytes
  algo String
  codec Int
}

Pros

We can create pointers anywhere we can put a node, which means anywhere inside of a block.
We would not need to update or improve most of our existing primitives.

Cons

Expensive. This adds a lot of overhead to every link, almost all of it duplicated in many places in a way that cannot be de-duplicated.
This sort of ditches CID as the link mechanism. We’d have to use raw blocks everywhere and write new codec information into this link, so we’d still end up having to update the read portions of our stack, because the information about how to decode the block now lives outside the CID.
Doesn’t easily interop with existing data structures. Every data structure spec we have in IPLD Schema would need to be updated to put Unions in every place we might want encrypted data.
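To make the "Expensive" point concrete, here is a minimal sketch that mirrors the EncryptedLink struct in Python and adds up the per-link overhead. The 32-byte keys, the algorithm name, the 36-byte stand-in CIDs, and counting the codec as a single byte are all assumptions for illustration, not part of any spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncryptedLink:
    """Python mirror of the EncryptedLink struct (illustrative only)."""
    cid: bytes              # stand-in for the binary CID of the encrypted block
    from_public_key: bytes
    to_public_key: bytes
    algo: str
    codec: int              # codec of the decrypted data

    def overhead(self) -> int:
        """Bytes carried beyond the CID itself, ignoring encoding framing."""
        # Assumption: the codec fits in one byte once encoded.
        return (len(self.from_public_key) + len(self.to_public_key)
                + len(self.algo.encode("utf-8")) + 1)


def total_overhead(links) -> int:
    """Sum the overhead of every link; nothing here can be de-duplicated."""
    return sum(link.overhead() for link in links)


# Assumed sizes: 32-byte keys and a hypothetical algorithm identifier.
sender = bytes(32)
recipient = bytes([1]) * 32
links = [
    EncryptedLink(cid=bytes([i % 256]) * 36, from_public_key=sender,
                  to_public_key=recipient, algo="x25519-xsalsa20", codec=0x71)
    for i in range(1000)
]

per_link = links[0].overhead()
total = total_overhead(links)
```

Under these assumptions every link repeats the same key material, so the overhead grows linearly with the number of links even though the keys never change.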

Fat Pointer as a Block (using a Multiformat Codec Identifier)

This approach is what something like COSE would lean towards.

dag-cbor block -> dag-cose block -> raw

Pros

This extends a standard incorporating the needs of several parties across the industry.
We would not need to update or improve most of our existing primitives.

Cons

This has all the same drawbacks as the above method but much worse, since every encrypted Node would need a unique block to use as a link.

Fat Pointer as a CID (presumably CIDv2)

We could extend CID to include information about how to decrypt the linked block.

There are a few different ways to accomplish this, but the one that makes the most sense to me would be to make CIDv2 the “fat pointer” version of a CID. Basically, it’s a CID with 2 multihashes. One multihash identifies the block containing the “fat pointer” and the second identifies the actual data. Unlike CIDv1, the codec identifier only tells you how to decode the “fat pointer,” and the pointer will include the necessary information to decode the block data for the second multihash.

The reason this needs to be two multihashes in a single CID, rather than including the second CID in the “fat pointer,” is so that you can de-duplicate common fat pointers. This would allow us to de-duplicate the publickey information for all encrypted data in a way that we can’t with the prior approaches.
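A rough sketch of what a two-multihash CID could look like. The byte layout here (version, codec of the pointer block, multihash of the pointer block, multihash of the data block) is purely hypothetical; no CIDv2 layout has been specified. The sha2-256 multihash code (0x12) and dag-cbor codec code (0x71) are the real table values, and every value used fits in a single-byte varint.

```python
import hashlib

SHA2_256 = 0x12  # multihash code for sha2-256
DAG_CBOR = 0x71  # multicodec code for dag-cbor


def multihash(data: bytes) -> bytes:
    """sha2-256 multihash: <hash code><digest length><digest>."""
    digest = hashlib.sha256(data).digest()
    return bytes([SHA2_256, len(digest)]) + digest


def fat_cid(pointer_block: bytes, data_block: bytes,
            pointer_codec: int = DAG_CBOR) -> bytes:
    """Hypothetical layout: <2><pointer codec><pointer multihash><data multihash>."""
    return (bytes([2, pointer_codec])
            + multihash(pointer_block) + multihash(data_block))


def split_fat_cid(cid: bytes):
    """Return (codec, pointer multihash, data multihash)."""
    if cid[0] != 2:
        raise ValueError("not a two-multihash CID in this sketch")
    codec = cid[1]
    first_len = 2 + cid[3]  # hash code + length byte + digest
    return codec, cid[2:2 + first_len], cid[2 + first_len:]


# Stand-in bytes; a real pointer block would be an encoded IPLD node.
pointer_block = b"decryption info shared by many blocks"
cid_a = fat_cid(pointer_block, b"encrypted block A")
cid_b = fat_cid(pointer_block, b"encrypted block B")

_, pointer_a, data_a = split_fat_cid(cid_a)
_, pointer_b, data_b = split_fat_cid(cid_b)
```

Both CIDs carry the same first multihash, so the pointer block is stored once, while the second multihash still identifies each encrypted block individually.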

Pros

Cheapest option. The multihash is smaller than the encryption information, so there’s a savings as soon as you have two blocks encrypted with the same information.
Deduplication. You can encode decryption information into a single block and use that for as many encrypted blocks as you want.
Extensible. We can leverage this for other “fat pointers” in the future.
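The cost argument can be checked with some back-of-the-envelope arithmetic. This is a sketch under a simplified model: inline encryption info costs `info_size` bytes per link, while the CIDv2 approach pays `info_size` once for the shared pointer block plus one extra multihash per link. The 34-byte multihash (sha2-256) and the 80-byte info size are assumptions.

```python
def inline_cost(n_links: int, info_size: int) -> int:
    """Every link repeats the full encryption info."""
    return n_links * info_size


def cidv2_cost(n_links: int, info_size: int, multihash_size: int = 34) -> int:
    """One shared pointer block, plus one extra multihash per link."""
    return info_size + n_links * multihash_size


def break_even(info_size: int, multihash_size: int = 34, limit: int = 1000):
    """Smallest number of links at which the shared pointer is cheaper."""
    for n in range(1, limit + 1):
        if cidv2_cost(n, info_size, multihash_size) < inline_cost(n, info_size):
            return n
    return None  # never cheaper within the limit


big_info = break_even(80)    # e.g. two 32-byte keys plus a little metadata
small_info = break_even(40)  # info only slightly larger than a multihash
tiny_info = break_even(30)   # info smaller than a multihash
```

In this model the saving starts at two blocks only when the encryption info is more than twice the size of a multihash; smaller info pushes the break-even point out, and info smaller than the multihash never pays off.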

Cons

This changes a lot more in our stack than the other approaches. While it makes some things a bit easier, I haven’t even thought through all the implications across the stack.
The extensibility is actually a little concerning. Application developers could use this to embed all kinds of new information without having thought through all the considerations. For instance, Schemas only validate that a link is a CID, which means that these fat pointers MUST resolve to a single node. If you did something like embed a selector as a fat pointer that resolved to multiple nodes, you’d cause considerable breakage elsewhere in the stack.

Signing vs Encryption

In reading @johnnycrunch’s notes on COSE I realized something: most of what we’ve been thinking of as signing use cases are really encryption use cases.

When we talk about signing we tend to use it to describe “ownership.” If you sign something you are saying “this is mine” or “I made this.” But it’s entirely possible, even quite easy given how links work, to sign the work of other people. This is always a concern when you structure data to be signed, but it’s particularly problematic for us because you’re signing the link, and it’s very easy to use that hash elsewhere with some ownership attached to it, which is neither accurate nor secure.

What you probably want most of the time is not signing but encryption against a publicly decryptable secret or key. This way, the link hash is unique to data encrypted with your publicKey and nobody else can “sign” that hash. Sure, they can decrypt the data and produce their own but they’ll have a different hash.
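A toy illustration of the difference. The cipher below is an XOR against a SHA-256-derived keystream, chosen only because it runs with the standard library; it is NOT secure and is not what any real scheme would use. The key names are made up. The point is just that the link hash of the ciphertext depends on the key, while the hash of the plaintext does not.

```python
import hashlib


def toy_keystream_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256-derived keystream. Toy only, NOT secure.

    Applying it twice with the same key returns the original data.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)


def link_hash(block: bytes) -> str:
    """Stand-in for the hash portion of a CID."""
    return hashlib.sha256(block).hexdigest()


document = b"some data that anyone could link to"
alice_key = b"alice-published-key"      # hypothetical, publicly known
mallory_key = b"mallory-published-key"  # hypothetical, publicly known

# Signing leaves the link untouched: everyone signs the same hash.
plain_link = link_hash(document)

# Encrypting first makes the link specific to the key that produced it.
alice_block = toy_keystream_cipher(alice_key, document)
alice_link = link_hash(alice_block)
mallory_link = link_hash(toy_keystream_cipher(mallory_key, document))

# Anyone holding the published key can still read the data.
recovered = toy_keystream_cipher(alice_key, alice_block)
```

Mallory can decrypt Alice's block and re-encrypt it, but the result hashes differently, so Alice's link cannot be reused with Mallory's ownership attached.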

Replication Keys

Several content addressed encryption schemes have employed a two tier system for encryption so that data can be replicated using a “replication key” that exposes the full graph of links but cannot decrypt the other data.

This is a good system, but it can also easily double the cost of encryption if “fat pointers” can’t be de-duplicated.

Using the CIDv2 proposal, this could be done with only 2 “fat pointer” blocks being created for an entire graph of encrypted data, whereas other methods will end up doubling the overhead.
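A small sketch of why the count stays at two. It assumes each encrypted block references one descriptor for the replication tier and one for the content tier, and that descriptors are stored by hash so identical ones collapse. The descriptor contents are placeholders, not a proposed format.

```python
import hashlib


def pointer_id(descriptor: bytes) -> str:
    """Content address of a fat pointer block (stand-in for a multihash)."""
    return hashlib.sha256(descriptor).hexdigest()


# Placeholder descriptors; real ones would carry key and algorithm details.
REPLICATION_POINTER = b"tier=replication;key=<replication key info>"
CONTENT_POINTER = b"tier=content;key=<content key info>"


def fat_pointer_blocks(n_blocks: int) -> dict:
    """Content-addressed store of pointer blocks for a graph of n_blocks."""
    store = {}
    for _ in range(n_blocks):
        for descriptor in (REPLICATION_POINTER, CONTENT_POINTER):
            # Identical descriptors land on the same key, so they de-duplicate.
            store[pointer_id(descriptor)] = descriptor
    return store


def inline_copies(n_blocks: int) -> int:
    """Without de-duplication, every block repeats both descriptors."""
    return 2 * n_blocks


shared = fat_pointer_blocks(500)
repeated = inline_copies(500)
```

For a graph of 500 encrypted blocks this yields 2 stored pointer blocks, against 1000 inline copies when the same information is embedded in every link.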

Conclusions

Having thought about this for a while now, I think that encryption is best understood as part of a larger and more generic problem set that we have yet to tackle in IPLD.

There are use cases that require more information than just the codec identifier to decode the block data. I think that, historically, we’ve pushed back on these because we were trying to carve out exactly what the Block layer is meant to provide and what should be pushed up the stack. I can point to many times that others and I have tried to push application-specific considerations into this layer, only to realize months later how problematic it is.

Now that we have a better idea of where these boundaries are I think that we can start tackling this problem set.

Apr 02 '20 22:04 mikeal

/cc @ianopolous to chime in: if my understanding is correct, Peergos uses a close but distinct variant of option 1, while still managing to fully leverage "CID as the link mechanism".

Apr 02 '20 22:04 ribasushi

CID as the link mechanism

I should have been clearer. CID is still the link, but it no longer identifies how to do any part of the decoding (codec and decryption). You could, in theory, just use the codec of decrypted data but that’s a very bad practice as far as I’m concerned because systems that see the block may try to decode it.

I’m working on a better storage layer right now and one thing that it needs to do is decode the data of any block it is asked to store in order to index all the links. This would cause breaks at the storage layer that I’m not sure how to even go about resolving.
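Here is a sketch of the failure mode, using JSON from the standard library as a stand-in for a real dag-json decoder. The `index_links` function and the placeholder CID strings are hypothetical. The codec codes (0x55 for raw, 0x0129 for dag-json) and the `{"/": "<cid>"}` link form are the real ones.

```python
import json

RAW = 0x55        # multicodec code for raw
DAG_JSON = 0x0129  # multicodec code for dag-json


def _walk(node):
    """Yield every link found in a decoded node."""
    if isinstance(node, dict):
        if set(node) == {"/"} and isinstance(node["/"], str):
            yield node["/"]  # dag-json encodes a link as {"/": "<cid>"}
            return
        for value in node.values():
            yield from _walk(value)
    elif isinstance(node, list):
        for value in node:
            yield from _walk(value)


def index_links(codec: int, block: bytes) -> list:
    """What a link-indexing store has to do for every block it accepts."""
    if codec == RAW:
        return []  # raw blocks are opaque, nothing to index
    if codec == DAG_JSON:
        return list(_walk(json.loads(block.decode("utf-8"))))
    raise ValueError(f"unsupported codec {codec:#x}")


# Placeholder strings, not real CIDs.
plain = json.dumps({"name": "dir", "children": [
    {"/": "placeholder-cid-1"}, {"/": "placeholder-cid-2"}]}).encode("utf-8")
ciphertext = bytes([0xFF, 0xFE, 0x00, 0x9C, 0x10])

links = index_links(DAG_JSON, plain)
opaque = index_links(RAW, ciphertext)

try:
    index_links(DAG_JSON, ciphertext)  # ciphertext labelled with plaintext codec
    mislabelled_ok = True
except ValueError:
    mislabelled_ok = False
```

Ciphertext labelled as raw is stored without trouble, but the same bytes labelled with the plaintext codec make the indexer fail at decode time.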

Apr 02 '20 22:04 mikeal

I would strongly advise against baking encryption into IPLD itself. It layers very cleanly and efficiently on top of ipld as we have demonstrated in Peergos. All of our blocks are ipld, and either dag-cbor or raw. We can represent entire file systems easily and efficiently. The dag-cbor nodes are normally part of a merkle-CHAMP, and the raw nodes are leaf nodes that contain encrypted fragments.

E.g. in our usage we are not encrypting from/to any public keys. All of our encryption that is visible to ipld is symmetric, and so wouldn't fit in this fat pointer schema at all.

I'm happy to jump on a meeting to talk about how we do things if it helps.

Apr 02 '20 22:04 ianopolous

@ianopolous it’s certainly possible to do encryption in IPLD, that’s not in question as you and other folks have all proven.

The problem is that these encryption schemes are application specific. Once you pick one you’re effectively outside of the IPLD ecosystem above the block layer. You can’t easily use IPLD Schemas, you can’t use the generic data structures we’ve been building, you can’t use our Selector engines, etc.

You built a lot of your system before these tools were even in place, so I doubt this was much of a loss. But as we work toward broader developer adoption, these tools are critical, and they won’t be very usable if they don’t work with private data.

What we want is a way for applications to easily add encryption schemes (hopefully several different approaches) and to then be able to integrate those into other generic IPLD tools with minimal changes.

Apr 02 '20 22:04 mikeal

@mikeal Can you give a concrete use case? My understanding is if we wanted to we could use ipld selectors over all the dag-cbor nodes. The raw ciphertext block by definition can't have visible information. So I don't think we've lost anything by doing it first.

The one thing that I think would be useful to standardise is the cbor wrapper for ciphertext. We have standardised that internally and it is 100% reusable. The other relevant thing here is multikey. Again because we are several years ahead here we've standardised our own multikey cbor format internally, but would love for it to be standardised externally as well in a compatible way.

I'd love to be included in discussions around this and can probably contribute a lot given our experience.

Apr 03 '20 07:04 ianopolous

Hi everyone, I've been looking into encryption too. I would like to see transparent encryption such that every user of IPFS can have added security with little overhead (mental and resource overhead). I drafted https://github.com/ipld/ipld/pull/135 and am just finding this discussion now. It definitely appears as though there are multiple unconnected groups of people trying to design encryption for IPFS.

Pros: We can create pointers anywhere we can put a node, which means anywhere inside of a block.

I think this is important. I want to be able to reference encrypted content just like any unencrypted content. Include encrypted-to-unencrypted links and vice-versa.

Cons: Expensive. This adds a lot of overhead to every link, almost all of it duplicated in many places in a way that cannot be de-duplicated.

My proposal somewhat addresses this by separating the secret, which needs to be kept in the link, from the configuration, which doesn't. You can think of it as the CID now specifies what it points at and how to understand it PLUS the secret required to decode it. You can then look up the rest of the information in the block itself. For example in the block you find the encryption mode and parameters. So the block tells you how to interpret the "secret" that you are holding onto from earlier.
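A minimal sketch of that split, based only on the description in this comment and not on the text of the PR. The `SecretLink` type, the block layout, and the mode name are hypothetical, and the XOR keystream is a toy stand-in for a real cipher (NOT secure).

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

TOY_MODE = "toy-xor-sha256"  # hypothetical mode name


def toy_xor(secret: bytes, data: bytes) -> bytes:
    """XOR with a SHA-256-derived keystream. Toy only, NOT secure."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(secret + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)


@dataclass(frozen=True)
class SecretLink:
    cid: str                        # what the link points at
    secret: Optional[bytes] = None  # only present for readers


def make_block(secret: bytes, plaintext: bytes) -> dict:
    """The block carries the configuration; the secret stays in the link."""
    return {"mode": TOY_MODE, "payload": toy_xor(secret, plaintext)}


def cid_of(block: dict) -> str:
    """Stand-in content address over the block's contents."""
    return hashlib.sha256(block["mode"].encode() + block["payload"]).hexdigest()


def read(link: SecretLink, block: dict) -> Optional[bytes]:
    if cid_of(block) != link.cid:
        raise ValueError("block does not match link")
    if link.secret is None:
        return None  # can fetch and verify the block, but not read it
    if block["mode"] != TOY_MODE:
        raise ValueError("unknown encryption mode")
    return toy_xor(link.secret, block["payload"])


secret = b"per-link secret"
block = make_block(secret, b"hello private world")
full_link = SecretLink(cid_of(block), secret)
bare_link = SecretLink(cid_of(block))
```

Both links address the same block. Only the one carrying the secret can recover the plaintext, and the mode is looked up in the block rather than repeated in every link.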

Cons: This sort of ditches CID as the link mechanism, we’d have to use raw blocks everywhere and write new codec information into this link.

I'm not quite sure what you mean by this. The CID is just extended with the "secret" component. In my design you can explicitly use a CID without secret as before to link to things, or you can add the secret to allow full reading.

You could, in theory, just use the codec of decrypted data but that’s a very bad practice as far as I’m concerned because systems that see the block may try to decode it.

In my proposal you can still decode the structure. That obviously has downsides but I think it solves your storage restructuring use case.

Doesn’t easily interop with existing data structures. Take any data structure spec we have in IPLD Schema, they would all need to be updated to put Unions in every place we might want encrypted data.

This is definitely a downside. I haven't found a way to avoid this. Hopefully we can use a fairly uniform implementation between structures but it still requires updating each one to specify how the "payload" is decrypted.

Fat Pointer as a CID (presumably CIDv2)

This is closest to what I did. But instead of using a block I just stuck the secret in the CID and the decryption config into the block itself.

Signing vs Encryption

For me, I didn't feel the need to join the two at this level. With IPFS everything is content addressed, so signing is as simple as signing the root. This already exists with IPNS, and I didn't feel the need to add anything more at the moment.

Sep 10 '21 23:09 kevincox
