specs Sector set storage

We need a data structure to hold sectors on chain, inside the miner actor. Currently, we just have them all in a big map, which won’t scale well (every change to it creates a new object, meaning we would rewrite the entire thing with every change).

This issue is a place to brainstorm solutions.

Requirements

Ordered Set
- no values, just keys (TODO: is this true? Is it just storing CommRs?)
- ~~‘Ordered’ requirement isn’t as strict as it sounds, we really just need to be able to reference elements within a given set by their index within the set. It is okay if indexes of items change with insertions and deletions.~~
Efficient additions and random removals
~O(log(n)) changed objects per mutation
Efficient lookup by key and index

Nice to haves

Efficient batch remove by index range

Does the HAMT work?

We could potentially use the HAMT code, but we would need to adapt it to have efficient index lookups. To do this, we could annotate links to children with a count of how many values exist in and under that child, allowing us to know at each node which child the thing we are looking up is in.

Jan 25 '19 14:01 whyrusleeping

One question here is what data do we actually need to store? We currently store both commR and commD, but really only commR needs to go on chain technically. The reason commD being on chain is nice is that it makes client payments simpler, they don't have to wait for the miner to seal their data before they can give a properly guarded payment voucher for the deal.

Jan 25 '19 15:01 whyrusleeping

The reason commD being on chain is nice is that it makes client payments simpler

A storage client will need CommD if they are to verify a piece inclusion proof provided by a storage miner claiming to have stored their piece. If CommD doesn't appear on chain, the storage client can get it by querying the miner for deal state. Can the storage client trust the CommD returned from the storage miner?

Jan 25 '19 22:01 laser

The client can trust the commD as long as the miner also gives the client the porep proof and the commR (and other inputs)

On Fri, Jan 25, 2019, 2:47 PM Erin Swenson-Healey [email protected] wrote:

The reason commD being on chain is nice is that it makes client payments simpler

A storage client will need CommD if they are to verify a piece inclusion proof provided by a storage miner claiming to have stored their piece. If CommD doesn't appear on chain, the storage client can get it by querying the miner for deal state. Can the storage client trust the CommD returned from the storage miner?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/filecoin-project/specs/issues/116#issuecomment-457760181, or mute the thread https://github.com/notifications/unsubscribe-auth/ABL4HAbRttHmYOPwAta0ehbqTF6UVu-Yks5vG4mLgaJpZM4aS8dp .

Jan 25 '19 22:01 whyrusleeping

Thinking through this a bit, the HAMT won’t work, as it won’t put contiguous sectors next to each other in the index space. We need a data structure that puts sectors added at the same time next to each other, to make bitmasking over them in failure modes easier. If we used the hamt, and a miner lost a disk, they would have to select random bits across the entire set, because the HAMT would insert the sectors based on their hash.

So we need to build a new data structure... yay.

Feb 07 '19 01:02 whyrusleeping

Jumping into the middle and addressing the comments. There is another reason to put commD on chain: third-party verification of data. The use case is that someone wants to verify that a given piece of content is stored in a certain number of sectors. (This is a real use case we are actively working toward supporting.) If commD is on chain, they can compute commD for the data themselves, then (somehow) query the chain for matching commRs and verify that they represent live sectors.

If commD is not on chain, the third party won't know which proofs to verify (passing commR and the known commD).

Come to think of it, how will anyone other than the original client verify those proofs without knowing commD?

Feb 07 '19 02:02 porcuquine

@porcuquine Yeah, that’s a good point. It does add overhead to the chain, but it might be worth it.

My thinking for proof verification is that the commD is submitted with the initial CommitSector call, used for verification, but not stored beyond that. I think having the commD around also might make different potential payment schemes simpler, so for now, let’s move forward assuming that we will keep the commD as well as the commR.

——

Separately, thinking more on whether or not we can use the HAMT. It’s worth considering how we want the sector IDs to be assigned. We could assign them during each new commitsector call, monotonically, or we could even just say “the miner gets to pick the ID for every sector”. This makes it even easier for miners to ensure that sectors likely to get lost at the same time have contiguous IDs.

If we do that, then it becomes feasible again to use the HAMT (though in a less efficient way than I was thinking before). We could simply have the ‘key’ be the sector ID, and have the value be the commR and commD (and maybe size? for multiple sector sizes). (This scheme could also work in a situation where the miner doesn’t choose the sector IDs, but the miner choosing the sector IDs requires something like this).

Feb 07 '19 02:02 whyrusleeping

I see — my point was only that commD needs to be on chain somewhere. As long as that's the case, I assume (perhaps wrongly) that a query mechanism could track down the relationship to the commR (but I may not be fully visualizing the chain of events).
The problem with miner picks is that it might allow grinding the replicaID. We would need to make sure the protocol prevented that somehow. Otherwise, I think it would open a class of attacks on PoRep.
I do agree that contiguous IDs of lost sectors would be virtuous. Can we arrange that in some other way?
I haven't had time to think about this enough, but here's a starter idea for you to think about and/or comment on. Could we use a persistent vector, with indexes being sector ids? That would allow us to look up by index as before. This would mean we would need to avoid/penalize sparseness. But might this also be part of a strategy to encourage contiguous sectors/ids? Here's a reference for persistent vectors in Clojure: https://hypirion.com/musings/understanding-persistent-vector-pt-1 (The cited work there is about HAMT — but I believe this is not identical. I need to dig into the details of HAMT per se more in order to follow this discussion better.)

Feb 07 '19 02:02 porcuquine

@porcuquine hrm... is the miner picking the replica ID going to be a problem? What is the attack there, I'd like to dig into it, because even the 'current' scheme allows some miner selection over the ID (though not 'full choice').

Persistent vectors are really cool. Reading through that doc, it looks like a really good structure to use here, more suited to the task than the HAMT (I was trying to avoid having to write a new merkle-datastructure, but maybe its inevitable).

Feb 07 '19 04:02 whyrusleeping

I can't quantify how problematic it is, but bear in mind that we don't know the real-world attacks on ZigZag yet. We need to place bounties and incent people to try to break it.

That said, here is the shape of my concern. Remember that sector ID is one component which is hashed to the replica ID. That means if I can freely choose sector ID, I can grind until I get an ID I like. In such a world, I can essentially perform proof of work, searching for a replica ID less than some number (i.e. with leading zeroes). This allows me to reduce the entropy provided by replica ID.

Taken to its logical extreme (admittedly implausible in practice), I could always select the same replica ID — regardless of the chain randomness contributed, just by treating sector ID as a random nonce. If I have the same replica ID for every sector, then I can reuse the same physical replica for multiple sectors.

Now assume I cannot freely choose the replica ID, but I can instead reduce it to a single bit. This still means I can share replicas. It just means I also have to have at least 2 real replicas. We can continue with this logic, expanding the number of real replicas I am willing to store. To make it concrete: if I am storing GiB sectors and I have a TiB of storage, then I will have 2^10 physical replicas. Therefore, if I could reduce the replica ID space to 10 bits, I could use this attack freely. If I am storing 256MiB sectors and have a PiB of storage, then I only need to reduce to 22 bits (which I realize is still vastly impractical).

Think about it differently, and you see that I don't need to have followed this strategy from the beginning. In fact, this attack is always available to me. I can always grind as much as I like in the hope of finding a collision with one of my existing replicas. The more replicas I have, the better my chance of winning this lottery.

On another front, generation attacks. Assume there exists a method for generating fake challenge responses which will actually succeed sometimes. Is this true? I don't know, but I don't think it has been proved that it cannot exist. Assuming this strategy exploits some property of the actual hash and encoding functions we use, it may only work with certain replica IDs. The likelihood of such an algorithm is certainly higher than that there exists one which works for all replica IDs. Therefore, if I know of such an algorithm, I can also (completely in addition to the previous attack) grind for replica IDs satisfying that constraint as well.

Admittedly, I do not have a candidate for this attack. But I think it would be worth thinking through over time.

Hopefully that makes clear why I'm more concerned about free choice of sector ID, than I am about limited ability to influence the ID. As long as I'm constrained to choose from a reasonably bounded set, then the amount I can grind is limited. If I have full control, then I can grind to my heart's content; and it does seem that at some scale (i.e. enough existing replicas) then it would occasionally pay off.

This might be an incentive for malicious (or even rational) mining pools or an ad-hoc market. Imagine a future in which Filecoin takes over, and there really are an enormous number of replicas out there. In that world, proof of space-time can be enhanced by proof of work for rational miners open to collusion.

Imagine I don't have any storage space at all, but I have an ASIC that can grind replica IDs. I do this, searching for collisions with any stored sector's replica ID. When it comes time for me to generate a PoSt, I am challenged for a random node in a sector owned by someone. I either pay them to ship me the entire sector, or better yet just to provide the single inclusion proof for the challenged node.

With enough hashing power thrown at the problem (in our imaginary future, where physical media are exhausted, but the demand for Filecoin is infinite) — this creates a natural secondary market for partial PoSts. As a storage miner, this is already my only product. Why not get paid for it directly?

In a sense, maybe this is completely fine. It dilutes a storage miner's chance of wining block reward, but it increases his chance of receiving a serendipitous PoSt-on-demand windfall. So maybe this secondary proof-of-work market on top of Filecoin would actually simply be a way of recruiting surplus POW resources into the Filecoin network. These windfall rewards would still be tied to total storage, since presumably the price for larger sectors would be proportionally larger. At the same time, since the price would not be mandated — perhaps this could actually end up acting as a positive market force providing some extra arbitrage/liquidity. Maybe this is a cryptoeconomics question.

As I think about it, it's not 100% clear that this would be bad. The premise is straightforwardly problematic on the surface in several ways, but it has interesting implications worth thinking about. My point is just that we should think about it, since how we choose sector ID has implications.

Feb 07 '19 05:02 porcuquine

Regarding implementing persistent vectors: I think it would be a lot of fun and probably be a valuable primitive to have available in many places even beyond Filecoin internal needs. I'd theoretically be happy to help, though would need to prioritize against other projects.

Feb 07 '19 05:02 porcuquine

Thinking about it more, I think there is a more concrete attack. When grinding for desirable replica ids, a miner doesn't even have to target only existing replicas. Any discoverable collision can then be used to create false replicas. Because of the birthday paradox, this is a much easier attack (reducing to 128 bits of security, I believe).

[EDIT: Having thought this through, I now think it's fine because an actual attack would amount to breaking our chosen hash function. Much of the rest of my thinking out loud is probably invalidated by that. It may still have some value for thinking through why we need a collision-free hash function to derive the replica id.]

A large pool of rational miners could generate and publish potential replica ids as fast as possible. If a match is found, the two parties agree to share a true replica — with one storing the data, providing proofs for both as needed, and presumably receiving a market-determined commission. This would be detectable, since it's highly improbable for identical replica ids to appear at the same time. However, detection could be mitigated by withholding the duplicate for as long as the protocol allows (before it would become invalid because of chain randomness). It's not entirely clear what one could do about detection, though. If they follow the protocol and provide genuine proofs, then it would be indistinguishable from honest mining.

Feb 08 '19 05:02 porcuquine

Given the resources expended for POW, it seems entirely plausible that such a network could regularly find collisions (if my analysis is correct). The true value of each collision might actually be very high — since, in the case of as-yet-unreplicated sectors, discovered collisions could be used to replicate the largest available sectors. Therefore, the value of a collision would be limited only by that size — and the network otherwise benefits (because of efficiency) from having that size be as large as possible.

Feb 08 '19 06:02 porcuquine

As a matter of practical efficiency and minimization of ongoing trust between colluders, replica id 'miners' could use fresh identities. When a collision is found, one party could transfer the identity to the other for a one-time price. The receiver would then possess a duplicate sector and could produce his own PoSts with no overhead.

Feb 08 '19 06:02 porcuquine

So my tentative conclusion is [incorrect] ~that free choice of sector id would give rise to a secondary sybil market, in which each replica-sybil's value is approximately equal to that of the largest mineable sector~.

Feb 08 '19 06:02 porcuquine

All that said, it seems that the same technique could be used even without free choice of sector id — by creation of new miners with distinct identity for the sole purpose of cranking out new replica ids. This is a ~good~ argument for somehow limiting creation of miners (either time limiting or imposing a cost — I can think of ways to do that).

Feb 08 '19 06:02 porcuquine

Alternately, we could make procurement of replica ids an expensive operation — either in time/resources, or just by actually charging for them and require they be acquired from the chain. This is not necessarily so unreasonable, since there's no honest reason to stockpile them. A small up-front investment prior to sealing is not more burdensome than collateral.

Feb 08 '19 06:02 porcuquine

We could even treat it as a deposit, refunded when you provide the corresponding proof. That (the up-front payment) would entirely prevent this whole category of attacks, as long as the price was set appropriately.

Feb 08 '19 06:02 porcuquine

See EDIT above. Although my train of thought here was wrong, it is still true that choice of sector id allows grinding on the PoRep as a whole — so any (unknown) attacks depending on how replica id propagates through the replication might still be facilitated. If we are at all concerned about this, the last strategy of allocating IDs from the chain for a deposit might still be worth considering.

Feb 08 '19 09:02 porcuquine

I did not read all the @porcuquine comments. The replicaID must contain randomness from the blockchain, the miner already puts some randomness in (currently public key, but could also be minerId), ideally they shouldn't choose or they could be able to grind storage (it will be expensive!).

Replica ID should just be the miner identity concatenated with the block they are mining on.

Feb 14 '19 16:02 nicola

@nicola minerID || blockhash doesn't work, otherwise they can only make one seal per block

On Thu, Feb 14, 2019, 6:22 AM Nicola [email protected] wrote:

I did not read all the @porcuquine https://github.com/porcuquine comments. The replicaID must contain randomness from the blockchain, the miner already puts some randomness in (currently public key, but could also be minerId), ideally they shouldn't choose or they could be able to grind storage (it will be expensive!).

Replica ID should just be the miner identity concatenated with the block they are mining on.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/filecoin-project/specs/issues/116#issuecomment-463691331, or mute the thread https://github.com/notifications/unsubscribe-auth/ABL4HNjj6X-jBnQIqKWMAynLiCDtesGpks5vNY0vgaJpZM4aS8dp .

Feb 14 '19 17:02 whyrusleeping

FWIW, replicaID currently includes the sectorID.

Feb 15 '19 01:02 porcuquine

@porcuquine I just read your comments, youre basically saying "if miners can find hash function collisions, they can pull off an attack". To which I respond: If you can generate collisions, the hash function is broken.

If we are assuming that the hash function will not be broken anyways, then i see no problem in allowing full miner selection over the ID.

In addition, I believe that it is a hard requirement that miners have full selection over the sector IDs (I could bring this down to full selection within some range) in order for the protocol to work scalably.

Feb 21 '19 02:02 whyrusleeping

@whyrusleeping I actually agree with you and concluded the same in my edit.

This doesn't apply to the risks of grinding to generate a fake Seal proof. Although this doesn't add to the risk during a generation attack, allowing an infinite sampling of replica IDs does (I believe) reduce attempts to generate a single fake PoRep for a fixed commD to being as easy (or hard) as for any other generation attack.

While this doesn't affect security of the power table, it does reduce the security of storage market redundancy to 'only' what is also provided against generation attacks. Of course, we assume this is sufficient, but I do think this is a reduction — so we should at least be clear about it.

Feb 21 '19 16:02 porcuquine

Additional interesting design constraint to consider: Do we ever allow sector ID reuse? If we do not, then we have to track which IDs have been used, in order to disallow their use in the future.

If we do allow for ID reuse, then I think a fairly straightforward potential solution would be to use the persistent vector, but modify it to allow for sparse regions. Such a data structure should satisfy all the design constraints. I'll look into this more soon.

Feb 27 '19 02:02 whyrusleeping

I started to implement a sparse persistent vector here, and realized that many pieces are very similar to the HAMT. Particularly, when making it sparse, you end up wanting to use the same bitfield compression prefix count optimizations that you use for HAMTs. so now i'm thinking, maybe we use the HAMT still, but swap out the functionality that turns a key into an index at each height.

Mar 01 '19 22:03 whyrusleeping

Some notes i sketched on this idea: data structures -5 data structures -6

Mar 01 '19 23:03 whyrusleeping

@porcuquine thoughts?

Mar 01 '19 23:03 whyrusleeping

For context, could you please clarify whether this is a data structure that will appear literally in the on-chain bytes (like the encoding of messages), or is it a private implementation detail on which different nodes need not necessarily agree. In the latter case, the issue might belong in an implementation repo rather than specs.

Mar 10 '19 20:03 anorth

note that the SNARK must be able to signal what are the failing sectors, so either this is snark friendly or the verifier will need to compute the inputs to verify the snark later on (in other words: either "compressing" the sectors is cheap or "decompressing" the sector must be cheap and done out of the snark)

Mar 11 '19 12:03 nicola

@whyrusleeping After a few days of kicking this around, I don't see any big problems with your AMT and haven't come up with any incremental improvements. Either might come as more time passes, but in accordance with what we discussed earlier, you can take this as a first vote of confidence toward solidifying this design.

Mar 13 '19 20:03 porcuquine

specs specs copied to clipboard

Sector set storage

Requirements

Nice to haves

Does the HAMT work?

specs
specs copied to clipboard