hyperdrive
Feature proposal: Lazy Merkle trees
Consider a use case where you can easily get the list of files and the sizes of each file, but downloading all of the files up front would be expensive. One example is a google drive: you can relatively easily get the list of all files + file sizes, but having to download every file just to hash it is a pain.
For this use case, what if hyperdrive/hypercore supported 'lazy' files? The idea is that you could create a hyperdrive that only has the file metadata, but does not yet have the content hashes for the file contents. However, if a peer requested a byte range, the writer would first populate that part of the merkle tree, return the new tree nodes to the peer, and then the peer could request those pieces.
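Very roughly, a lazy entry might look something like this (purely illustrative; none of this is an existing hyperdrive/hypercore structure):

```js
// Illustrative sketch only -- not an existing hyperdrive/hypercore API.
// A "lazy" metadata entry: name and length are known up front, but the content
// feed blocks (and their hashes) don't exist until a peer asks for that range.
const lazyEntry = {
  name: 'movie.mp4',
  length: 1318224896,   // byte length reported by the remote source
  blockSize: 65536,     // chunk size that will be used when the data is hashed
  contentBlocks: null   // filled in lazily, block by block, on first request
}
```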
This way we could run dat services that can expose large datasets without having to download the datasets up front. The difference is that the private key holder has to be a reliable online service that can respond to byte range merkle tree resolution requests on an ongoing basis.
How much work do you think it would be to support this, and do you think it's a good use case?
Is this different than using sparse mode?
@pfrazee sparse mode assumes you have the content blocks up front to generate the full merkle tree. this would be a way of having the metadata feed without any actual content hashes yet. so you can for example mount an fs with filenames and file lengths but the actual content of the files has not been read, even by the producer. this would allow the chunks to be lazily read (i.e. lazy merkle tree construction) as long as the producer is available to query interactively.
I like this idea
in sparse mode afaik the content hashes are known beforehand
@maxogden is the use case to share a big directory faster, since you wanna avoid content hashing?
@mafintosh use case is to share a remote directory. e.g. share a 100gb google drive from my laptop. the google drive isn't synced on my laptop, but the metadata is. so i wanna use my laptop to share the dat for the google drive before i have downloaded the content for the google drive.
- list all files in google drive (dont download them)
- create dat, add all file metadata to dat (no content feed yet, just metadata including filenames and byte ranges)
- share dat with someone
- that person wants movie.mp4 bytes 5000-6000. they ask me for the merkle tree for that region
- i go to google drive to get the pieces for that region, hash them, generate the merkle tree. send it back to them
- now that they have the hashes, they ask me for the content. i send it to them (steps 5 and 6 could prob be merged; rough sketch of the exchange below)
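A rough sketch of what the producer side of that exchange could look like (all function names here are hypothetical stand-ins, not real hyperdrive calls):

```js
// Hypothetical producer-side handler for the byte-range resolution steps above.
// `fetchRange` would hit the remote source (e.g. the google drive API) with a
// byte-range read; `appendToContentFeed` stands in for whatever actually hashes
// the chunks and adds the resulting merkle tree nodes to the content feed.
const BLOCK_SIZE = 65536

async function resolveRange (file, start, end, { fetchRange, appendToContentFeed }) {
  const firstBlock = Math.floor(start / BLOCK_SIZE)
  const lastBlock = Math.floor((end - 1) / BLOCK_SIZE)

  const blocks = []
  for (let i = firstBlock; i <= lastBlock; i++) {
    const from = i * BLOCK_SIZE
    const to = Math.min(from + BLOCK_SIZE, file.length)
    blocks.push(await fetchRange(file, from, to))   // one remote read per block
  }

  // Hashing + tree node generation happens here; the peer gets the new tree
  // nodes back first, then requests (or is directly sent) the block data.
  return appendToContentFeed(file, firstBlock, blocks)
}
```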
@maxogden ah okay. use case makes sense. let me think about it for a bit.
You'd be giving up history of the content hashes doing it this way, and it kind of removes the idea of content integrity. How would this work for multi writers? Do you imagine this is only useful for one writer?
> You'd be giving up history of the content hashes doing it this way, and it kind of removes the idea of content integrity
I think all the guarantees are still in place; this just lets you create a dat before reading the data. Assuming you trust the data provider in the first place, it just gives them the flexibility to share metadata without having to hash the content yet. Hashing and versioning still happen at the point they share the data, so all content integrity checks will still happen (we still use merkle trees). And versioning still works the same, it just starts from the point at which the data is requested, as opposed to now, where it starts when the data is shared.
> How would this work for multi writers? Do you imagine this is only useful for one writer?
Unless it's changed recently, my understanding of the multi writer design is that you'd have 1 author per feed (1 keypair per feed), so every copy of the repo is basically a fork. In this scenario there would be a conflict if two forks have differing hashes for the same byte range, but that's the same conflict scenario as if you forked and edited a file, so I don't think it's an entirely new concept.
Okay, technically the only tricky part about this use case is that it only writes parts of the file to the content feed when lazily adding it, instead of adding the entire file. I'll think about this part a bit more and see if I can come up with a solution.
Put another way: if a user first asks for the range 0-5000 and then asks for 2500-7500, it would need to know that it should only add 5000-7500 on the second request, and it would need to put enough metadata in there for clients to figure out the offsets of the chunks.
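Something like the following bookkeeping could cover that overlap handling (hypothetical names, just to make the idea concrete):

```js
// Track which blocks of a lazy file have already been hashed and appended, so
// an overlapping request (0-5000 then 2500-7500) only adds the missing part.
class LazyFileState {
  constructor () {
    this.written = new Set()   // block indices already in the content feed
  }

  // Return only the block indices in [first, last] that still need to be added.
  missingBlocks (first, last) {
    const missing = []
    for (let i = first; i <= last; i++) {
      if (!this.written.has(i)) missing.push(i)
    }
    return missing
  }

  markWritten (indices) {
    for (const i of indices) this.written.add(i)
  }
}
```

And since blocks would get appended in request order rather than file order, each appended block would also need to record which file offset it covers, so clients can map feed positions back to byte ranges.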
This will basically be a distributed cache. I like that idea!
Another note, I don't anticipate this would be used by the CLI/Desktop app for local files. If files are local then hashing them isn't that slow so we should continue doing what we are doing. It would instead be used by Dat Streams provider modules. So for example I could write a Dat Stream that acts as a proxy between the remote data source and downstream Dat consumers. Using this approach my Dat Stream could act as a caching proxy. Without this I'd have to always download the data to the machine where the Dat Stream is hosted before I could share.
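One way to picture that proxy (sketch only, all names hypothetical): cache the blocks pulled from the remote source so repeat requests from downstream peers never hit the source twice.

```js
// Wrap the remote fetch so blocks already pulled from the source are served
// from a local cache instead of being fetched (and re-hashed) again.
function cachedFetcher (fetchRange, maxBlocks = 1024) {
  const cache = new Map()   // block index -> Buffer

  return async function fetchBlock (file, index, from, to) {
    if (cache.has(index)) return cache.get(index)
    const data = await fetchRange(file, from, to)
    cache.set(index, data)
    if (cache.size > maxBlocks) {
      // crude eviction: drop the oldest inserted block
      cache.delete(cache.keys().next().value)
    }
    return data
  }
}
```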
Would the pre-hash metadata be stored in the metadata hypercore, or somewhere else? If in the hypercore, would this result in duplicate entries? (One entry without the hashes, and then later one entry with them)