hyper-sfw

Externalizing blobs to a separate storage and transfer protocol

Open pfrazee opened this issue 4 years ago • 0 comments

Currently, SFW maintains two separate structures: the filetree/ops and the blobstore. The tree references blobs by their hashes.

In both cases, the input cores write new information and then the index cores merge/copy that information. For the filetree, the index core is provided a computed merge of the ops. For the blobstore, the index core is copying the blobs into its own storage.

There are a few ways in which this is suboptimal:

  • A. Blobs are stored twice: once in the input cores and again in the index cores.
  • B. The index core's Hyperbee cannot enumerate the blobchunks sub without reading all of the blob values, meaning any code which naively attempts to list all entries in the Hyperbee will read the contents of all files in the SFW.
  • C. A program which is attempting to mirror a local filesystem's folder into an SFW would need to have a third copy of all files, which is the local folder copy. This was an issue we experienced with the dat-cli back in 2016.

We could potentially solve these problems by creating a separate blobs storage and protocol. Here's how that would work:

  1. A blob-store interface would be created. The default implementation would store blobs in a folder under their hashes.
  2. A blob-exchange protocol would be added as an extension to hyper's wire protocol. It would send and receive blobs by their hashes.
  3. Input and index cores would no longer record blobs. All blobs would be read from the blob-store and transferred by the blob-exchange.
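
As a rough illustration of step 1, here is a minimal sketch of what a folder-backed blob-store could look like. The `BlobStore` interface, the `FolderBlobStore` class, and the choice of SHA-256 are all assumptions made for this example; the issue does not specify an API or a hash function.

```typescript
import { createHash } from "node:crypto";
import { existsSync, mkdirSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical interface; the names are illustrative and not part of SFW.
interface BlobStore {
  put(data: Uint8Array): string; // returns the blob's hash
  get(hash: string): Uint8Array | null; // null when the blob is absent
  has(hash: string): boolean;
}

// One file per blob, named by its SHA-256 hex digest (assumed hash function).
class FolderBlobStore implements BlobStore {
  private readonly dir: string;

  constructor(dir: string) {
    this.dir = dir;
    mkdirSync(dir, { recursive: true });
  }

  put(data: Uint8Array): string {
    const hash = createHash("sha256").update(data).digest("hex");
    const path = join(this.dir, hash);
    // Content addressing makes writes idempotent: same bytes, same file.
    if (!existsSync(path)) writeFileSync(path, data);
    return hash;
  }

  get(hash: string): Uint8Array | null {
    const path = join(this.dir, hash);
    return existsSync(path) ? readFileSync(path) : null;
  }

  has(hash: string): boolean {
    return existsSync(join(this.dir, hash));
  }
}

// Usage: a filetree entry would keep only the hash, never the bytes.
const store = new FolderBlobStore(mkdtempSync(join(tmpdir(), "blobs-")));
const hash = store.put(new TextEncoder().encode("hello"));
const treeEntry = { path: "/hello.txt", blob: hash };
```

Because the store is keyed by content hash, the input and index cores can share a single copy of each blob, which is what removes the duplication described in A.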

This would solve A and B immediately, reducing the overall space usage of the protocol. To solve C, a "local blob-store" interface would have to be created. Presumably it would maintain an index of the folder which maps hashes to their locations in the local folder. ("Historic" blobs may also need to be stored in the index to support historic reads.)
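
A sketch of the "local blob-store" idea, under the assumption that it simply scans the mirrored folder and keeps a hash-to-path map. `LocalFolderIndex` and its methods are hypothetical names, and SHA-256 is again an assumed hash function.

```typescript
import { createHash } from "node:crypto";
import { existsSync, mkdtempSync, readdirSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

function sha256(data: Uint8Array): string {
  return createHash("sha256").update(data).digest("hex");
}

// Hypothetical "local blob-store": serves blobs straight from the mirrored
// folder instead of keeping another copy of every file.
class LocalFolderIndex {
  private readonly root: string;
  private readonly locations = new Map<string, string>(); // hash -> file path

  constructor(root: string) {
    this.root = root;
  }

  // Re-hash every file under the root and rebuild the hash -> path map.
  rescan(): void {
    this.locations.clear();
    this.walk(this.root);
  }

  private walk(dir: string): void {
    for (const entry of readdirSync(dir, { withFileTypes: true })) {
      const path = join(dir, entry.name);
      if (entry.isDirectory()) this.walk(path);
      else if (entry.isFile()) this.locations.set(sha256(readFileSync(path)), path);
    }
  }

  get(hash: string): Uint8Array | null {
    const path = this.locations.get(hash);
    if (path === undefined || !existsSync(path)) return null;
    const data = readFileSync(path);
    // The file may have changed since the scan; verify before serving.
    return sha256(data) === hash ? data : null;
  }
}

// Usage: the blob is readable until the file is edited in place.
const root = mkdtempSync(join(tmpdir(), "mirror-"));
writeFileSync(join(root, "a.txt"), "first version");
const index = new LocalFolderIndex(root);
index.rescan();
const oldHash = sha256(new TextEncoder().encode("first version"));
const beforeEdit = index.get(oldHash); // bytes of a.txt
writeFileSync(join(root, "a.txt"), "second version");
const afterEdit = index.get(oldHash); // null: the old bytes are gone
```

The last two lines show why "historic" blobs are a concern: once the file is overwritten, the old hash can no longer be served from the folder, so those bytes would have to be kept somewhere else.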

The most significant tradeoff of this solution is that it extends the Hyper protocol with custom behavior, meaning a "default" hyper reader would not be able to read & exchange blobs.
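
To make the custom behavior concrete, here is a minimal sketch of the messages a blob-exchange might send. The `want` / `blob` / `missing` message shapes and the helper functions are invented for illustration, SHA-256 is an assumed hash function, and the wiring into hyper's wire protocol is deliberately left out.

```typescript
import { createHash } from "node:crypto";

// Hypothetical message shapes for a blob-exchange; none of these names come
// from SFW or from the Hyper protocol itself.
type BlobMessage =
  | { type: "want"; hash: string }
  | { type: "blob"; hash: string; data: Uint8Array }
  | { type: "missing"; hash: string };

function sha256(data: Uint8Array): string {
  return createHash("sha256").update(data).digest("hex");
}

// Responder side: answer a "want" from whatever blobs are held locally.
function handleWant(blobs: Map<string, Uint8Array>, msg: BlobMessage): BlobMessage | null {
  if (msg.type !== "want") return null; // only requests get a reply
  const data = blobs.get(msg.hash);
  return data !== undefined
    ? { type: "blob", hash: msg.hash, data }
    : { type: "missing", hash: msg.hash };
}

// Requester side: accept a blob only if its bytes match the requested hash.
function acceptBlob(wanted: string, msg: BlobMessage): Uint8Array | null {
  if (msg.type !== "blob" || msg.hash !== wanted) return null;
  return sha256(msg.data) === wanted ? msg.data : null;
}

// Usage: one request/response round trip between two in-memory peers.
const bytes = new TextEncoder().encode("file contents");
const held = new Map<string, Uint8Array>([[sha256(bytes), bytes]]);
const reply = handleWant(held, { type: "want", hash: sha256(bytes) });
const received = reply !== null ? acceptBlob(sha256(bytes), reply) : null;
```

A peer that does not implement these messages would still replicate the filetree but would see only hashes, which is the tradeoff described above.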

pfrazee · Nov 16 '21 19:11